From vlad at dev.mellanox.co.il  Sun Jul  1 02:26:28 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 01 Jul 2007 12:26:28 +0300
Subject: [ofa-general] Re: [GIT PULL 00/10] ofed_1_2 - Chelsio Bug Fixes
In-Reply-To: <20070629212752.18132.98709.stgit@dell3.ogc.int>
References: <20070629212752.18132.98709.stgit@dell3.ogc.int>
Message-ID: <46877344.3050108@dev.mellanox.co.il>

Steve Wise wrote:
> Vlad,
> 
> The following patches are bug fixes to the rdma and low level chelsio
> drivers for ofed-1.2.  All of these patches are upstream in either 2.6.22
> or pending for 2.6.23 and need to  be pulled into ofed-1.2.
> 
> I plan to make these available to chelsio customers either through a
> series of patches, or a full ofa_kernel tarball.
> 
> Please pull these from:
> 
> http://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2
> 
> Thanks,
> 
> Steve.
> 

Done,

Regards,
Vladimir


From vlad at lists.openfabrics.org  Sun Jul  1 02:44:26 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun,  1 Jul 2007 02:44:26 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070701-0200 daily build status
Message-ID: <20070701094427.29D4DE60854@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From mst at dev.mellanox.co.il  Sun Jul  1 04:39:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 14:39:54 +0300
Subject: [ofa-general] round_jiffies()
Message-ID: <20070701113954.GM19343@mellanox.co.il>

Hi,
	Wrt the recent OLS "Getting more from tickless" talk,
	I started wondering whether we should be using round_jiffies
	for stale connection detection work.

	Ideas?
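
	For context, a minimal userspace sketch of the rounding idea behind
	round_jiffies follows. It is a simplified model, not kernel code: the
	per-CPU skew and the "never return a time in the past" check of the real
	helper are left out, and HZ = 1000 is an assumption. The point is that
	timers which do not need an exact expiry can be moved to a whole-second
	boundary so that they fire together.

```python
HZ = 1000  # assumed tick rate; the real value depends on kernel configuration


def round_jiffies_model(j: int, hz: int = HZ) -> int:
    """Simplified model: round an absolute jiffies value to a whole second.

    A value in the first quarter of a second is rounded down, anything
    later is rounded up. Per-CPU skew and past-time checks are omitted.
    """
    rem = j % hz
    if rem < hz // 4:
        return j - rem        # just past a second boundary: round down
    return j - rem + hz       # otherwise round up to the next second


# Two hypothetical stale-connection timers, 300 ticks apart,
# end up on the same second boundary.
expiries = [round_jiffies_model(5400), round_jiffies_model(5700)]
```

	With both expiries on the same boundary, one wakeup can serve both timers.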


-- 
MST


From glebn at voltaire.com  Sun Jul  1 05:16:23 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 15:16:23 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070630220530.GB7554@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
Message-ID: <20070701121623.GD17699@minantech.com>

On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S. Tsirkin wrote:
> > Quoting Roland Dreier <rdreier at cisco.com>:
> > Subject: Re: [PATCH RFC] sharing userspace IB objects
> > 
> >  > This is not directly related to SRC: this is an effort
> >  > to make it possible to share QPs, CQ etc across processes
> >  > in the same way as they can be currently shared across threads.
> >  > So assuming that we want multiple processes to post to
> >  > the same QP, how can we support this?
> > 
> > This looks like a lot of work for an unknown gain.  Who is going to
> > really use this?  ie is it worth the trouble?
> 
> I think Dror is the best person to answer this.
> Dror, could you please explain the need for shared send queue?
> 
SSQ is needed for scalability, no need to explain this (by the way, RD
is needed for the same reason too. What is Mellanox's plan to support it?
It is part of the spec after all, so why invent new shiny stuff when it
is still possible to achieve better scalability without it?).
We are discussing your implementation proposal, and in my opinion it doesn't
fit application needs. I may be wrong here, so if there is somebody who
thinks that sending random completions to random processes is the best idea
ever, and that the absence of this "feature" is the only thing that stops him
from IB adoption, he may chime in here and voice his opinion.

Looking at Dror's slides, on slide 6 "Scalable Reliable Connection" I
see that the wire protocol is extended to send the DST SRQ as part of a
header. The receiver side then puts the completion on the appropriate CQ
according to this field. Does your proposal address this? How? Who will put
this additional data on the wire (HW, libibverbs, or maybe the app)? Also, I
don't see this in Dror's slides, but completions of local operations should
be demultiplexed to the appropriate CQ too. A WQE may contain an additional
field, for instance, that tells where to put a completion. Once again, who
will do the demux in your proposal (HW, libibverbs or the app)? The right
answer is most certainly HW in both cases, so will Hermon support this?
Or maybe you want to demultiplex everything inside libibverbs? In this
case I want to see a design for this (preferably with performance analysis).

--
			Gleb.


From glebn at voltaire.com  Sun Jul  1 05:19:48 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 15:19:48 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070630220530.GB7554@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
Message-ID: <20070701121948.GE17699@minantech.com>

On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S. Tsirkin wrote:
> > Quoting Roland Dreier <rdreier at cisco.com>:
> > Subject: Re: [PATCH RFC] sharing userspace IB objects
> > 
> >  > This is not directly related to SRC: this is an effort
> >  > to make it possible to share QPs, CQ etc across processes
> >  > in the same way as they can be currently shared across threads.
> >  > So assuming that we want multiple processes to post to
> >  > the same QP, how can we support this?
> > 
> > This looks like a lot of work for an unknown gain.  Who is going to
> > really use this?  ie is it worth the trouble?
> 
> I think Dror is the best person to answer this.
And, by the way, gdror at lists.openfabrics.org bounces for me.

--
			Gleb.


From mst at dev.mellanox.co.il  Sun Jul  1 07:08:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 17:08:08 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701121623.GD17699@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
Message-ID: <20070701140808.GS19343@mellanox.co.il>

> Looking at the Dror's slides on slide 6 "Scalable Reliable Connection" I
> see that wire protocol is extended to send DST SRQ as part of a header.
> Receiver side then puts completion to appropriate CQ according this
> field. Have you proposition address this? How? Who will put this
> additional data on a wire (HW or libibverbs may be app)?

This is SRC, which is a hardware extension, and is mostly an orthogonal issue.
My proposal only deals with SSQ for now.
For SRC we'll need to define new "SRC domain" objects and an API to share them
between apps. I expect that we'll be able to basically use the same API as for
sharing other objects.

It is true that for best scalability we probably need both SSQ and SRC,
but let's try to focus on sharing APIs for now.

> Also I don't see this in Dror's slide, but completion of local operation should
> be demultiplexed to appropriate CQ too. WQE may contain additional field, for
> instance, that will tell where to put a completion. Once again who will do the
> demux in you proposition (HW, libiverbs or app)? The right answer is most
> certainly HW in both cases so will Hermon support this?  Or may be you want to
> demultiplex everything inside libibvers? In this case I want to see design of
> this (preferably with performance analysis).

Since hardware can not do this demultiplexing, I think the right thing
is to do this inside MPI, encoding the necessary data in the WRID field.
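
A minimal sketch of what encoding the owner in the WRID could look like,
assuming the 64-bit work request ID is split into a process rank and a
per-process sequence number. The 16/48 bit split and the function names are
hypothetical, chosen only for illustration; nothing in the proposal fixes them.

```python
RANK_BITS = 16                    # hypothetical: bits reserved for the owner rank
SEQ_BITS = 64 - RANK_BITS         # remaining bits carry a per-process sequence
SEQ_MASK = (1 << SEQ_BITS) - 1


def encode_wr_id(rank: int, seq: int) -> int:
    """Pack the owning rank and a sequence number into one 64-bit value."""
    if not 0 <= rank < (1 << RANK_BITS):
        raise ValueError("rank does not fit in RANK_BITS")
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("seq does not fit in SEQ_BITS")
    return (rank << SEQ_BITS) | seq


def decode_wr_id(wr_id: int) -> tuple:
    """Recover (rank, seq) from a value produced by encode_wr_id."""
    return wr_id >> SEQ_BITS, wr_id & SEQ_MASK


wr_id = encode_wr_id(rank=3, seq=42)
owner, seq = decode_wr_id(wr_id)
```

Whoever polls the shared CQ can then read the owner from the completion's
WRID without any extra lookup table.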

-- 
MST


From gdror at mellanox.co.il  Sun Jul  1 09:27:24 2007
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Sun, 1 Jul 2007 19:27:24 +0300
Subject: [ofa-general] RE: Re: [PATCH RFC] sharing userspace IB objects
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com><20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com><20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>

 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Gleb Natapov
> Sent: Sunday, July 01, 2007 3:16 PM
> To: Michael S. Tsirkin
> Cc: Roland Dreier; gdror at lists.openfabrics.org; 
> openib-general at openib.org
> Subject: Re: Re: [PATCH RFC] sharing userspace IB objects
> 
> On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S. Tsirkin wrote:
> > > Quoting Roland Dreier <rdreier at cisco.com>:
> > > Subject: Re: [PATCH RFC] sharing userspace IB objects
> > > 
> > >  > This is not directly related to SRC: this is an effort
> > >  > to make it possible to share QPs, CQ etc across processes
> > >  > in the same way as they can be currently shared across threads.
> > >  > So assuming that we want multiple processes to post to
> > >  > the same QP, how can we support this?
> > > 
> > > This looks like a lot of work for an unknown gain.  Who is going to
> > > really use this?  ie is it worth the trouble?
> > 
> > I think Dror is the best person to answer this.
> > Dror, could you please explain the need for shared send queue?
> > 
> SSQ is needed for scalability, no need to explain this (by 
> the way RD is needed for the same reason too. What's Mellanox 
> plan to support it?

RD is not supported in hardware today. Implementing RD is extremely
complicated. To solve the scalability issues of MPI-like applications,
we believe that SRC and SSQ are the right solutions. They are much simpler
to implement in both software and hardware. By MPI-like I refer
to applications that have some level of trust between two processes of
the same application. RD also has some performance issues, as it only
supports one message in the air. Those performance issues are solved
by design in SRC/SSQ.

> It is a part of Spec after all, so why to invent new shiny 
> staff when it is still possible to achieve better scalability 
> without them).

It's truly about complexity. And as I mentioned in the OFA meeting at
Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.

> We are discussing you implementation proposal and in my 
> opinion it doesn't fit application needs. I may be wrong 
> here, so if there is somebody who things that sending random 
> completion to random processes it the best idea ever and 
> absence of this "feature" is the only thing that stops him 
> from IB adoption he may chime in here and voice his opinion.

Your input about how to demultiplex send completions on SSQ is
valuable. Unfortunately it is not supported in the current generation.
What I can suggest here is not new on this thread, but:
1) all pollers see the same CQ; only the poller to which a completion
      belongs takes it out of the CQ
2) only one process polls the CQ; if a completion doesn't belong to the
      poller, the poller will put it in a SW queue for the right process.
      The other processes just poll on the SW queue
3) the SQ will have a "completed WQE index" reported. Everybody can
     look at it and determine how many WQEs completed. This one has
     some cons because the CQ is not shared here... need to bake this
     one more.
If we wrap one of these into the right API, once there is HW available
that can do the SSQ CQ demultiplexing, it can work without any API change.
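
Option 2 above can be sketched roughly as follows. This is only a model under
stated assumptions: the hardware CQ is represented by a plain deque of 64-bit
WRIDs, the owner is assumed to sit in the upper 16 bits of the WRID, and the
names (dispatch_completions, owner_of) are hypothetical. A real implementation
would need shared memory and locking, which are not shown.

```python
from collections import deque


def owner_of(wr_id: int) -> int:
    """Hypothetical convention: the upper 16 bits of the WRID name the owner."""
    return wr_id >> 48


def dispatch_completions(hw_cq: deque, sw_queues: dict) -> int:
    """Drain the shared CQ model into per-process software queues.

    Returns the number of completions moved. Only the single polling
    process is expected to call this.
    """
    moved = 0
    while hw_cq:
        wr_id = hw_cq.popleft()
        sw_queues[owner_of(wr_id)].append(wr_id)
        moved += 1
    return moved


# Three completions: two owned by process 0, one by process 1.
hw_cq = deque([(0 << 48) | 1, (1 << 48) | 1, (0 << 48) | 2])
sw_queues = {0: deque(), 1: deque()}
moved = dispatch_completions(hw_cq, sw_queues)
```

Each non-polling process then only looks at its own entry in sw_queues, so
its progress depends on how often the poller runs.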

> 
> Looking at the Dror's slides on slide 6 "Scalable Reliable 
> Connection" I see that wire protocol is extended to send DST 
> SRQ as part of a header.
> Receiver side then puts completion to appropriate CQ 
> according this field. Have you proposition address this? How? 

SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
unfortunately. But I think that with the right API we can abstract this,
and later on have better performance for it.

> Who will put this additional data on a wire (HW or libibverbs 
> may be app)? Also I don't see this in Dror's slide, but 
> completion of local operation should be demultiplexed to 
> appropriate CQ too. WQE may contain additional field, for 
> instance, that will tell where to put a completion. Once 
> again who will do the demux in you proposition (HW, libiverbs 
> or app)? The right answer is most certainly HW in both cases 
> so will Hermon support this?
> Or may be you want to demultiplex everything inside 
> libibvers? In this case I want to see design of this 
> (preferably with performance analysis).

One thing to mention. The way I see it is according to the order of the
slides. First get SRC going, improve the scalability. Then SSQ can be
added to further improve scalability. In other words I am suggesting
that maybe we can worry about the SSQ deficiencies a bit later :)

> 
> --
> 			Gleb.


From glebn at voltaire.com  Sun Jul  1 09:36:15 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 19:36:15 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701140808.GS19343@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<20070701140808.GS19343@mellanox.co.il>
Message-ID: <20070701163615.GA31673@minantech.com>

On Sun, Jul 01, 2007 at 05:08:08PM +0300, Michael S. Tsirkin wrote:
> > Looking at the Dror's slides on slide 6 "Scalable Reliable Connection" I
> > see that wire protocol is extended to send DST SRQ as part of a header.
> > Receiver side then puts completion to appropriate CQ according this
> > field. Have you proposition address this? How? Who will put this
> > additional data on a wire (HW or libibverbs may be app)?
> 
> This is SRC, which is a hardware extension, and is mostly an orthogonal issue.
I don't agree. You don't usually create a QP only for sends. And indeed, if
we look at slide 8 "Shared Send Queue" we see that demultiplexing of
receives and the additional header field are there. Also, slide 11 defines
the SSQ API on top of the SRC API, and it makes perfect sense. I don't see
anywhere in these slides that SSQ is mentioned on its own without SRC.

> My proposal only deals with SSQ for now.
> For SRC we'll need to define a new "SRC domain" objects and API to share them
> between apps. I expect that we'll be able to basically use the same API as for
> sharing other objects.
So lack of HW support for SRC stops you from implementing it, but lack
of HW support for SSQ doesn't really bother you at all.

> 
> It is true that for best scalability we probably need both SSQ and SRC,
> but let's try to focus on sharing APIs for now.
The sharing API is a small and boring detail. We need to understand application
needs and design for them.

> 
> > Also I don't see this in Dror's slide, but completion of local operation should
> > be demultiplexed to appropriate CQ too. WQE may contain additional field, for
> > instance, that will tell where to put a completion. Once again who will do the
> > demux in you proposition (HW, libiverbs or app)? The right answer is most
> > certainly HW in both cases so will Hermon support this?  Or may be you want to
> > demultiplex everything inside libibvers? In this case I want to see design of
> > this (preferably with performance analysis).
> 
> Since hardware can not do this demultiplexing, I think the right thing
> is to do this inside MPI, encoding the necessary data in the WRID field.
> 
It translates to: "Marketing wants new TLAs to be implemented fast. We don't
have HW support for that, so we implement something to get rid of the
marketing guys, the rest is not our problem, and you MPI folk go deal
with that mess (you are already used to it anyway)"

--
			Gleb.


From mst at dev.mellanox.co.il  Sun Jul  1 12:00:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 22:00:30 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701163615.GA31673@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<20070701140808.GS19343@mellanox.co.il>
	<20070701163615.GA31673@minantech.com>
Message-ID: <20070701190030.GA12737@mellanox.co.il>

> > My proposal only deals with SSQ for now.
> > For SRC we'll need to define a new "SRC domain" objects and API to share them
> > between apps. I expect that we'll be able to basically use the same API as for
> > sharing other objects.
>
> So lack of HW support for SRC stops you from implementing it, but lack
> of HW support for SSQ don't really bother you at all.

The proposal lets you share any object across processes, same as we can do
across threads at the moment, potentially with any hardware that supports IB
spec 1.2. This can be used for both send and receive queues, CQs, etc.

SRC is a separate hardware extension. Speaking about "SRC without
hardware support" simply does not make sense to me.


-- 
MST


From glebn at voltaire.com  Sun Jul  1 12:05:16 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 22:05:16 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
Message-ID: <20070701190516.GB31673@minantech.com>

On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> > SSQ is needed for scalability, no need to explain this (by 
> > the way RD is needed for the same reason too. What's Mellanox 
> > plan to support it?
> 
> RD is not supported in hardware today. Implementing RD is extremely 
> complicated. To solve the scalability issues on MPI like applications
> we believe that SRC and SSQ are the right solutions. It is much simpler
> for implementation by both software and hardware. By MPI-like I refer
> to applications that have some level of trust between two processes of
> the
> same application. RD also has some performance issues as it only 
> supports one message in the air. Those performance issues are solved
> by design in SRC/SSQ.
> 
Didn't know about this RD limitation. Is this a shortcoming of the IB spec or a
general limitation of reliable datagram? RD looks much nicer to me than SRC/SSQ.

> > It is a part of Spec after all, so why to invent new shiny 
> > staff when it is still possible to achieve better scalability 
> > without them).
> 
> It's truly about complexity. And as I mentioned in OFA meeting at
> Sonoma, 
> Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
> 
> > We are discussing you implementation proposal and in my 
> > opinion it doesn't fit application needs. I may be wrong 
> > here, so if there is somebody who things that sending random 
> > completion to random processes it the best idea ever and 
> > absence of this "feature" is the only thing that stops him 
> > from IB adoption he may chime in here and voice his opinion.
> 
> Your input about how to demultiplex send completions on SSQ is 
> valuable. Unfortunately it is not supported in the current generation.
> What I can suggest here is, not new on this thread, but:
> 1) all pollers see the same CQ, only the poller that sees the completion
> that
>       belongs to takes it out of the CQ
Progress of one process depends on all other processes on the same node. Not
good at all.

> 2) only one process polls the CQ, if it doesn't belong to the poller,
> the
>       poller will put it in a SW queue to the right process. The other 
>       processes just poll on the SW queue
Not good, for the same reason.

As a variant, each process can poll both the HW CQ and its SW CQ; if a
completion from the HW CQ belongs to another process, it puts it on the
appropriate SW CQ. I don't think that a reasonable API would require such
effort from applications (and I am not even talking about all the locking
overhead and cache bouncing that will result from such an implementation, but
latency will be bad, that's for sure).

> 3) the SQ will have a "completed WQE index" reported. Everybody can
>      look at it and determine how many WQEs completed. This one has
>      some cons because the CQ is not shared here... need to bake this 
>      one more.
And where will the application get the WC? Or should it maintain its own queue
of WQEs?
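
To make the question concrete, a rough sketch of what "maintaining its own
queue of WQEs" might involve for option 3. It assumes the reported completed
WQE index is monotonically increasing (no wrap-around) and that WQEs complete
in posting order; the class and method names are hypothetical and not part of
any proposed API.

```python
from collections import deque


class PostedTracker:
    """Per-process record of the WQEs this process posted to a shared SQ."""

    def __init__(self):
        self._pending = deque()   # (sq_index, wr_id) in posting order

    def record_post(self, sq_index: int, wr_id) -> None:
        """Remember which SQ slot a locally posted work request landed in."""
        self._pending.append((sq_index, wr_id))

    def reap(self, completed_index: int) -> list:
        """Return wr_ids of own WQEs at or below the reported completed index."""
        done = []
        while self._pending and self._pending[0][0] <= completed_index:
            done.append(self._pending.popleft()[1])
        return done


tracker = PostedTracker()
tracker.record_post(0, "a")
tracker.record_post(3, "b")   # slots 1 and 2 were posted by other processes
tracker.record_post(5, "c")
first = tracker.reap(3)       # completed index as read from the shared SQ
```

Note that this yields only the wr_ids, not a full WC with status, so the
question of where the application gets the WC remains open.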

> If we wrap one of these into the right API, once there is HW available
> that 
> can do the SSQ CQ demultiplexing, it can work without any API change. 
> 
That is something I don't see in proposed API.

> > 
> > Looking at the Dror's slides on slide 6 "Scalable Reliable 
> > Connection" I see that wire protocol is extended to send DST 
> > SRQ as part of a header.
> > Receiver side then puts completion to appropriate CQ 
> > according this field. Have you proposition address this? How? 
> 
> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
> unfortunately.
Is it possible to add this only with FW upgrade?

> But I think that with the right API we can abstract this, and later on
> have better performance for it.
> 
> > Who will put this additional data on a wire (HW or libibverbs 
> > may be app)? Also I don't see this in Dror's slide, but 
> > completion of local operation should be demultiplexed to 
> > appropriate CQ too. WQE may contain additional field, for 
> > instance, that will tell where to put a completion. Once 
> > again who will do the demux in you proposition (HW, libiverbs 
> > or app)? The right answer is most certainly HW in both cases 
> > so will Hermon support this?
> > Or may be you want to demultiplex everything inside 
> > libibvers? In this case I want to see design of this 
> > (preferably with performance analysis).
> 
> One thing to mention. The way I see it is according to the order of the
> slides. First get SRC going, improve the scalability. Then SSQ can be
> added to further improve scalability. In other words I am suggesting
> that maybe we can worry with the SSQ deficiencies a bit later :)
> 
That is my point! Let's do it once, let's do it right, and let's do it when HW
is ready :)

--
			Gleb.


From glebn at voltaire.com  Sun Jul  1 12:08:15 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 22:08:15 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701190030.GA12737@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<20070701140808.GS19343@mellanox.co.il>
	<20070701163615.GA31673@minantech.com>
	<20070701190030.GA12737@mellanox.co.il>
Message-ID: <20070701190815.GC31673@minantech.com>

On Sun, Jul 01, 2007 at 10:00:30PM +0300, Michael S. Tsirkin wrote:
> > > My proposal only deals with SSQ for now.
> > > For SRC we'll need to define a new "SRC domain" objects and API to share them
> > > between apps. I expect that we'll be able to basically use the same API as for
> > > sharing other objects.
> >
> > So lack of HW support for SRC stops you from implementing it, but lack
> > of HW support for SSQ don't really bother you at all.
> 
> The proposal lets you share any object across processes, same as we can do
> across threads at the moment, potentially with any hardware that supports IB
> spec 1.2. This can be used for both send and receive queues, CQs, etc.
Great. The change is big and the API is complex. What is the use case?

> 
> SRC is a separate hardware extension. Speaking about "SRC without
> hardware support" simply does not make sense to me.
> 
Just like speaking about SSQ without hardware support doesn't make sense
to me. I am glad that we agree on something.

--
			Gleb.


From sean.hefty at intel.com  Sun Jul  1 21:51:36 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Sun, 1 Jul 2007 21:51:36 -0700
Subject: [ofa-general] RE: [GIT PULL] please pull rdma-dev.git for 2.6.23
In-Reply-To: <20070701060953.GG7554@mellanox.co.il>
Message-ID: <000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com>

>>       ib/cm: include HCA ACK delay in local ACK timeout
>
>I have not seen this and archive search does not give me anything

http://lists.openfabrics.org/pipermail/general/2007-May/036657.html

>There were several bugs in the local SA patches that you posted originally,
>and SA cache was enabled by default which we decided was not a good idea.

I'm aware of one bug that you reported.  A fix was posted:

http://lists.openfabrics.org/pipermail/general/2007-June/037234.html

I do not recall any other bugs being reported.  I disagree that enabling the
cache by default is a bad idea, but it is disabled in the patches to merge
upstream.

>Could the latest revision of the patches to be pulled be posted
>to list please?

The patches are available here:

http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=shortlog;h=for-roland

I can repost tomorrow, but I believe that only 3 lines have changed since the
last posting.  Two listed in the patch above, and the change to disable the
cache.

- Sean


From mst at dev.mellanox.co.il  Sun Jul  1 22:11:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 08:11:28 +0300
Subject: [ofa-general] Re: [GIT PULL] please pull rdma-dev.git for 2.6.23
In-Reply-To: <000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com>
References: <20070701060953.GG7554@mellanox.co.il>
	<000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com>
Message-ID: <20070702051116.GB5018@mellanox.co.il>

> I can repost tomorrow, but I believe that only 3 lines have changed since the
> last posting.  Two listed in the patch above, and the change to disable the
> cache.

Please do repost the final version. Thanks.

-- 
MST


From vlad at lists.openfabrics.org  Mon Jul  2 02:44:02 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon,  2 Jul 2007 02:44:02 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070702-0200 daily build status
Message-ID: <20070702094402.7F002E60808@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:


From ogerlitz at voltaire.com  Mon Jul  2 02:46:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 2 Jul 2007 12:46:27 +0300 (IDT)
Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs
Message-ID: <Pine.LNX.4.64.0707021215130.1582@zuben>

Hi Sean,

When the process on the passive side is slow to react to a SIDR
request, a --retry-- sent by the active side CM causes the passive
side CM to send a SIDR REP with IB_SIDR_REJECT status. This makes the
active side CM deliver an IB_CM_SIDR_REP_RECEIVED event with status
IB_SIDR_REJECT, etc.

Later, when the process calls rdma_accept --> ib_send_cm_sidr_rep etc.,
another SIDR REP with status IB_SIDR_SUCCESS is sent, but it's too late.

This seems to be solved by the patch below; however, I see that
for duplicate REQs the code is much more involved, which means I
might be over-simplifying here...

To reproduce the problem and see the effect of the fix, run a passive udaddy,
suspend it, then run an active udaddy, and then resume the passive one. Without
the patch, the active side gets RDMA_CM_EVENT_UNREACHABLE with status 2, whereas
with the patch it works fine.

Or.
----------------------------------

Don't reject SIDR REQ retries which are received before
the passive side had the chance to send SIDR REP.

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: ofa_kernel-1.2/drivers/infiniband/core/cm.c
===================================================================
--- ofa_kernel-1.2.orig/drivers/infiniband/core/cm.c	2007-07-02 12:20:13.000000000 +0300
+++ ofa_kernel-1.2/drivers/infiniband/core/cm.c	2007-07-02 12:35:17.000000000 +0300
@@ -746,7 +746,8 @@ retest:
 		break;
 	case IB_CM_SIDR_REQ_RCVD:
 		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
-		cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
+		if (!err)
+			cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
 		break;
 	case IB_CM_REQ_SENT:
 		ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
@@ -2835,7 +2836,7 @@ static int cm_sidr_req_handler(struct cm
 	cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv);
 	if (cur_cm_id_priv) {
 		spin_unlock_irqrestore(&cm.lock, flags);
-		goto out; /* Duplicate message. */
+		goto out_dup; /* Duplicate message. */
 	}
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
 					sidr_req_msg->service_id,
@@ -2858,6 +2859,9 @@ static int cm_sidr_req_handler(struct cm
 	cm_process_work(cm_id_priv, work);
 	cm_deref_id(cur_cm_id_priv);
 	return 0;
+out_dup:
+	cm_destroy_id(&cm_id_priv->id, -1);
+	return -EINVAL;
 out:
 	ib_destroy_cm_id(&cm_id_priv->id);
 	return -EINVAL;


From gdror at dev.mellanox.co.il  Mon Jul  2 04:00:56 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 14:00:56 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701190516.GB31673@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il>	<20070701121623.GD17699@minantech.com>	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
Message-ID: <4688DAE8.2050205@dev.mellanox.co.il>

Gleb Natapov wrote:
> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>   
>>> SSQ is needed for scalability, no need to explain this (by 
>>> the way RD is needed for the same reason too. What's Mellanox 
>>> plan to support it?
>>>       
>> RD is not supported in hardware today. Implementing RD is extremely 
>> complicated. To solve the scalability issues on MPI like applications
>> we believe that SRC and SSQ are the right solutions. It is much simpler
>> for implementation by both software and hardware. By MPI-like I refer
>> to applications that have some level of trust between two processes of
>> the
>> same application. RD also has some performance issues as it only 
>> supports one message in the air. Those performance issues are solved
>> by design in SRC/SSQ.
>>
>>     
> Didn't know about RD limitation. Is this shortcomings of IB spec or
> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
>   

The RD limitation is part of the IB spec.

>   
>>> It is a part of Spec after all, so why to invent new shiny 
>>> staff when it is still possible to achieve better scalability 
>>> without them).
>>>       
>> It's truly about complexity. And as I mentioned in OFA meeting at
>> Sonoma, 
>> Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
>>
>>     
>>> We are discussing you implementation proposal and in my 
>>> opinion it doesn't fit application needs. I may be wrong 
>>> here, so if there is somebody who things that sending random 
>>> completion to random processes it the best idea ever and 
>>> absence of this "feature" is the only thing that stops him 
>>> from IB adoption he may chime in here and voice his opinion.
>>>       
>> Your input about how to demultiplex send completions on SSQ is 
>> valuable. Unfortunately it is not supported in the current generation.
>> What I can suggest here is, not new on this thread, but:
>> 1) all pollers see the same CQ, only the poller that sees the completion
>> that
>>       belongs to takes it out of the CQ
>>     
> Progress of one process depend on all other processes on the same node. Not
> good at all.
>   
In MPI, it often happens that all processes depend on each other
to make forward progress, one way or another. I am not saying that
this is the ideal solution, but there is some price involved in sharing
resources. You can always upgrade resources for a process that utilizes
them, e.g. if the communication pattern is that each process talks with 4
neighbors, then let it have dedicated unshared QPs.
>   
>> 2) only one process polls the CQ, if it doesn't belong to the poller,
>> the
>>       poller will put it in a SW queue to the right process. The other 
>>       processes just poll on the SW queue
>>     
> Not good of the same reason.
>
> As the variant each process can poll HW CQ and SW CQ if completion from HW CQ
> belong to another process put it on appropriate SW CQ. I don't think
> that reasonable API will require such afford from applications (and I am
> not talking about all locking overhead and cache bouncing that will
> result from such implementation, but latency will be bad that's for sure).
>   
I don't think that polling for SQ completions is in the latency path.
You usually need it in order to free networking buffers. In any case I
understand your point.
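
The software demultiplexing described in option 2, and in the variant quoted
above, can be sketched roughly as follows. This is only an illustration under
stated assumptions: struct sw_wc, struct sw_cq, demux_batch() and the
NPROC/RING_SZ constants are hypothetical names, not part of libibverbs or of
the proposed API. The rings are plain in-process arrays here; a real
implementation would place them in shared memory and would need memory
barriers or atomics around head and tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPROC   4   /* hypothetical: number of processes sharing the SQ */
#define RING_SZ 8   /* hypothetical: per-process ring capacity (power of two) */

/* Hypothetical completion record; real code would carry a struct ibv_wc. */
struct sw_wc {
    uint64_t wr_id;
    int      owner;   /* index of the process that posted the work request */
};

/* Single-producer/single-consumer ring standing in for a shared-memory queue. */
struct sw_cq {
    struct sw_wc slot[RING_SZ];
    unsigned head;    /* next slot to read */
    unsigned tail;    /* next slot to write */
};

/* Append one completion; returns 0 on success, -1 if the ring is full. */
static int sw_cq_push(struct sw_cq *q, const struct sw_wc *wc)
{
    if (q->tail - q->head == RING_SZ)
        return -1;
    q->slot[q->tail % RING_SZ] = *wc;
    q->tail++;
    return 0;
}

/* Remove one completion; returns 1 if one was copied to *out, 0 if empty. */
static int sw_cq_pop(struct sw_cq *q, struct sw_wc *out)
{
    if (q->head == q->tail)
        return 0;
    *out = q->slot[q->head % RING_SZ];
    q->head++;
    return 1;
}

/*
 * Route a batch of completions polled from the shared hardware CQ.
 * Entries owned by `self` are copied to `mine`; all others are forwarded
 * to the owner's software queue. Entries that do not fit in a full ring
 * are counted in *overflow (a real design would need back-pressure).
 * Returns the number of completions kept for `self`.
 */
static int demux_batch(int self, const struct sw_wc *batch, int n,
                       struct sw_cq *queues, struct sw_wc *mine,
                       int *overflow)
{
    int kept = 0;

    for (int i = 0; i < n; i++) {
        int owner = batch[i].owner;

        assert(owner >= 0 && owner < NPROC);
        if (owner == self)
            mine[kept++] = batch[i];
        else if (sw_cq_push(&queues[owner], &batch[i]) != 0)
            (*overflow)++;
    }
    return kept;
}
```

Each process would call demux_batch() on what it polled from the hardware CQ
and then drain its own ring with sw_cq_pop(). The extra copy and the shared
ring state are where the locking and cache-bouncing costs mentioned above
would show up.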
>   
>> 3) the SQ will have a "completed WQE index" reported. Everybody can
>>      look at it and determine how many WQEs completed. This one has
>>      some cons because the CQ is not shared here... need to bake this 
>>      one more.
>>     
> And where application will get WC? Or should it maintain its own queue
> of WQEs?
>   
In this method, each app should have its own queue.
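
A minimal sketch of what such a per-application queue might look like for
option 3. It assumes that each process records the shared-SQ index of every
WQE it posts, that the SQ completes in order, and that the "completed WQE
index" is a 32-bit counter that may wrap. All names (struct pending,
struct pending_fifo, pending_add(), pending_reap(), PENDING_MAX) are
hypothetical and not part of any existing API.

```c
#include <assert.h>
#include <stdint.h>

#define PENDING_MAX 16   /* hypothetical cap on outstanding WQEs per process */

/* One WQE this process posted to the shared SQ. */
struct pending {
    uint32_t sq_index;   /* index the WQE received in the shared SQ */
    uint64_t wr_id;      /* caller's identifier for the work request */
};

/* Per-process FIFO of posted but not yet retired WQEs, in posting order. */
struct pending_fifo {
    struct pending e[PENDING_MAX];
    unsigned head, tail;
};

/* True if index a is at or before index b, tolerating 32-bit wraparound. */
static int idx_le(uint32_t a, uint32_t b)
{
    return (uint32_t)(b - a) < 0x80000000u;
}

/* Remember a posted WQE; returns 0 on success, -1 if the FIFO is full. */
static int pending_add(struct pending_fifo *f, uint32_t sq_index, uint64_t wr_id)
{
    struct pending *p;

    if (f->tail - f->head == PENDING_MAX)
        return -1;
    p = &f->e[f->tail % PENDING_MAX];
    p->sq_index = sq_index;
    p->wr_id = wr_id;
    f->tail++;
    return 0;
}

/*
 * Retire every pending WQE whose index is at or before `completed`,
 * writing up to `max` wr_ids to `out`. Returns the number retired.
 */
static int pending_reap(struct pending_fifo *f, uint32_t completed,
                        uint64_t *out, int max)
{
    int n = 0;

    while (f->head != f->tail && n < max) {
        struct pending *p = &f->e[f->head % PENDING_MAX];

        if (!idx_le(p->sq_index, completed))
            break;               /* this WQE and all later ones still pending */
        out[n++] = p->wr_id;
        f->head++;
    }
    return n;
}
```

The process synthesizes its own completions from the index instead of reading
a WC, so no CQ needs to be shared. Error completions are not covered by this
sketch.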
>   
>> If we wrap one of these into the right API, once there is HW available
>> that 
>> can do the SSQ CQ demultiplexing, it can work without any API change. 
>>
>>     
> That is something I don't see in proposed API.
>
>   
>>> Looking at the Dror's slides on slide 6 "Scalable Reliable 
>>> Connection" I see that wire protocol is extended to send DST 
>>> SRQ as part of a header.
>>> Receiver side then puts completion to appropriate CQ 
>>> according this field. Have you proposition address this? How? 
>>>       
>> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
>> unfortunately.
>>     
> Is it possible to add this only with FW upgrade?
>   
Unfortunately no.
>   
>> But I think that with the right API we can abstract this, and later on
>> have better performance for it.
>>
>>     
>>> Who will put this additional data on a wire (HW or libibverbs 
>>> may be app)? Also I don't see this in Dror's slide, but 
>>> completion of local operation should be demultiplexed to 
>>> appropriate CQ too. WQE may contain additional field, for 
>>> instance, that will tell where to put a completion. Once 
>>> again who will do the demux in you proposition (HW, libiverbs 
>>> or app)? The right answer is most certainly HW in both cases 
>>> so will Hermon support this?
>>> Or may be you want to demultiplex everything inside 
>>> libibvers? In this case I want to see design of this 
>>> (preferably with performance analysis).
>>>       
>> One thing to mention. The way I see it is according to the order of the
>> slides. First get SRC going, improve the scalability. Then SSQ can be
>> added to further improve scalability. In other words I am suggesting
>> that maybe we can worry with the SSQ deficiencies a bit later :)
>>
>>     
> That is my point! Let's do it once lets do it right and lets do it when HW
> is ready :)
>   
SRC is ready in HW, it can be implemented in SW now and will 
significantly help scalability.
We can resume SSQ discussion or other alternatives later on...
> --
> 			Gleb.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>   


From halr at voltaire.com  Mon Jul  2 04:11:58 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jul 2007 07:11:58 -0400
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701190516.GB31673@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
Message-ID: <1183374715.4377.127455.camel@hal.voltaire.com>

On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> > > SSQ is needed for scalability, no need to explain this (by 
> > > the way RD is needed for the same reason too. What's Mellanox 
> > > plan to support it?
> > 
> > RD is not supported in hardware today. Implementing RD is extremely 
> > complicated. To solve the scalability issues on MPI like applications
> > we believe that SRC and SSQ are the right solutions. It is much simpler
> > for implementation by both software and hardware. By MPI-like I refer
> > to applications that have some level of trust between two processes of
> > the
> > same application. RD also has some performance issues as it only 
> > supports one message in the air. Those performance issues are solved
> > by design in SRC/SSQ.
> > 
> Didn't know about RD limitation. Is this shortcomings of IB spec or
> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.

I think Dror is referring to number of messages in flight per EEC and
number of messages in flight per QP being limited to 1 per IBA spec.
Number of messages enqueued per EEC/QP is implementation dependent.

-- Hal

[snip...]


From gdror at dev.mellanox.co.il  Mon Jul  2 05:58:25 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 15:58:25 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <1183374715.4377.127455.camel@hal.voltaire.com>
References: <20070625130604.GH15343@mellanox.co.il>	<20070701121623.GD17699@minantech.com>	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
Message-ID: <4688F671.40408@dev.mellanox.co.il>

Hal Rosenstock wrote:
> On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
>   
>> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>>     
>>>> SSQ is needed for scalability, no need to explain this (by 
>>>> the way RD is needed for the same reason too. What's Mellanox 
>>>> plan to support it?
>>>>         
>>> RD is not supported in hardware today. Implementing RD is extremely 
>>> complicated. To solve the scalability issues on MPI like applications
>>> we believe that SRC and SSQ are the right solutions. It is much simpler
>>> for implementation by both software and hardware. By MPI-like I refer
>>> to applications that have some level of trust between two processes of
>>> the
>>> same application. RD also has some performance issues as it only 
>>> supports one message in the air. Those performance issues are solved
>>> by design in SRC/SSQ.
>>>
>>>       
>> Didn't know about RD limitation. Is this shortcomings of IB spec or
>> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
>>     
>
> I think Dror is referring to number of messages in flight per EEC and
> number of messages in flight per QP being limited to 1 per IBA spec.
> Number of messages enqueued per EEC/QP is implementation dependent.
>
> -- Hal
>   
Correct. The number of messages in flight per EEC is 1 per IB spec.
The fact that IB requires SQ WQEs to complete in order, even if their 
destination is different EECs, makes it pretty challenging to have an 
implementation that can really process more than one message 
simultaneously per QP.

> [snip...]


From mst at dev.mellanox.co.il  Mon Jul  2 06:00:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 16:00:57 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <4688F671.40408@dev.mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
Message-ID: <20070702130057.GB17858@mellanox.co.il>

> Quoting Dror Goldenberg <gdror at dev.mellanox.co.il>:
> Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
> 
> Hal Rosenstock wrote:
> >On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
> >  
> >>On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> >>    
> >>>>SSQ is needed for scalability, no need to explain this (by 
> >>>>the way RD is needed for the same reason too. What's Mellanox 
> >>>>plan to support it?
> >>>>        
> >>>RD is not supported in hardware today. Implementing RD is extremely 
> >>>complicated. To solve the scalability issues on MPI like applications
> >>>we believe that SRC and SSQ are the right solutions. It is much simpler
> >>>for implementation by both software and hardware. By MPI-like I refer
> >>>to applications that have some level of trust between two processes of
> >>>the
> >>>same application. RD also has some performance issues as it only 
> >>>supports one message in the air. Those performance issues are solved
> >>>by design in SRC/SSQ.
> >>>
> >>>      
> >>Didn't know about RD limitation. Is this shortcomings of IB spec or
> >>general limitation of reliable datagram? RD looks much nice to me then 
> >>SRC/SSQ.
> >>    
> >
> >I think Dror is referring to number of messages in flight per EEC and
> >number of messages in flight per QP being limited to 1 per IBA spec.
> >Number of messages enqueued per EEC/QP is implementation dependent.
> >
> >-- Hal
> >  
> Correct. The number of messages in flight per EEC is 1 per IB spec.
> The fact that IB requires SQ WQEs to complete in order, even if their 
> destination is different EECs, makes it pretty challenging to have an 
> implementation that can really process more than one message 
> simultaneously per QP.

Hmm, I guess this requirement could easily be relaxed - in a way
similar to what was done for SRQ - without breaking applications.
WRID is sufficient to identify the WR even without ordering guarantees.
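
To illustrate the point about WRID, here is a rough sketch of an application
that matches completions to requests by wr_id alone, so the order in which
completions arrive does not matter. The names (struct request,
struct req_table, req_alloc(), req_complete(), MAX_OUTSTANDING) are
hypothetical, and the wr_id is assumed to be a plain slot index chosen by the
application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8   /* hypothetical limit on in-flight requests */

/* Bookkeeping for one posted work request. */
struct request {
    int   in_use;
    void *buf;    /* buffer to release once the request completes */
};

struct req_table {
    struct request slot[MAX_OUTSTANDING];
};

/*
 * Reserve a slot for a request about to be posted.
 * Returns the value to use as wr_id, or UINT64_MAX if the table is full.
 */
static uint64_t req_alloc(struct req_table *t, void *buf)
{
    for (uint64_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (!t->slot[i].in_use) {
            t->slot[i].in_use = 1;
            t->slot[i].buf = buf;
            return i;
        }
    }
    return UINT64_MAX;
}

/*
 * Handle a completion carrying `wr_id`, in whatever order it arrives.
 * Frees the slot and returns the associated buffer, or NULL if the
 * wr_id does not name an outstanding request.
 */
static void *req_complete(struct req_table *t, uint64_t wr_id)
{
    void *buf;

    if (wr_id >= MAX_OUTSTANDING || !t->slot[wr_id].in_use)
        return NULL;
    buf = t->slot[wr_id].buf;
    t->slot[wr_id].in_use = 0;
    t->slot[wr_id].buf = NULL;
    return buf;
}
```

An application written this way never relies on the position of a completion
in the CQ, only on the wr_id it carries.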

-- 
MST


From glebn at voltaire.com  Mon Jul  2 06:03:49 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 2 Jul 2007 16:03:49 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070702130057.GB17858@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
	<20070702130057.GB17858@mellanox.co.il>
Message-ID: <20070702130349.GJ17699@minantech.com>

On Mon, Jul 02, 2007 at 04:00:57PM +0300, Michael S. Tsirkin wrote:
> > >>>RD is not supported in hardware today. Implementing RD is extremely 
> > >>>complicated. To solve the scalability issues on MPI like applications
> > >>>we believe that SRC and SSQ are the right solutions. It is much simpler
> > >>>for implementation by both software and hardware. By MPI-like I refer
> > >>>to applications that have some level of trust between two processes of
> > >>>the
> > >>>same application. RD also has some performance issues as it only 
> > >>>supports one message in the air. Those performance issues are solved
> > >>>by design in SRC/SSQ.
> > >>>
> > >>>      
> > >>Didn't know about RD limitation. Is this shortcomings of IB spec or
> > >>general limitation of reliable datagram? RD looks much nice to me then 
> > >>SRC/SSQ.
> > >>    
> > >
> > >I think Dror is referring to number of messages in flight per EEC and
> > >number of messages in flight per QP being limited to 1 per IBA spec.
> > >Number of messages enqueued per EEC/QP is implementation dependent.
> > >
> > >-- Hal
> > >  
> > Correct. The number of messages in flight per EEC is 1 per IB spec.
> > The fact that IB requires SQ WQEs to complete in order, even if their 
> > destination is different EECs, makes it pretty challenging to have an 
> > implementation that can really process more than one message 
> > simultaneously per QP.
> 
> Hmm, I guess this requirement could easily be relaxed - in a way
> similiar to what was done for SRQ - without breaking applications.
Especially as there are no applications that use RD, because there is no
HCA that supports it.

--
			Gleb.


From ogerlitz at voltaire.com  Mon Jul  2 06:07:25 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 02 Jul 2007 16:07:25 +0300
Subject: [ofa-general] IPoIB-CM UC mode 
Message-ID: <4688F88D.4090806@voltaire.com>

Dror,

can you please clarify

A) if the IBTA change to allow attaching SRQ to UC QPs is done?
B) when it would be possible for you guys to support SRQ/UC in the FW?

Michael,

If Dror says yes on both... what would it take to implement IPoIB-CM/UC?

Is there any --other-- part of the stack (eg mthca,cm) that needs to be 
enhanced for that?

Or.


From halr at voltaire.com  Mon Jul  2 06:29:10 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jul 2007 09:29:10 -0400
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <4688F671.40408@dev.mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
Message-ID: <1183382948.4377.136789.camel@hal.voltaire.com>

On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote:
> Hal Rosenstock wrote:
> > On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
> >   
> >> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> >>     
> >>>> SSQ is needed for scalability, no need to explain this (by 
> >>>> the way RD is needed for the same reason too. What's Mellanox 
> >>>> plan to support it?
> >>>>         
> >>> RD is not supported in hardware today. Implementing RD is extremely 
> >>> complicated. To solve the scalability issues on MPI like applications
> >>> we believe that SRC and SSQ are the right solutions. It is much simpler
> >>> for implementation by both software and hardware. By MPI-like I refer
> >>> to applications that have some level of trust between two processes of
> >>> the
> >>> same application. RD also has some performance issues as it only 
> >>> supports one message in the air. Those performance issues are solved
> >>> by design in SRC/SSQ.
> >>>
> >>>       
> >> Didn't know about RD limitation. Is this shortcomings of IB spec or
> >> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
> >>     
> >
> > I think Dror is referring to number of messages in flight per EEC and
> > number of messages in flight per QP being limited to 1 per IBA spec.
> > Number of messages enqueued per EEC/QP is implementation dependent.
> >
> > -- Hal
> >   
> Correct. The number of messages in flight per EEC is 1 per IB spec.
> The fact that IB requires SQ WQEs to complete in order, even if their 
> destination is different EECs,

Where's this requirement in the spec (and could this be relaxed as it
seems like it is overly "specified") ? Just wondering...

-- Hal

>  makes it pretty challenging to have an 
> implementation that can really process more than one message 
> simultaneously per QP.
> 
> > [snip...]


From halr at voltaire.com  Mon Jul  2 06:32:25 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Jul 2007 09:32:25 -0400
Subject: [ofa-general] Re: [PATCH] opensm: use osm_get_node/port_by_guid()
	funcs
In-Reply-To: <20070630210503.GA14390@sashak.voltaire.com>
References: <20070630210503.GA14390@sashak.voltaire.com>
Message-ID: <1183383145.4377.137053.camel@hal.voltaire.com>

On Sat, 2007-06-30 at 17:05, Sasha Khapyorsky wrote:
> Similar to osm_get_switch_by_guid() use existing osm_get_node_by_guid()
> and osm_get_port_by_guid() helper funcs for those objects by guid
> resolving - this simplifies the flow in many cases.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From jackm at dev.mellanox.co.il  Mon Jul  2 07:36:18 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 2 Jul 2007 17:36:18 +0300
Subject: [ofa-general] [PATCH 1 of 2] mlx4:  Add new Mellanox device IDs
Message-ID: <200707021736.18855.jackm@dev.mellanox.co.il>

Add new Mellanox device IDs.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 41eafeb..0fd4a5f 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -911,6 +911,8 @@ static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
 	{ PCI_VDEVICE(MELLANOX, 0x6354) }, /* MT25408 "Hermon" QDR */
+	{ PCI_VDEVICE(MELLANOX, 0x6732) }, /* MT25408 "Hermon" DDR PCIEx-gen2 */
+	{ PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIEx-gen2 */
 	{ 0, }
 };
 

From jackm at dev.mellanox.co.il  Mon Jul  2 07:37:34 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 2 Jul 2007 17:37:34 +0300
Subject: [ofa-general] [PATCH 2 of 2] libmlx4:  Add new Mellanox device IDs
Message-ID: <200707021737.34303.jackm@dev.mellanox.co.il>

Add new Mellanox device IDs.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/src/mlx4.c b/src/mlx4.c
index 3684b50..178d214 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -65,6 +65,15 @@
 #define PCI_DEVICE_ID_MELLANOX_HERMON_QDR	0x6354
 #endif
 
+#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_DDR_PCIEX_G2
+#define PCI_DEVICE_ID_MELLANOX_HERMON_DDR_PCIEX_G2	0x6732
+#endif
+
+#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_QDR_PCIEX_G2
+#define PCI_DEVICE_ID_MELLANOX_HERMON_QDR_PCIEX_G2	0x673c
+#endif
+
+
 #define HCA(v, d) \
 	{ .vendor = PCI_VENDOR_ID_##v,			\
 	  .device = PCI_DEVICE_ID_MELLANOX_##d }
@@ -76,6 +85,8 @@ struct {
 	HCA(MELLANOX, HERMON_SDR),
 	HCA(MELLANOX, HERMON_DDR),
 	HCA(MELLANOX, HERMON_QDR),
+	HCA(MELLANOX, HERMON_DDR_PCIEX_G2),
+	HCA(MELLANOX, HERMON_QDR_PCIEX_G2),
 };
 
 static struct ibv_context_ops mlx4_ctx_ops = {


From mst at dev.mellanox.co.il  Mon Jul  2 07:53:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 17:53:28 +0300
Subject: [ofa-general] Re: IPoIB-CM UC mode
In-Reply-To: <4688F88D.4090806@voltaire.com>
References: <4688F88D.4090806@voltaire.com>
Message-ID: <20070702145328.GC17858@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: IPoIB-CM UC mode
> 
> Dror,
> 
> can you please clarify
> 
> A) if the IBTA change to allow attaching SRQ to UC QPs is done?
> B) when it would be possible for you guys to support SRQ/UC in the FW?
> 
> Michael,
> 
> If Dror says yes on both... what would it take to implement IPoIB-CM/UC?

Given hardware support, just using UC is easy.  The largest bit of work would be
to add connection liveness detection code to the active side.
Hopefully that is not too bad either.

> Is there any --other-- part of the stack (eg mthca,cm) that needs to be 
> enhanced for that?

Not a whole lot.
We need an API to detect this feature support in HW.  There could be a bit of
work in mthca to detect HW/FW support for this feature, and enable connecting UC
QPs to SRQ.  There could be a bit of debugging work in CM in case we hit some
bugs with LAP messages (which I plan to use for liveness detection).

-- 
MST


From gshipman at lanl.gov  Mon Jul  2 08:15:49 2007
From: gshipman at lanl.gov (Galen Shipman)
Date: Mon, 2 Jul 2007 09:15:49 -0600
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701121623.GD17699@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
Message-ID: <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov>


> Looking at Dror's slides, on slide 6 "Scalable Reliable
> Connection" I
> see that the wire protocol is extended to send the DST SRQ as part of a
> header.
> The receiver side then puts the completion on the appropriate CQ according to
> this field. Does your proposal address this? How? Who will put this
> additional data on the wire (HW, libibverbs, or maybe the app)? Also, I don't
> see this in Dror's slides, but completions of local operations should be
> demultiplexed to the appropriate CQ too. A WQE may contain an additional field,
> for instance, that tells where to put a completion. Once again, who
> will do the demux in your proposal (HW, libibverbs or app)? The right
> answer is almost certainly HW in both cases, so will Hermon support this?
> Or maybe you want to demultiplex everything inside libibverbs? In this
> case I want to see a design of this (preferably with performance
> analysis).
>

While I think the SRC design makes sense I also have concerns  
regarding SSQ.
As Gleb has pointed out, if the hardware doesn't do the demux then  
the application has to.
It sounds like there are two proposals to deal with this hardware  
limitation in software (sigh).

1) Process A polls the CQ; if a WQE belongs to Process B, Process A
drops the WQE in a shared memory region that Process B will poll.
      So we end up re-implementing shared memory completion semantics
all over again for SSQ. I am concerned that there is both a latency
hit (on average) and a memory scaling issue, in that multiple QPs will
now be replaced with shared memory completion fifos and a single SSQ QP.

2) Process A peeks at the CQ; if a WQE belongs to Process B, it doesn't
process it.
	This has very bad implications for real applications: having to
context switch to receive a WQE is bad.
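
To make option 1 concrete, here is a minimal single-threaded sketch of a
software demux. Everything in it is hypothetical (struct completion,
struct comp_fifo, demux(), the FIFO depth); it models the shared CQ as a
plain array and leaves out the shared-memory mapping, memory barriers and
back-pressure that a real cross-process design would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_SLOTS 8	/* hypothetical per-process FIFO depth */

/* Hypothetical completion record: the owning process and its work request ID. */
struct completion {
	int      owner;	/* index of the process that posted the work request */
	uint64_t wr_id;
};

/* Hypothetical per-process FIFO; a real design would place it in shared memory. */
struct comp_fifo {
	struct completion slot[FIFO_SLOTS];
	size_t head;	/* count of entries consumed so far */
	size_t tail;	/* count of entries produced so far */
};

/* Append one completion; returns false when the FIFO is full. */
static bool fifo_push(struct comp_fifo *f, struct completion c)
{
	if (f->tail - f->head == FIFO_SLOTS)
		return false;
	f->slot[f->tail % FIFO_SLOTS] = c;
	f->tail++;
	return true;
}

/* Remove the oldest completion; returns false when the FIFO is empty. */
static bool fifo_pop(struct comp_fifo *f, struct completion *out)
{
	if (f->head == f->tail)
		return false;
	*out = f->slot[f->head % FIFO_SLOTS];
	f->head++;
	return true;
}

/*
 * Process `self` drains n entries from the shared CQ. Its own completions
 * are counted as handled locally; every other completion is forwarded to
 * the owner's FIFO (and silently dropped if that FIFO is full, which a
 * real design could not afford). Returns the number handled locally.
 */
static size_t demux(int self, const struct completion *cq, size_t n,
		    struct comp_fifo *fifos)
{
	size_t handled = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (cq[i].owner == self)
			handled++;
		else
			fifo_push(&fifos[cq[i].owner], cq[i]);
	}
	return handled;
}
```

Even in this toy form, every completion owned by another process costs an
extra copy plus a second poll on the owner's side, which is the latency and
memory overhead described above.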

In my opinion the demux belongs in the hardware, otherwise we end up  
complicating an already complicated code base to support a feature  
which unless I am missing something will have no benefit to real  
applications.

- Galen


> --
> 			Gleb.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/ 
> openib-general


From rdreier at cisco.com  Mon Jul  2 09:25:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:25:35 -0700
Subject: [ofa-general] Re: round_jiffies()
In-Reply-To: <20070701113954.GM19343@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 1 Jul 2007 14:39:54 +0300")
References: <20070701113954.GM19343@mellanox.co.il>
Message-ID: <adazm2e93i8.fsf@cisco.com>

 > 	I started wondering whether we should be using round_jiffies
 > 	for stale connection detection work.

Yes, I've had this type of cleanup on my todo list for a while.  There
are probably quite a few places in drivers/infiniband where we should
use round_jiffies or deferrable timers/delayed work.

 - R.


From rdreier at cisco.com  Mon Jul  2 09:27:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:27:19 -0700
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <1183382948.4377.136789.camel@hal.voltaire.com> (Hal Rosenstock's
	message of "02 Jul 2007 09:29:10 -0400")
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
	<1183382948.4377.136789.camel@hal.voltaire.com>
Message-ID: <adaved293fc.fsf@cisco.com>

 > > Correct. The number of messages in flight per EEC is 1 per IB spec.
 > > The fact that IB requires SQ WQEs to complete in order, even if their 
 > > destination is different EECs,
 > 
 > Where's this requirement in the spec (and could this be relaxed as it
 > seems like it is overly "specified") ? Just wondering...

I don't think we want to relax the requirement that work requests
complete in order.  It's hard enough to get applications correct
without having to worry about out-of-order completions, and I think
specifying all the corner cases would be a nightmare.  Eg do we allow
successful completions after a completion with error?  and so on...


From rdreier at cisco.com  Mon Jul  2 09:36:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:36:54 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070630222419.GE7554@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 1 Jul 2007 01:24:19 +0300")
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com> <20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com> <20070630222419.GE7554@mellanox.co.il>
Message-ID: <adar6nq92zd.fsf@cisco.com>

 > Generally, I think it would be nice if this could work
 > in the same way as with multiple threads: a single process does
 > destroy, the rest must not use the same object after this,
 > synchronisation is up to the app.
 > 
 > But you made me realise that we need an API for non-controlling processes to
 > release the userspace resources without destroying the kernel-level object.

What is a non-controlling process?  To the uverbs code in the kernel
there is only one file structure that happens to be shared by multiple
processes.  But they are all equal.

 > > I think there are probably bugs
 > > in the locked_vm accounting in the kernel right now -- it doesn't take
 > > into account the possibility of passing context fds from one process
 > > to another.
 > 
 > Hmm, might be a good idea to fix the bugs anyway, no?

Yes, I guess we need to take a reference on the mm structure in
ib_umem_get() and only drop it after we free the umem.

 > > Should process B be
 > > able to destroy it?  What if process A is still alive -- should
 > > process B be able to destroy the QP?
 > 
 > I think in practice a single process will do this.
 > My approach generally is: let's have same rules as for multiple threads.

I don't think it's quite as simple as saying that it's just like
multiple threads.  Creating/destroying QPs from a PD shared by
multiple processes opens lots of problems.  Let's take the mthca case:
the userspace driver needs to have a table of QPN -> QP struct so that
it can look up which QP a completion belongs to.

This means that if process A and process B share QP X, then X has to
be in the QP table of both processes.  OK, that's fine, when QP X gets
passed from A to B, then B can put it in the table.  But what happens
when B destroys QP X?  How does process A know to take X out of its table?
What if process A has died in the meantime?

Or what if process A and process B share CQ 1, and process B creates
QP Y in a non-shared PD but attaches it to CQ 1?  What happens when
process A polls a completion for QP Y from CQ 1?

 > > I guess we need this to be able to re-mmap doorbell pages etc, right?
 > > I wonder if there's a better way around that... maybe extending the
 > > kernel interface so that unrelated processes can share a context, eg
 > > by putting contexts in a filesystem or something like that.
 > 
 > Hmm, I have no objection in principle, however this would mean
 > we'd have to change the kernel-user interface again. The proposed
 > API extensions can mostly be done in userspace only.
 > 
 > And it seems to me like much more work than just letting the app
 > use unix domain sockets. What are the advantages of this approach?
 > 
 > Further, since there is already an existing kernel interface for this,
 > should we be inventing our own?

The advantage is that sharing objects in a filesystem by doing open(),
protected by permissions etc. is much more familiar than passing fds
through sockets.  I'm not sure it makes sense but shared memory + unix
domain socket fd passing is not a very natural way for most people to
program.

 - R.


From rdreier at cisco.com  Mon Jul  2 09:38:17 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:38:17 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701121623.GD17699@minantech.com> (Gleb Natapov's message of
	"Sun, 1 Jul 2007 15:16:23 +0300")
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com> <20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com> <20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
Message-ID: <adamyye92x2.fsf@cisco.com>

Based on what Gleb is saying, I think I agree with Dror: let's get SRC
designed and then think about SSQ.  And then if that generalizes
further, we can do that -- but I don't think going for full generality
immediately looks like it is producing something that anyone is
interested in using, and it opens a ton of very difficult problems.


From rdreier at cisco.com  Mon Jul  2 09:42:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:42:14 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps
	wrong address
In-Reply-To: <1183142276.18911.337.camel@brick.pathscale.com> (Ralph
	Campbell's message of "Fri, 29 Jun 2007 11:37:56 -0700")
References: <1183142276.18911.337.camel@brick.pathscale.com>
Message-ID: <adair9292qh.fsf@cisco.com>

 > -	for (; i >= 0; --i)
 > -		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 > +	for (; i > 0; --i)
 > +		ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE);

Michael -- this looks rather clearly correct to me.  Any objection to
applying it?
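
For readers without the source at hand, here is a toy userspace model of
the cleanup loop. It assumes, which the quoted hunk does not show, that
slot 0 holds the head mapping and fragment i is mapped into slot i + 1,
with i left at the index of the fragment that failed. struct toy_dma,
toy_unmap() and run_partial_error() are made-up names, not driver code.

```c
#include <stdbool.h>

#define MAX_FRAGS 4	/* hypothetical upper bound for this model */

/* Toy stand-in for DMA state: mapped[k] is true while slot k is mapped. */
struct toy_dma {
	bool mapped[MAX_FRAGS + 1];
	int  bad_unmaps;	/* unmaps of slots that were never mapped */
};

static void toy_unmap(struct toy_dma *d, int slot)
{
	if (d->mapped[slot])
		d->mapped[slot] = false;
	else
		d->bad_unmaps++;
}

/*
 * Map slot 0, then fragments 0..frags-1 into slots 1..frags, failing
 * before fragment fail_at is mapped. On failure run either the old or
 * the patched cleanup loop. Returns the number of bogus unmaps.
 * Requires frags <= MAX_FRAGS.
 */
static int run_partial_error(int frags, int fail_at, bool patched)
{
	struct toy_dma d = { .bad_unmaps = 0 };
	int i;

	d.mapped[0] = true;
	for (i = 0; i < frags; i++) {
		if (i == fail_at)
			goto partial_error;	/* slot i + 1 was never mapped */
		d.mapped[i + 1] = true;
	}
	return 0;	/* nothing failed, no cleanup needed */

partial_error:
	toy_unmap(&d, 0);
	if (patched) {
		for (; i > 0; --i)	/* slots i .. 1: only mapped ones */
			toy_unmap(&d, i);
	} else {
		for (; i >= 0; --i)	/* slots i + 1 .. 1: first is bogus */
			toy_unmap(&d, i + 1);
	}
	return d.bad_unmaps;
}
```

Under that assumption the old loop always touches one slot that was never
mapped, while the patched loop touches only slots i down to 1.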

 - R.


From rdreier at cisco.com  Mon Jul  2 09:43:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 09:43:00 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps
	wrong address
In-Reply-To: <1183142276.18911.337.camel@brick.pathscale.com> (Ralph
	Campbell's message of "Fri, 29 Jun 2007 11:37:56 -0700")
References: <1183142276.18911.337.camel@brick.pathscale.com>
Message-ID: <adaejjq92p7.fsf@cisco.com>

Ralph -- how did you find this bug?  Hit it in practice or just code review?

I'm trying to decide whether to get this into 2.6.22, or whether it
can wait for 2.6.23.

 - R.


From jsquyres at cisco.com  Mon Jul  2 09:42:33 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 2 Jul 2007 18:42:33 +0200
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630220530.GB7554@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov>
Message-ID: <9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com>

On Jul 2, 2007, at 5:15 PM, Galen Shipman wrote:

> While I think the SRC design makes sense I also have concerns  
> regarding SSQ.
> As Gleb has pointed out, if the hardware doesn't do the demux then  
> the application has to.  It sounds like there are two proposals to  
> deal with this hardware limitation in software (sigh).
>
> 1) Process A polls CQ, if WQE belongs to Process B, Process A will  
> drop the WQE in a shared memory region that Process B  will poll.  
> [snip]
> 2) Process A peeks CQ, if WQE belongs to Process B, it doesn't  
> process it [snip]
>
> In my opinion the demux belongs in the hardware, otherwise we end  
> up complicating an already complicated code base to support a  
> feature which unless I am missing something will have no benefit to  
> real applications.

I agree.  I cannot see how SSQ will be useful in Open MPI -- it makes  
the code *much* more complicated and effectively guarantees to add  
latency for the common case.  I don't see how to explain it better  
than Gleb/Galen already did.

If Mellanox wants to implement SSQ for other reasons, fine.  But  
based on the explanations so far, I don't see us using it in [Open] MPI.

-- 
Jeff Squyres
Cisco Systems


From hnguyen at linux.vnet.ibm.com  Mon Jul  2 10:19:26 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 2 Jul 2007 19:19:26 +0200
Subject: [ofa-general] idr_get_new_above() limitation?
Message-ID: <200707021919.27251.hnguyen@linux.vnet.ibm.com>

Hello,
For ehca device driver we're intending to utilize 
idr_get_new_above() and have written a test case, which I'm attaching
at the end. Basically it tries to get an idr token above a lower boundary
by calling idr_get_new_above() and then uses idr_find() to check if 
the returned token can be found. 
Here is our observation with 2.6.22-rc7 on ppc64:

Use lower boundary 0x3ffffffc
[root@xyz idr_bug]# insmod idr_test_mod.ko start=1073741820
insmod: error inserting 'idr_test_mod.ko': -1 Unknown symbol in module
[root@xyz idr_bug]# dmesg -c
i=3ffffffc token=3ffffffc t=000000003ffffffc
i=3ffffffd token=3ffffffd t=000000003ffffffd
i=3ffffffe token=3ffffffe t=000000003ffffffe
i=3fffffff token=3fffffff t=000000003fffffff
i=40000000 token=40000000 t=0000000000000000
Invalid object 0000000000000000. Expected 40000000

That means token 0x40000000 seems to be the "upper boundary" of idr_find().
However the behaviour is not consistent in that it was returned by
idr_get_new_above().

Looking at idr_find():

void *idr_find(struct idr *idp, int id)
{
	int n;
	struct idr_layer *p;

	n = idp->layers * IDR_BITS;
	p = idp->top;

	/* Mask off upper bits we don't use for the search. */
	id &= MAX_ID_MASK;

	if (id >= (1 << n))
		return NULL;

	while (n > 0 && p) {
		n -= IDR_BITS;
		p = p->ary[(id >> n) & IDR_MASK];
	}
	return((void *)p);
}
we found that the range check evaluates to true, so idr_find() returns NULL:
  layers = 5
  IDR_BITS = 6
  n = 30
  (id >= (1 << n)) = (0x40000000 >= 0x40000000) = 1

Since MAX_ID_MASK=0x7fffffff, I'm wondering if 0x40000000 is the actual
upper boundary. Any hints or suggestions are appreciated.
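
The arithmetic can be reproduced in userspace with a small sketch.
idr_find_rejects() is a made-up helper that mirrors only the range check
quoted above; IDR_BITS = 6 is the value reported in this message, and the
six-layer case is included only to show how the limit moves, not as a
claim about what the allocator should have done.

```c
#include <stdbool.h>

#define IDR_BITS 6	/* value reported above for this ppc64 build */

/*
 * Mirrors the range check in the quoted idr_find(): an id at or above
 * 1 << (layers * IDR_BITS) is rejected before the tree is walked.
 * A 64-bit shift is used so that layers = 6 (n = 36) stays well defined.
 */
static bool idr_find_rejects(int layers, unsigned int id)
{
	int n = layers * IDR_BITS;

	return id >= (1ULL << n);
}
```

With layers = 5 the limit is 1 << 30 = 0x40000000, so 0x3fffffff passes
and 0x40000000 is rejected; with a sixth layer the same id would pass.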

Thanks!
Nam


#include <linux/module.h>
#include <linux/idr.h>

MODULE_LICENSE("GPL");

int start_opt = 0x7e000000;

module_param_named(start, start_opt, int, 0);

MODULE_PARM_DESC(start,
		 "Start token for idr_get_new_above(). Default 0x7e000000");

static int __init idr_test_init(void)
{
	DEFINE_IDR(idr);
	int token, ret;
	unsigned long i;

	for (i = start_opt;  i <= MAX_ID_MASK; i++) {
		void * t;
		if (!idr_pre_get(&idr, GFP_KERNEL)) {
			printk(KERN_ERR "ERROR: Out of mem\n");
			return -ENOENT;
		}
		ret = idr_get_new_above(&idr, (void*)i, start_opt, &token);
		switch (ret) {
		case 0:
			t = idr_find(&idr, token);
			printk(KERN_ERR "i=%lx token=%x t=%p\n", i, token, t);
			if (t != (void*)i) {
				printk(KERN_ERR "Invalid object %p. Expected %lx\n",
				       t, i);
				return -ENOENT;
			}
			break;
		case -EAGAIN:
			i--;
			printk("idr_get_new_above() ret=-EAGAIN\n");
			break;
		default:
			printk(KERN_ERR "ERROR: Out of mem\n");
			break;
		}
	}
	/*
	 * return an error in any case since we don't need the module
	 * loaded anyway.
	 */
	return -ENOENT;
}

static void __exit idr_test_exit(void)
{
	printk(KERN_ERR "module exit\n");
}

module_init(idr_test_init);
module_exit(idr_test_exit);


From mst at dev.mellanox.co.il  Mon Jul  2 10:48:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 20:48:06 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <1183382948.4377.136789.camel@hal.voltaire.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
	<1183382948.4377.136789.camel@hal.voltaire.com>
Message-ID: <20070702174806.GE17858@mellanox.co.il>

> Quoting Hal Rosenstock <halr at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
> 
> On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote:
> > Hal Rosenstock wrote:
> > > On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
> > >   
> > >> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> > >>     
> > >>>> SSQ is needed for scalability, no need to explain this (by 
> > >>>> the way RD is needed for the same reason too. What's Mellanox 
> > >>>> plan to support it?
> > >>>>         
> > >>> RD is not supported in hardware today. Implementing RD is extremely 
> > >>> complicated. To solve the scalability issues on MPI like applications
> > >>> we believe that SRC and SSQ are the right solutions. It is much simpler
> > >>> for implementation by both software and hardware. By MPI-like I refer
> > >>> to applications that have some level of trust between two processes of
> > >>> the
> > >>> same application. RD also has some performance issues as it only 
> > >>> supports one message in the air. Those performance issues are solved
> > >>> by design in SRC/SSQ.
> > >>>
> > >>>       
> > >> Didn't know about RD limitation. Is this shortcomings of IB spec or
> > >> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
> > >>     
> > >
> > > I think Dror is referring to number of messages in flight per EEC and
> > > number of messages in flight per QP being limited to 1 per IBA spec.
> > > Number of messages enqueued per EEC/QP is implementation dependent.
> > >
> > > -- Hal
> > >   
> > Correct. The number of messages in flight per EEC is 1 per IB spec.
> > The fact that IB requires SQ WQEs to complete in order, even if their 
> > destination is different EECs,
> 
> Where's this requirement in the spec (and could this be relaxed as it
> seems like it is overly "specified") ? Just wondering...

For example:
	10.8.5 RETURNING COMPLETED WORK REQUESTS

	...

	Except for RD Receive Work Queues and Receive Work Queues associ-
	ated with an SRQ, Work Completions are always returned in the order
	submitted to a given Work Queue with respect to other Work Requests on 
	that Work Queue.

-- 
MST


From mst at dev.mellanox.co.il  Mon Jul  2 11:39:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 21:39:06 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adaved293fc.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
	<1183382948.4377.136789.camel@hal.voltaire.com>
	<adaved293fc.fsf@cisco.com>
Message-ID: <20070702183906.GH17858@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
> 
>  > > Correct. The number of messages in flight per EEC is 1 per IB spec.
>  > > The fact that IB requires SQ WQEs to complete in order, even if their 
>  > > destination is different EECs,
>  > 
>  > Where's this requirement in the spec (and could this be relaxed as it
>  > seems like it is overly "specified") ? Just wondering...
> 
> I don't think we want to relax the requirement that work requests
> complete in order.  It's hard enough to get applications correct
> without having to worry about out-of-order completions,

Hmm, they seem to deal fine with this in case of SRQ. Why not here?

I guess this depends on the application, but let's look at
something like IPoIB or SDP: all we do when we get a send
completion is look up a WR and free it. It won't be too hard
to deal with out-of-order completions, either. If an app uses a pointer
as WRID, it's even easier.
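
A minimal sketch of the pointer-as-WRID idea, in plain userspace C with no
verbs calls. struct send_ctx, new_send_ctx(), ctx_to_wrid() and
complete_send() are hypothetical names; the only thing taken from the
verbs API is that the work request ID is a 64-bit value.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-send bookkeeping kept until the send completes. */
struct send_ctx {
	int  id;
	int *completed;	/* counter bumped when this send completes */
};

/* Allocate the context for one posted send; returns NULL on failure. */
static struct send_ctx *new_send_ctx(int id, int *completed)
{
	struct send_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->id = id;
	ctx->completed = completed;
	return ctx;
}

/* Store the context pointer itself in the 64-bit work request ID. */
static uint64_t ctx_to_wrid(struct send_ctx *ctx)
{
	return (uint64_t)(uintptr_t)ctx;
}

/*
 * On a send completion, recover the context from wr_id and free it.
 * No lookup table is consulted, so the order in which completions
 * arrive does not matter. Returns the id of the completed send.
 */
static int complete_send(uint64_t wr_id)
{
	struct send_ctx *ctx = (struct send_ctx *)(uintptr_t)wr_id;
	int id = ctx->id;

	(*ctx->completed)++;
	free(ctx);
	return id;
}
```

Completing the second send before the first works exactly as the in-order
case would, since each completion carries everything needed to clean up.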

> and I think
> specifying all the corner cases would be a nightmare.  Eg do we allow
> successful completions after a completion with error?  and so on...

However, as Dror notes, the in-order requirement simply moves
the complexity to hardware. Which might be one of the reasons why
there are no HW implementations of RD out there.

-- 
MST


From mst at dev.mellanox.co.il  Mon Jul  2 11:46:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 21:46:30 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adar6nq92zd.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
Message-ID: <20070702184630.GI17858@mellanox.co.il>

>  > > I think there are probably bugs
>  > > in the locked_vm accounting in the kernel right now -- it doesn't take
>  > > into account the possibility of passing context fds from one process
>  > > to another.
>  > 
>  > Hmm, might be a good idea to fix the bugs anyway, no?
> 
> Yes, I guess we need to take a reference on the mm structure in
> ib_umem_get() and only drop it after we free the umem.

IMO this would create problems at process exit time.
Maybe set umem->mm at umem_get time, and, in umem_release,
just validate that umem->mm == current->mm?


-- 
MST


From mst at dev.mellanox.co.il  Mon Jul  2 11:56:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 21:56:30 +0300
Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps
	wrong address
In-Reply-To: <adair9292qh.fsf@cisco.com>
References: <1183142276.18911.337.camel@brick.pathscale.com>
	<adair9292qh.fsf@cisco.com>
Message-ID: <20070702185630.GJ17858@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address
> 
>  > -	for (; i >= 0; --i)
>  > -		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
>  > +	for (; i > 0; --i)
>  > +		ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE);
> 
> Michael -- this looks rather clearly correct to me.  Any objection to
> applying it?

Yes, the patch looks clearly correct to me.
I recently saw a crash on one system which looks like it could be related:

Call Trace:<IRQ> <ffffffffa01ff3f0>{:ib_ipoib:ipoib_cm_alloc_rx_skb+796}
       <ffffffffa01fefdf>{:ib_ipoib:ipoib_cm_post_receive+119}
       <ffffffff801d5729>{selinux_socket_sock_rcv_skb+530}
       <ffffffffa01ffc04>{:ib_ipoib:ipoib_cm_handle_rx_wc+477}
       <ffffffff802aa3af>{sock_def_readable+16} <ffffffff802e9b29>{udp_queue_rcv_skb+827}
       <ffffffff802ea016>{udp_rcv+1153} <ffffffffa01fafa0>{:ib_ipoib:ipoib_ib_completion+144}
       <ffffffff801321e3>{activate_task+124} <ffffffffa023a69f>{:ib_mthca:mthca_eq_int+215}
       <ffffffffa023adea>{:ib_mthca:mthca_arbel_interrupt+56}
       <ffffffff80112f4a>{handle_IRQ_event+41} <ffffffff801131c4>{do_IRQ+197}
       <ffffffff80110833>{ret_from_intr+0}  <EOI>

I hoped to get this patch stress tested and report whether it helps before Acking.
But it seems this won't happen soon since that system is busy.

It's probably best to apply this patch.

Acked-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>


-- 
MST


From or.gerlitz at gmail.com  Mon Jul  2 12:22:34 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Mon, 2 Jul 2007 22:22:34 +0300
Subject: [ofa-general] Re: IPoIB-CM UC mode
In-Reply-To: <20070702145328.GC17858@mellanox.co.il>
References: <4688F88D.4090806@voltaire.com>
	<20070702145328.GC17858@mellanox.co.il>
Message-ID: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>

On 7/2/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Or Gerlitz <ogerlitz at voltaire.com>:

> > Is there any --other-- part of the stack (eg mthca,cm) that needs to be
> > enhanced for that?
>
> Not a whole lot.
> We need an API to detect this feature support in HW.  There could be a bit of
> work in mthca to detect HW/FW support for this feature, and enable connecting UC
> QPs to SRQ.  There could be a bit of debugging work in CM in case we hit some
> bugs with LAP messages (which I plan to use for liveness detection).
>

Thanks for the info. Can you please elaborate a little more on the LAP based
liveness detection mechanism you were thinking about? I might want to deploy
it in another app.

Actually, looking at the IPoIB-CM RC-based implementation I don't really
understand its "liveness detection" mechanism... In ipoib_cm_send_req() I
see that the code sets both the RC QP retries AND RNR retries to 0...
Doesn't this mean that a single RNR NAK would cause a TX QP to move to
ERROR? Can you clarify how you use the "R" of "RC" here?

thanks,

Or.

From mshefty at ichips.intel.com  Mon Jul  2 12:25:58 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 02 Jul 2007 12:25:58 -0700
Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs
In-Reply-To: <Pine.LNX.4.64.0707021215130.1582@zuben>
References: <Pine.LNX.4.64.0707021215130.1582@zuben>
Message-ID: <46895146.3050700@ichips.intel.com>

> This seems to be solved with the below patch, however, i see that
> for duplicate REQs the code is much more involved, which means i
> might be over-simplifying here...

I don't think you're over-simplifying.  The REQ handling seems more 
involved because the connected state machine is more complex.

REQ handling deals with duplicate messages by waiting to set the cm id 
state.  We could do the same in the sidr req handler.

Since we're in this part of the code, I'll create a patch to fix the 
'todo' comment in this function, to ensure that both patches fit 
together cleanly.

Thanks

- Sean


From or.gerlitz at gmail.com  Mon Jul  2 12:33:25 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Mon, 2 Jul 2007 22:33:25 +0300
Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs
In-Reply-To: <46895146.3050700@ichips.intel.com>
References: <Pine.LNX.4.64.0707021215130.1582@zuben>
	<46895146.3050700@ichips.intel.com>
Message-ID: <15ddcffd0707021233h45233dd8jfc213053138a5e01@mail.gmail.com>

On 7/2/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> > This seems to be solved with the below patch, however, i see that
> > for duplicate REQs the code is much more involved, which means i
> > might be over-simplifying here...
>
> I don't think you're over-simplifying.  The REQ handling seems more
> involved because the connected state machine is more complex.


OK

> REQ handling deals with duplicate messages by waiting to set the cm id
> state.  We could do the same in the sidr req handler.


Can you clarify what "waiting to set the cm id state" means?


> Since we're in this part of the code, I'll create a patch to fix the
> 'todo' comment in this function, to ensure that both patches fit
> together cleanly.


Assuming you refer to "todo: reply with no match" in cm_sidr_req_handler,
what else needs to be added to the current code? Is it sending the REP with a
different status (ie not 2) or sending a REJ?

Or.

From gdror at dev.mellanox.co.il  Mon Jul  2 12:42:22 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 22:42:22 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com>	<20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com>	<20070630220530.GB7554@mellanox.co.il>	<20070701121623.GD17699@minantech.com>	<892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov>
	<9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com>
Message-ID: <4689551E.7020204@dev.mellanox.co.il>

Jeff Squyres wrote:
> On Jul 2, 2007, at 5:15 PM, Galen Shipman wrote:
>
>> While I think the SRC design makes sense I also have concerns 
>> regarding SSQ.
>> As Gleb has pointed out, if the hardware doesn't do the demux then 
>> the application has to.  It sounds like there are two proposals to 
>> deal with this hardware limitation in software (sigh).
>>
>> 1) Process A polls CQ, if WQE belongs to Process B, Process A will 
>> drop the WQE in a shared memory region that Process B  will poll. [snip]
>> 2) Process A peeks CQ, if WQE belongs to Process B, it doesn't 
>> process it [snip]
>>
>> In my opinion the demux belongs in the hardware, otherwise we end up 
>> complicating an already complicated code base to support a feature 
>> which unless I am missing something will have no benefit to real 
>> applications.
I agree about this deficiency and unfortunately I don't think we can do 
anything about it with the current generation. As I said before, I don't 
have quantitative data on how this might affect the overall 
performance of the application. If polling the CQ of the SQ is not in 
the critical performance path, it may end up having a negligible impact. 
But it might just as well turn out to have some impact on performance.
>
> I agree.  I cannot see how SSQ will be useful in Open MPI -- it makes 
> the code *much* more complicated and effectively guarantees to add 
> latency for the common case.  I don't see how to explain it better 
> than Gleb/Galen already did.
>
> If Mellanox wants to implement SSQ for other reasons, fine.  But based 
> on the explanations so far, I don't see us using it in [Open] MPI.
The main intention of SSQ is for MPI which dominates the large/huge 
clusters. The intention is to help in scalability, which may have some 
impact on performance in some cases. I think that for now we should 
first start with SRC, and thus significantly improve the scalability. 
Let us worry a bit later about SSQ.
>
> --Jeff Squyres
> Cisco Systems
>
>


From gdror at dev.mellanox.co.il  Mon Jul  2 12:49:44 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 22:49:44 +0300
Subject: [ofa-general] Re: IPoIB-CM UC mode
In-Reply-To: <4688F88D.4090806@voltaire.com>
References: <4688F88D.4090806@voltaire.com>
Message-ID: <468956D8.6020502@dev.mellanox.co.il>

Or Gerlitz wrote:
> Dror,
>
> can you please clarify
>
> A) if the IBTA change to allow attaching SRQ to UC QPs is done?
Unfortunately I don't think I can comment on this outside an IBTA forum. 
But as an IBTA member, you can check out the SWG mail-thread or wait for 
the next spec membership review.
> B) when it would be possible for you guys to support SRQ/UC in the FW?
You probably want support in both firmware and driver. Let me check.
>
> Michael,
>
> If Dror says yes on both... what would it take to implement IPoIB-CM/UC?
>
> Is there any --other-- part of the stack (eg mthca,cm) that needs to 
> be enhanced for that?
>
> Or.
>
>


From mst at dev.mellanox.co.il  Mon Jul  2 12:53:14 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 22:53:14 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
Message-ID: <20070702195314.GA31169@mellanox.co.il>

> > Quoting Or Gerlitz <or.gerlitz at gmail.com>:
> > Subject: Re: Re: IPoIB-CM UC mode
> > 
> > On 7/2/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
> > 
> > > Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> > 
> > > Is there any --other-- part of the stack (eg mthca,cm) that needs to be
> > > enhanced for that?
> > 
> > Not a whole lot.
> > We need an API to detect this feature support in HW.  There could be a bit of
> > work in mthca to detect HW/FW support for this feature, and enable connecting UC 
> > QPs to SRQ.  There could be a bit of debugging work in CM in case we hit some
> > bugs with LAP messages (which I plan to use for liveness detection).
> 
> 
> 
> Thanks for the info. Can you please elaborate a little more on the LAP based
> liveness detection mechanism you were thinking about? I might want to deploy
> it in another app.

With UC, if the remote side loses our QP, we get no indication whatsoever.  But
we don't want to destroy/recreate connections unless strictly necessary.

So we must send something that will force remote side to respond. One such
message is LAP with current primary path used as proposed alternate path.
Remote will respond with APR with AP status 5 if the connection is there, and
status 1 if it is not.
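
A minimal sketch of how the sender might interpret the APR, assuming only the
two status values named above (5 = connection is there, 1 = it is not). The
enum and function names are hypothetical and are not taken from the IPoIB or
CM code; treating every other status as inconclusive is also an assumption.

```c
/* Hypothetical names; only the numeric values 1 and 5 come from the text. */
enum apr_status_sketch {
	APR_SKETCH_CONN_GONE    = 1,	/* remote does not know this connection */
	APR_SKETCH_CONN_PRESENT = 5	/* proposed path is already the current one */
};

enum liveness_sketch {
	CONN_ALIVE,
	CONN_DEAD,
	CONN_UNKNOWN
};

/* Map the AP status carried in an APR reply to a liveness verdict. */
static enum liveness_sketch classify_apr(int ap_status)
{
	switch (ap_status) {
	case APR_SKETCH_CONN_PRESENT:
		return CONN_ALIVE;
	case APR_SKETCH_CONN_GONE:
		return CONN_DEAD;
	default:
		/* Assumption: anything else tells us nothing about liveness. */
		return CONN_UNKNOWN;
	}
}
```

Only a CONN_DEAD verdict would justify tearing the connection down; an
unknown status or a missing reply would presumably need a retry policy that
is not covered here.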

> Actually, looking on IPoIB-CM RC based implementation I don't really
> understand its "liveness detection" mechanism... In ipoib_cm_send_req() I see
> that the code sets both the RC QP retries AND rnr retries to 0... doesn't this
> mean that a single RNR NAK would cause a TX QP to move to ERROR?

Yes, this is from spec, BTW.
More importantly, a timeout will cause error, and we'll retry connection
on next packet.

> can you
> clarify how do you use the "R" of "RC" here?

The two reasons I used RC are:
1. UC does not support SRQ yet.
2. It's easier to detect connection is alive.

-- 
MST


From mst at dev.mellanox.co.il  Mon Jul  2 12:59:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 2 Jul 2007 22:59:28 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adar6nq92zd.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
Message-ID: <20070702195927.GB31169@mellanox.co.il>

>  > > I guess we need this to be able to re-mmap doorbell pages etc, right?
>  > > I wonder if there's a better way around that... maybe extending the
>  > > kernel interface so that unrelated processes can share a context, eg
>  > > by putting contexts in a filesystem or something like that.
> 
> The advantage is that sharing objects in a filesystem by doing open(),
> protected by permissions etc. is much more familiar than passing fds
> through sockets.  I'm not sure it makes sense but shared memory + unix
> domain socket fd passing is not a very natural way for most people to
> program.

Could you please clarify how you envision this being done?
Do we just create our own filesystem?

Reason I ask, we'll need something like this for SRC domain too ...

-- 
MST


From mshefty at ichips.intel.com  Mon Jul  2 13:03:36 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 02 Jul 2007 13:03:36 -0700
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070702195314.GA31169@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
Message-ID: <46895A18.2000100@ichips.intel.com>

> So we must send something that will force remote side to respond. One such
> message is LAP with current primary path used as proposed alternate path.
> Remote will respond with APR with AP status 5 if the connection is there, and
> status 1 if it is not.

I didn't follow this.  Is this just an out of band keep alive message? 
Why not use DREQ to indicate that the connection went away under normal 
circumstances, and a send failure in an abnormal termination case?

- Sean


From gdror at dev.mellanox.co.il  Mon Jul  2 13:08:43 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 23:08:43 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070702174806.GE17858@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>	<20070701121623.GD17699@minantech.com>	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>	<20070701190516.GB31673@minantech.com>	<1183374715.4377.127455.camel@hal.voltaire.com>	<4688F671.40408@dev.mellanox.co.il>	<1183382948.4377.136789.camel@hal.voltaire.com>
	<20070702174806.GE17858@mellanox.co.il>
Message-ID: <46895B4B.8080909@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>> Quoting Hal Rosenstock <halr at voltaire.com>:
>> Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
>>
>> On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote:
>>     
>>> Hal Rosenstock wrote:
>>>       
>>>> On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote:
>>>>   
>>>>         
>>>>> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>>>>>     
>>>>>           
>>>>>>> SSQ is needed for scalability, no need to explain this (by 
>>>>>>> the way RD is needed for the same reason too. What's Mellanox 
>>>>>>> plan to support it?
>>>>>>>         
>>>>>>>               
>>>>>> RD is not supported in hardware today. Implementing RD is extremely 
>>>>>> complicated. To solve the scalability issues on MPI like applications
>>>>>> we believe that SRC and SSQ are the right solutions. It is much simpler
>>>>>> for implementation by both software and hardware. By MPI-like I refer
>>>>>> to applications that have some level of trust between two processes of
>>>>>> the same application. RD also has some performance issues as it only 
>>>>>> supports one message in the air. Those performance issues are solved
>>>>>> by design in SRC/SSQ.
>>>>>>
>>>>>>       
>>>>>>             
>>>>> Didn't know about RD limitation. Is this shortcomings of IB spec or
>>>>> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
>>>>>     
>>>>>           
>>>> I think Dror is referring to number of messages in flight per EEC and
>>>> number of messages in flight per QP being limited to 1 per IBA spec.
>>>> Number of messages enqueued per EEC/QP is implementation dependent.
>>>>
>>>> -- Hal
>>>>   
>>>>         
>>> Correct. The number of messages in flight per EEC is 1 per IB spec.
>>> The fact that IB requires SQ WQEs to complete in order, even if their 
>>> destination is different EECs,
>>>       
>> Where's this requirement in the spec (and could this be relaxed as it
>> seems like it is overly "specified") ? Just wondering...
>>     
>
> For example:
> 	10.8.5 RETURNING COMPLETED WORK REQUESTS
>
> 	...
>
> 	Except for RD Receive Work Queues and Receive Work Queues associ-
> 	ated with an SRQ, Work Completions are always returned in the order
> 	submitted to a given Work Queue with respect to other Work Requests on 
> 	that Work Queue.
>
>   
I referred to:

o10-52: If the CI supports RD Service, Work Requests submitted to the
same RD Send Queue shall complete in the same order in which they
were submitted.

And I agree with Roland that it isn't worth breaking it. And even if 
you do want to break it, it is still a mess to actually implement it in 
hardware and that is the main reason you see no RD implementations out 
there.
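
To illustrate the in-order rule in o10-52, here is a minimal model (not from
any driver or from the spec itself; the function and parameter names are
made up). It assumes each work request has a time at which it finishes on
the wire, and computes the earliest time its completion can be reported when
completions must be returned in submission order.

```c
#include <stddef.h>

/*
 * finish[i]: time at which work request i is done (arbitrary units).
 * report[i]: earliest time completion i can be returned if completions
 *            must come back in the order the requests were submitted.
 */
static void in_order_report_times(const unsigned int *finish,
				  unsigned int *report, size_t n)
{
	unsigned int latest = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* A request cannot complete before everything ahead of it. */
		if (finish[i] > latest)
			latest = finish[i];
		report[i] = latest;
	}
}
```

With finish times {10, 2, 3}, all three completions are reported at time 10:
the two fast requests wait behind the slow one even if they target different
destinations.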


From or.gerlitz at gmail.com  Mon Jul  2 13:13:42 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Mon, 2 Jul 2007 23:13:42 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070702195314.GA31169@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
Message-ID: <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>

On 7/2/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > > Quoting Or Gerlitz <or.gerlitz at gmail.com>:


> > Thanks for the info. Can you please elaborate a little more on the LAP based
> > liveness detection mechanism you were thinking about? I might want to deploy
> > it in another app.
>
> With UC, if the remote side loses our QP, we get no indication whatsoever.  But
> we don't want to destroy/recreate connections unless strictly necessary.


why do we care if remote side lost our QP? my thinking is that we (TX QP) should
care if the remote side (RX QP) is still there, and this is achieved by RC as you
explain below.


> So we must send something that will force remote side to respond. One such
> message is LAP with current primary path used as proposed alternate path.
> Remote will respond with APR with AP status 5 if the connection is there, and
> status 1 if it is not.


got it. the current app i was referring to uses UD and not UC, so I guess
LAP is not possible.

> > Actually, looking on IPoIB-CM RC based implementation I don't really
> > understand its "liveness detection" mechanism... In ipoib_cm_send_req() I see
> > that the code sets both the RC QP retries AND rnr retries to 0... doesn't this
> > mean that a single RNR NAK would cause a TX QP to move to ERROR?
>
> Yes, this is from spec, BTW.
> More importantly, a timeout will cause error, and we'll retry connection
> on next packet.


so with the current IPoIB-CM implementation, a single RNR NAK and/or a single
ACK loss would cause re-connection, wow... this does not sound very ready for
production...

My understanding is that

A) as the IP layer is seen as unreliable, RC buys us nothing
B) the current code usage of RC
B.1) is inefficient by nature since it loads the IB fabric with ACKs and NAKs
B.2) reconnects on each loss/NAK, which adds more inefficiency

we should move to UC

am i missing something, what does RC buy us that UC does not?

> > can you
> > clarify how do you use the "R" of "RC" here?
>
> The two reasons I used RC is because
> 1. UC does not support SRQ yet.
> 2. It's easier to detect connection is alive.
>

I wanted to understand the "how" in detail and not high-level (2 above) or
env reasons (1 above)

thanks anyway,

Or.

From sean.hefty at intel.com  Mon Jul  2 13:38:00 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 13:38:00 -0700
Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs
In-Reply-To: <15ddcffd0707021233h45233dd8jfc213053138a5e01@mail.gmail.com>
Message-ID: <000501c7bce8$e191c5d0$3c98070a@amr.corp.intel.com>

> Can you clarify what "waiting to set the cm id state" means?

Something like this:

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index c7007c4..3dca385 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2794,7 +2794,6 @@ static int cm_sidr_req_handler(struct cm_work *work)
                                work->mad_recv_wc->recv_buf.grh,
                                &cm_id_priv->av);
        cm_id_priv->id.remote_id = sidr_req_msg->request_id;
-       cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
        cm_id_priv->tid = sidr_req_msg->hdr.tid;
        atomic_inc(&cm_id_priv->work_count);

@@ -2813,6 +2812,7 @@ static int cm_sidr_req_handler(struct cm_work *work)
                /* todo: reply with no match */
                goto out; /* No match. */
        }
+       cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
        atomic_inc(&cur_cm_id_priv->refcount);
        spin_unlock_irq(&cm.lock);

 
> Assuming you refer to "todo: reply with no match" in cm_sidr_req_handler,
> what else need to be added to the current code? is  it sending the REP
> with a different status (ie not 2) or sending a REJ?

This is all that needs to be done.  The status should be 1, not 2.  At this
point, it's likely just a couple lines to fix.

- Sean


From sean.hefty at intel.com  Mon Jul  2 14:00:19 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 14:00:19 -0700
Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support
In-Reply-To: <20070702051116.GB5018@mellanox.co.il>
Message-ID: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com>

Add SA client support for notice/trap registration using InformInfo.
Clients can use the ib_sa interface to register for SA events based
on trap numbers, and receive SA event notification.  This allows
clients to receive notification, such as GID in/out of service.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Back by popular demand!  Reposting of the local SA patches!
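
For readers who want the callback convention before reading the diff: the
handler a client registers is called with each matching notice, and a
non-zero return asks for the registration to be dropped. The following is a
userspace model of that convention only, not the kernel code; every name in
it is hypothetical, there is no locking, and the trap numbers used with it
are arbitrary example values.

```c
#include <stddef.h>

#define MAX_SUBSCRIBERS 8

/* Hypothetical stand-in for a received notice; only the trap number is kept. */
struct notice_sketch {
	unsigned short trap_number;
};

/* Same convention as the patch: a non-zero return asks to be unregistered. */
typedef int (*notice_cb)(int status, const struct notice_sketch *notice,
			 void *context);

struct subscriber_sketch {
	notice_cb callback;
	void *context;
	int active;
};

struct trap_group_sketch {
	unsigned short trap_number;
	struct subscriber_sketch subs[MAX_SUBSCRIBERS];
};

/* Returns the slot index used, or -1 if the group is full. */
static int subscribe_sketch(struct trap_group_sketch *group, notice_cb cb,
			    void *context)
{
	size_t i;

	for (i = 0; i < MAX_SUBSCRIBERS; i++) {
		if (!group->subs[i].active) {
			group->subs[i].callback = cb;
			group->subs[i].context = context;
			group->subs[i].active = 1;
			return (int)i;
		}
	}
	return -1;
}

/* Deliver a notice to every active subscriber; returns how many were called. */
static int dispatch_sketch(struct trap_group_sketch *group,
			   const struct notice_sketch *notice)
{
	int delivered = 0;
	size_t i;

	if (notice->trap_number != group->trap_number)
		return 0;

	for (i = 0; i < MAX_SUBSCRIBERS; i++) {
		struct subscriber_sketch *sub = &group->subs[i];

		if (!sub->active)
			continue;
		delivered++;
		if (sub->callback(0, notice, sub->context))
			sub->active = 0;	/* callback asked to unsubscribe */
	}
	return delivered;
}

/* Example callbacks: both count their invocations in *context. */
static int keep_listening(int status, const struct notice_sketch *notice,
			  void *context)
{
	(void)status;
	(void)notice;
	++*(int *)context;
	return 0;	/* stay subscribed */
}

static int listen_once(int status, const struct notice_sketch *notice,
		       void *context)
{
	(void)status;
	(void)notice;
	++*(int *)context;
	return 1;	/* drop the subscription after the first notice */
}
```

In the real patch the unregistration goes through
ib_sa_unregister_inform_info() and the work is serialized per group; the
model above only shows the return-value contract.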

 drivers/infiniband/core/Makefile   |    2 
 drivers/infiniband/core/notice.c   |  749 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/sa.h       |   16 +
 drivers/infiniband/core/sa_query.c |  316 +++++++++++++++
 include/rdma/ib_sa.h               |  171 ++++++++
 5 files changed, 1251 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index cb1ab3e..7c5b5ed 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
-ib_sa-y :=			sa_query.o multicast.o
+ib_sa-y :=			sa_query.o multicast.o notice.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c
new file mode 100644
index 0000000..e4c73c8
--- /dev/null
+++ b/drivers/infiniband/core/notice.c
@@ -0,0 +1,749 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/completion.h>
+#include <linux/dma-mapping.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/pci.h>
+#include <linux/bitops.h>
+#include <linux/random.h>
+
+#include "sa.h"
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static void inform_add_one(struct ib_device *device);
+static void inform_remove_one(struct ib_device *device);
+
+static struct ib_client inform_client = {
+	.name   = "ib_notice",
+	.add    = inform_add_one,
+	.remove = inform_remove_one
+};
+
+static struct ib_sa_client	sa_client;
+static struct workqueue_struct	*inform_wq;
+
+struct inform_device;
+
+struct inform_port {
+	struct inform_device	*dev;
+	spinlock_t		lock;
+	struct rb_root		table;
+	atomic_t		refcount;
+	struct completion	comp;
+	u8			port_num;
+};
+
+struct inform_device {
+	struct ib_device	*device;
+	struct ib_event_handler	event_handler;
+	int			start_port;
+	int			end_port;
+	struct inform_port	port[0];
+};
+
+enum inform_state {
+	INFORM_IDLE,
+	INFORM_REGISTERING,
+	INFORM_MEMBER,
+	INFORM_BUSY,
+	INFORM_ERROR
+};
+
+struct inform_member;
+
+struct inform_group {
+	u16			trap_number;
+	struct rb_node		node;
+	struct inform_port	*port;
+	spinlock_t		lock;
+	struct work_struct	work;
+	struct list_head	pending_list;
+	struct list_head	active_list;
+	struct list_head	notice_list;
+	struct inform_member	*last_join;
+	int			members;
+	enum inform_state	join_state; /* State relative to SA */
+	atomic_t		refcount;
+	enum inform_state	state;
+	struct ib_sa_query	*query;
+	int			query_id;
+};
+
+struct inform_member {
+	struct ib_inform_info	info;
+	struct ib_sa_client	*client;
+	struct inform_group	*group;
+	struct list_head	list;
+	enum inform_state	state;
+	atomic_t		refcount;
+	struct completion	comp;
+};
+
+struct inform_notice {
+	struct list_head	list;
+	struct ib_sa_notice	notice;
+};
+
+static void reg_handler(int status, struct ib_sa_inform *inform,
+			 void *context);
+static void unreg_handler(int status, struct ib_sa_inform *inform,
+			  void *context);
+
+static struct inform_group *inform_find(struct inform_port *port,
+					u16 trap_number)
+{
+	struct rb_node *node = port->table.rb_node;
+	struct inform_group *group;
+
+	while (node) {
+		group = rb_entry(node, struct inform_group, node);
+		if (trap_number < group->trap_number)
+			node = node->rb_left;
+		else if (trap_number > group->trap_number)
+			node = node->rb_right;
+		else
+			return group;
+	}
+	return NULL;
+}
+
+static struct inform_group *inform_insert(struct inform_port *port,
+					  struct inform_group *group)
+{
+	struct rb_node **link = &port->table.rb_node;
+	struct rb_node *parent = NULL;
+	struct inform_group *cur_group;
+
+	while (*link) {
+		parent = *link;
+		cur_group = rb_entry(parent, struct inform_group, node);
+		if (group->trap_number < cur_group->trap_number)
+			link = &(*link)->rb_left;
+		else if (group->trap_number > cur_group->trap_number)
+			link = &(*link)->rb_right;
+		else
+			return cur_group;
+	}
+	rb_link_node(&group->node, parent, link);
+	rb_insert_color(&group->node, &port->table);
+	return NULL;
+}
+
+static void deref_port(struct inform_port *port)
+{
+	if (atomic_dec_and_test(&port->refcount))
+		complete(&port->comp);
+}
+
+static void release_group(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	if (atomic_dec_and_test(&group->refcount)) {
+		rb_erase(&group->node, &port->table);
+		spin_unlock_irqrestore(&port->lock, flags);
+		kfree(group);
+		deref_port(port);
+	} else
+		spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void deref_member(struct inform_member *member)
+{
+	if (atomic_dec_and_test(&member->refcount))
+		complete(&member->comp);
+}
+
+static void queue_reg(struct inform_member *member)
+{
+	struct inform_group *group = member->group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&group->lock, flags);
+	list_add(&member->list, &group->pending_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		atomic_inc(&group->refcount);
+		queue_work(inform_wq, &group->work);
+	}
+	spin_unlock_irqrestore(&group->lock, flags);
+}
+
+static int send_reg(struct inform_group *group, struct inform_member *member)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.subscribe = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number);
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	group->last_join = member;
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, &inform, 3000, GFP_KERNEL,
+				     reg_handler, group,&group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static int send_unreg(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(group->trap_number);
+	inform.trap.generic.qpn = IB_QP1;
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, &inform, 3000, GFP_KERNEL,
+				     unreg_handler, group, &group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static void join_group(struct inform_group *group, struct inform_member *member)
+{
+	member->state = INFORM_MEMBER;
+	group->members++;
+	list_move(&member->list, &group->active_list);
+}
+
+static int fail_join(struct inform_group *group, struct inform_member *member,
+		     int status)
+{
+	spin_lock_irq(&group->lock);
+	list_del_init(&member->list);
+	spin_unlock_irq(&group->lock);
+	return member->info.callback(status, &member->info, NULL);
+}
+
+static void process_group_error(struct inform_group *group)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->active_list)) {
+		member = list_entry(group->active_list.next,
+				    struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		group->members--;
+		member->state = INFORM_ERROR;
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(-ENETRESET, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	group->join_state = INFORM_IDLE;
+	group->state = INFORM_BUSY;
+	spin_unlock_irq(&group->lock);
+}
+
+/*
+ * Report a notice to all active subscribers.  We use a temporary list to
+ * handle unsubscription requests while the notice is being reported, which
+ * avoids holding the group lock while in the user's callback.
+ */
+static void process_notice(struct inform_group *group,
+			   struct inform_notice *info_notice)
+{
+	struct inform_member *member;
+	struct list_head list;
+	int ret;
+
+	INIT_LIST_HEAD(&list);
+
+	spin_lock_irq(&group->lock);
+	list_splice_init(&group->active_list, &list);
+	while (!list_empty(&list)) {
+
+		member = list_entry(list.next, struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_move(&member->list, &group->active_list);
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(0, &member->info,
+					    &info_notice->notice);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+	spin_unlock_irq(&group->lock);
+}
+
+static void inform_work_handler(struct work_struct *work)
+{
+	struct inform_group *group;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	struct inform_notice *info_notice;
+	int status, ret;
+
+	group = container_of(work, typeof(*group), work);
+retest:
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->pending_list) ||
+	       !list_empty(&group->notice_list) ||
+	       (group->state == INFORM_ERROR)) {
+
+		if (group->state == INFORM_ERROR) {
+			spin_unlock_irq(&group->lock);
+			process_group_error(group);
+			goto retest;
+		}
+
+		if (!list_empty(&group->notice_list)) {
+			info_notice = list_entry(group->notice_list.next,
+						 struct inform_notice, list);
+			list_del(&info_notice->list);
+			spin_unlock_irq(&group->lock);
+			process_notice(group, info_notice);
+			kfree(info_notice);
+			goto retest;
+		}
+
+		member = list_entry(group->pending_list.next,
+				    struct inform_member, list);
+		info = &member->info;
+		atomic_inc(&member->refcount);
+
+		if (group->join_state == INFORM_MEMBER) {
+			join_group(group, member);
+			spin_unlock_irq(&group->lock);
+			ret = info->callback(0, info, NULL);
+		} else {
+			spin_unlock_irq(&group->lock);
+			status = send_reg(group, member);
+			if (!status) {
+				deref_member(member);
+				return;
+			}
+			ret = fail_join(group, member, status);
+		}
+
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	if (!group->members && (group->join_state == INFORM_MEMBER)) {
+		group->join_state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		if (send_unreg(group))
+			goto retest;
+	} else {
+		group->state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+}
+
+/*
+ * Fail a join request if it is still active - at the head of the pending queue.
+ */
+static void process_join_error(struct inform_group *group, int status)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	member = list_entry(group->pending_list.next,
+			    struct inform_member, list);
+	if (group->last_join == member) {
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		spin_unlock_irq(&group->lock);
+		ret = member->info.callback(status, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+	} else
+		spin_unlock_irq(&group->lock);
+}
+
+static void reg_handler(int status, struct ib_sa_inform *inform, void *context)
+{
+	struct inform_group *group = context;
+
+	if (status)
+		process_join_error(group, status);
+	else
+		group->join_state = INFORM_MEMBER;
+
+	inform_work_handler(&group->work);
+}
+
+static void unreg_handler(int status, struct ib_sa_inform *rec, void *context)
+{
+	struct inform_group *group = context;
+
+	inform_work_handler(&group->work);
+}
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	struct inform_group *group;
+	struct inform_notice *info_notice;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return 0; /* No one to give notice to. */
+
+	port = &dev->port[port_num - dev->start_port];
+	spin_lock_irq(&port->lock);
+	group = inform_find(port, __be16_to_cpu(notice->trap.
+						generic.trap_num));
+	if (!group) {
+		spin_unlock_irq(&port->lock);
+		return 0;
+	}
+
+	atomic_inc(&group->refcount);
+	spin_unlock_irq(&port->lock);
+
+	info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL);
+	if (!info_notice) {
+		release_group(group);
+		return -ENOMEM;
+	}
+
+	info_notice->notice = *notice;
+
+	spin_lock_irq(&group->lock);
+	list_add(&info_notice->list, &group->notice_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		inform_work_handler(&group->work);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	return 0;
+}
+
+static struct inform_group *acquire_group(struct inform_port *port,
+					  u16 trap_number, gfp_t gfp_mask)
+{
+	struct inform_group *group, *cur_group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	group = inform_find(port, trap_number);
+	if (group)
+		goto found;
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	group = kzalloc(sizeof *group, gfp_mask);
+	if (!group)
+		return NULL;
+
+	group->port = port;
+	group->trap_number = trap_number;
+	INIT_LIST_HEAD(&group->pending_list);
+	INIT_LIST_HEAD(&group->active_list);
+	INIT_LIST_HEAD(&group->notice_list);
+	INIT_WORK(&group->work, inform_work_handler);
+	spin_lock_init(&group->lock);
+
+	spin_lock_irqsave(&port->lock, flags);
+	cur_group = inform_insert(port, group);
+	if (cur_group) {
+		kfree(group);
+		group = cur_group;
+	} else
+		atomic_inc(&port->refcount);
+found:
+	atomic_inc(&group->refcount);
+	spin_unlock_irqrestore(&port->lock, flags);
+	return group;
+}
+
+/*
+ * We serialize all join requests to a single group to make our lives much
+ * easier.  Otherwise, two users could try to join the same group
+ * simultaneously, with different configurations, one could leave while the
+ * join is in progress, etc., which makes locking around error recovery
+ * difficult.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context)
+{
+	struct inform_device *dev;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	int ret;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+
+	member = kzalloc(sizeof *member, gfp_mask);
+	if (!member)
+		return ERR_PTR(-ENOMEM);
+
+	ib_sa_client_get(client);
+	member->client = client;
+	member->info.trap_number = trap_number;
+	member->info.callback = callback;
+	member->info.context = context;
+	init_completion(&member->comp);
+	atomic_set(&member->refcount, 1);
+	member->state = INFORM_REGISTERING;
+
+	member->group = acquire_group(&dev->port[port_num - dev->start_port],
+				      trap_number, gfp_mask);
+	if (!member->group) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * The user will get the info structure in their callback.  They
+	 * could then free the info structure before we can return from
+	 * this routine.  So we save the pointer to return before queuing
+	 * any callback.
+	 */
+	info = &member->info;
+	queue_reg(member);
+	return info;
+
+err:
+	ib_sa_client_put(member->client);
+	kfree(member);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ib_sa_register_inform_info);
+
+void ib_sa_unregister_inform_info(struct ib_inform_info *info)
+{
+	struct inform_member *member;
+	struct inform_group *group;
+
+	member = container_of(info, struct inform_member, info);
+	group = member->group;
+
+	spin_lock_irq(&group->lock);
+	if (member->state == INFORM_MEMBER)
+		group->members--;
+
+	list_del_init(&member->list);
+
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		/* Continue to hold reference on group until callback */
+		queue_work(inform_wq, &group->work);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	deref_member(member);
+	wait_for_completion(&member->comp);
+	ib_sa_client_put(member->client);
+	kfree(member);
+}
+EXPORT_SYMBOL(ib_sa_unregister_inform_info);
+
+static void inform_groups_lost(struct inform_port *port)
+{
+	struct inform_group *group;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	for (node = rb_first(&port->table); node; node = rb_next(node)) {
+		group = rb_entry(node, struct inform_group, node);
+		spin_lock(&group->lock);
+		if (group->state == INFORM_IDLE) {
+			atomic_inc(&group->refcount);
+			queue_work(inform_wq, &group->work);
+		}
+		group->state = INFORM_ERROR;
+		spin_unlock(&group->lock);
+	}
+	spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void inform_event_handler(struct ib_event_handler *handler,
+				struct ib_event *event)
+{
+	struct inform_device *dev;
+
+	dev = container_of(handler, struct inform_device, event_handler);
+
+	switch (event->event) {
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+		inform_groups_lost(&dev->port[event->element.port_num -
+					      dev->start_port]);
+		break;
+	default:
+		break;
+	}
+}
+
+static void inform_add_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
+		      GFP_KERNEL);
+	if (!dev)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
+		dev->start_port = dev->end_port = 0;
+	else {
+		dev->start_port = 1;
+		dev->end_port = device->phys_port_cnt;
+	}
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		port->dev = dev;
+		port->port_num = dev->start_port + i;
+		spin_lock_init(&port->lock);
+		port->table = RB_ROOT;
+		init_completion(&port->comp);
+		atomic_set(&port->refcount, 1);
+	}
+
+	dev->device = device;
+	ib_set_client_data(device, &inform_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, inform_event_handler);
+	ib_register_event_handler(&dev->event_handler);
+}
+
+static void inform_remove_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return;
+
+	ib_unregister_event_handler(&dev->event_handler);
+	flush_workqueue(inform_wq);
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		deref_port(port);
+		wait_for_completion(&port->comp);
+	}
+
+	kfree(dev);
+}
+
+int notice_init(void)
+{
+	int ret;
+
+	inform_wq = create_singlethread_workqueue("ib_inform");
+	if (!inform_wq)
+		return -ENOMEM;
+
+	ib_sa_register_client(&sa_client);
+
+	ret = ib_register_client(&inform_client);
+	if (ret)
+		goto err;
+	return 0;
+
+err:
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+	return ret;
+}
+
+void notice_cleanup(void)
+{
+	ib_unregister_client(&inform_client);
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+}
diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h
index 24c93fd..b8eac66 100644
--- a/drivers/infiniband/core/sa.h
+++ b/drivers/infiniband/core/sa.h
@@ -63,4 +63,20 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 int mcast_init(void);
 void mcast_cleanup(void);
 
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   struct ib_sa_inform *rec,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					    struct ib_sa_inform *resp,
+					    void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query);
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice);
+
+int notice_init(void);
+void notice_cleanup(void);
+
 #endif /* SA_H */
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 4791d01..23d1081 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -62,10 +62,12 @@ struct ib_sa_sm_ah {
 
 struct ib_sa_port {
 	struct ib_mad_agent *agent;
+	struct ib_mad_agent *notice_agent;
 	struct ib_sa_sm_ah  *sm_ah;
 	struct work_struct   update_task;
 	spinlock_t           ah_lock;
 	u8                   port_num;
+	struct ib_device    *device;
 };
 
 struct ib_sa_device {
@@ -102,6 +104,12 @@ struct ib_sa_mcmember_query {
 	struct ib_sa_query sa_query;
 };
 
+struct ib_sa_inform_query {
+	void (*callback)(int, struct ib_sa_inform *, void *);
+	void *context;
+	struct ib_sa_query sa_query;
+};
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device);
 
@@ -353,6 +361,110 @@ static const struct ib_field service_rec_table[] = {
 	  .size_bits    = 2*64 },
 };
 
+#define INFORM_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_inform, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_inform *) 0)->field, \
+	.field_name          = "sa_inform:" #field
+
+static const struct ib_field inform_table[] = {
+	{ INFORM_FIELD(gid),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+	{ INFORM_FIELD(lid_range_begin),
+	  .offset_words = 4,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(lid_range_end),
+	  .offset_words = 4,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ RESERVED,
+	  .offset_words = 5,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(is_generic),
+	  .offset_words = 5,
+	  .offset_bits  = 16,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(subscribe),
+	  .offset_words = 5,
+	  .offset_bits  = 24,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(type),
+	  .offset_words = 6,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.trap_num),
+	  .offset_words = 6,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.qpn),
+	  .offset_words = 7,
+	  .offset_bits  = 0,
+	  .size_bits    = 24 },
+	{ RESERVED,
+	  .offset_words = 7,
+	  .offset_bits  = 24,
+	  .size_bits    = 3 },
+	{ INFORM_FIELD(trap.generic.resp_time),
+	  .offset_words = 7,
+	  .offset_bits  = 27,
+	  .size_bits    = 5 },
+	{ RESERVED,
+	  .offset_words = 8,
+	  .offset_bits  = 0,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(trap.generic.producer_type),
+	  .offset_words = 8,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+};
+
+#define NOTICE_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_notice, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_notice *) 0)->field, \
+	.field_name          = "sa_notice:" #field
+
+static const struct ib_field notice_table[] = {
+	{ NOTICE_FIELD(is_generic),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(type),
+	  .offset_words = 0,
+	  .offset_bits  = 1,
+	  .size_bits    = 7 },
+	{ NOTICE_FIELD(trap.generic.producer_type),
+	  .offset_words = 0,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+	{ NOTICE_FIELD(trap.generic.trap_num),
+	  .offset_words = 1,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(issuer_lid),
+	  .offset_words = 1,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(notice_toggle),
+	  .offset_words = 2,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(notice_count),
+	  .offset_words = 2,
+	  .offset_bits  = 1,
+	  .size_bits    = 15 },
+	{ NOTICE_FIELD(data_details),
+	  .offset_words = 2,
+	  .offset_bits  = 16,
+	  .size_bits    = 432 },
+	{ NOTICE_FIELD(issuer_gid),
+	  .offset_words = 16,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+};
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -929,6 +1041,153 @@ err1:
 	return ret;
 }
 
+static void ib_sa_inform_callback(struct ib_sa_query *sa_query,
+				  int status,
+				  struct ib_sa_mad *mad)
+{
+	struct ib_sa_inform_query *query =
+		container_of(sa_query, struct ib_sa_inform_query, sa_query);
+
+	if (mad) {
+		struct ib_sa_inform rec;
+
+		ib_unpack(inform_table, ARRAY_SIZE(inform_table),
+			  mad->data, &rec);
+		query->callback(status, &rec, query->context);
+	} else
+		query->callback(status, NULL, query->context);
+}
+
+static void ib_sa_inform_release(struct ib_sa_query *sa_query)
+{
+	kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query));
+}
+
+/**
+ * ib_sa_informinfo_query - Start an InformInfo registration.
+ * @client: SA client
+ * @device: device to send query on
+ * @port_num: port number to send query on
+ * @rec: Inform record to send in query
+ * @timeout_ms: time to wait for response
+ * @gfp_mask: GFP mask to use for internal allocations
+ * @callback: function called when notice handler registration completes,
+ *   times out or is canceled
+ * @context: opaque user context passed to callback
+ * @sa_query: query context, used to cancel query
+ *
+ * This function sends an InformInfo record to register with the SA to
+ * receive notices.
+ * The callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query.  The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_informinfo_query() is negative, it is an
+ * error code.  Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   struct ib_sa_inform *rec,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					   struct ib_sa_inform *resp,
+					   void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query)
+{
+	struct ib_sa_inform_query *query;
+	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
+	struct ib_sa_port   *port;
+	struct ib_mad_agent *agent;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	if (!sa_dev)
+		return -ENODEV;
+
+	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
+	query = kmalloc(sizeof *query, gfp_mask);
+	if (!query)
+		return -ENOMEM;
+
+	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
+						     0, IB_MGMT_SA_HDR,
+						     IB_MGMT_SA_DATA, gfp_mask);
+	if (IS_ERR(query->sa_query.mad_buf)) {
+		ret = PTR_ERR(query->sa_query.mad_buf);
+		goto err1;
+	}
+
+	ib_sa_client_get(client);
+	query->sa_query.client = client;
+	query->callback = callback;
+	query->context  = context;
+
+	mad = query->sa_query.mad_buf->mad;
+	init_mad(mad, agent);
+
+	query->sa_query.callback = callback ? ib_sa_inform_callback : NULL;
+	query->sa_query.release  = ib_sa_inform_release;
+	query->sa_query.port     = port;
+	mad->mad_hdr.method	 = IB_MGMT_METHOD_SET;
+	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_INFORM_INFO);
+
+	ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data);
+
+	*sa_query = &query->sa_query;
+	ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
+	if (ret < 0)
+		goto err2;
+
+	return ret;
+
+err2:
+	*sa_query = NULL;
+	ib_sa_client_put(query->sa_query.client);
+	ib_free_send_mad(query->sa_query.mad_buf);
+err1:
+	kfree(query);
+	return ret;
+}
+
+static void ib_sa_notice_resp(struct ib_sa_port *port,
+			      struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_mad_send_buf *mad_buf;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0,
+				     IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
+				     GFP_KERNEL);
+	if (IS_ERR(mad_buf))
+		return;
+
+	mad = mad_buf->mad;
+	memcpy(mad, mad_recv_wc->recv_buf.mad, sizeof *mad);
+	mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP;
+
+	spin_lock_irq(&port->ah_lock);
+	kref_get(&port->sm_ah->ref);
+	mad_buf->context[0] = &port->sm_ah->ref;
+	mad_buf->ah = port->sm_ah->ah;
+	spin_unlock_irq(&port->ah_lock);
+
+	ret = ib_post_send_mad(mad_buf, NULL);
+	if (ret)
+		goto err;
+
+	return;
+err:
+	kref_put(mad_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_buf);
+}
+
 static void send_handler(struct ib_mad_agent *agent,
 			 struct ib_mad_send_wc *mad_send_wc)
 {
@@ -982,9 +1241,36 @@ static void recv_handler(struct ib_mad_agent *mad_agent,
 	ib_free_recv_mad(mad_recv_wc);
 }
 
+static void notice_resp_handler(struct ib_mad_agent *agent,
+				struct ib_mad_send_wc *mad_send_wc)
+{
+	kref_put(mad_send_wc->send_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_send_wc->send_buf);
+}
+
+static void notice_handler(struct ib_mad_agent *mad_agent,
+			   struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_sa_port *port;
+	struct ib_sa_mad *mad;
+	struct ib_sa_notice notice;
+
+	port = mad_agent->context;
+	mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad;
+	ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice);
+
+	if (!notice_dispatch(port->device, port->port_num, &notice))
+		ib_sa_notice_resp(port, mad_recv_wc);
+	ib_free_recv_mad(mad_recv_wc);
+}
+
 static void ib_sa_add_one(struct ib_device *device)
 {
 	struct ib_sa_device *sa_dev;
+	struct ib_mad_reg_req reg_req = {
+		.mgmt_class = IB_MGMT_CLASS_SUBN_ADM,
+		.mgmt_class_version = 2
+	};
 	int s, e, i;
 
 	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
@@ -1018,6 +1304,16 @@ static void ib_sa_add_one(struct ib_device *device)
 		if (IS_ERR(sa_dev->port[i].agent))
 			goto err;
 
+		sa_dev->port[i].device = device;
+		set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask);
+		sa_dev->port[i].notice_agent =
+			ib_register_mad_agent(device, i + s, IB_QPT_GSI,
+					      &reg_req, 0, notice_resp_handler,
+					      notice_handler, &sa_dev->port[i]);
+
+		if (IS_ERR(sa_dev->port[i].notice_agent))
+			goto err;
+
 		INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah);
 	}
 
@@ -1040,8 +1336,14 @@ static void ib_sa_add_one(struct ib_device *device)
 	return;
 
 err:
-	while (--i >= 0)
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
+	while (--i >= 0) {
+		if (!IS_ERR(sa_dev->port[i].notice_agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
+		}
+		if (!IS_ERR(sa_dev->port[i].agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
+		}
+	}
 
 	kfree(sa_dev);
 
@@ -1061,6 +1363,7 @@ static void ib_sa_remove_one(struct ib_device *device)
 	flush_scheduled_work();
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
+		ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
 		ib_unregister_mad_agent(sa_dev->port[i].agent);
 		kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
 	}
@@ -1089,7 +1392,15 @@ static int __init ib_sa_init(void)
 		goto err2;
 	}
 
+	ret = notice_init();
+	if (ret) {
+		printk(KERN_ERR "Couldn't initialize notice handling\n");
+		goto err3;
+	}
+
 	return 0;
+err3:
+	mcast_cleanup();
 err2:
 	ib_unregister_client(&sa_client);
 err1:
@@ -1099,6 +1410,7 @@ err1:
 static void __exit ib_sa_cleanup(void)
 {
 	mcast_cleanup();
+	notice_cleanup();
 	ib_unregister_client(&sa_client);
 	idr_destroy(&query_idr);
 }
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 5e26b2f..83d8157 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -254,6 +254,127 @@ struct ib_sa_service_rec {
 	u64		data64[2];
 };
 
+enum {
+	IB_SA_EVENT_TYPE_FATAL		= 0x0,
+	IB_SA_EVENT_TYPE_URGENT		= 0x1,
+	IB_SA_EVENT_TYPE_SECURITY	= 0x2,
+	IB_SA_EVENT_TYPE_SM		= 0x3,
+	IB_SA_EVENT_TYPE_INFO		= 0x4,
+	IB_SA_EVENT_TYPE_EMPTY		= 0x7F,
+	IB_SA_EVENT_TYPE_ALL		= 0xFFFF
+};
+
+enum {
+	IB_SA_EVENT_PRODUCER_TYPE_CA		= 0x1,
+	IB_SA_EVENT_PRODUCER_TYPE_SWITCH	= 0x2,
+	IB_SA_EVENT_PRODUCER_TYPE_ROUTER	= 0x3,
+	IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER	= 0x4,
+	IB_SA_EVENT_PRODUCER_TYPE_ALL		= 0xFFFFFF
+};
+
+enum {
+	IB_SA_SM_TRAP_GID_IN_SERVICE			= 64,
+	IB_SA_SM_TRAP_GID_OUT_OF_SERVICE		= 65,
+	IB_SA_SM_TRAP_CREATE_MC_GROUP			= 66,
+	IB_SA_SM_TRAP_DELETE_MC_GROUP			= 67,
+	IB_SA_SM_TRAP_PORT_CHANGE_STATE			= 128,
+	IB_SA_SM_TRAP_LINK_INTEGRITY			= 129,
+	IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN		= 130,
+	IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131,
+	IB_SA_SM_TRAP_BAD_M_KEY				= 256,
+	IB_SA_SM_TRAP_BAD_P_KEY				= 257,
+	IB_SA_SM_TRAP_BAD_Q_KEY				= 258,
+	IB_SA_SM_TRAP_SWITCH_BAD_P_KEY			= 259,
+	IB_SA_SM_TRAP_ALL				= 0xFFFF
+};
+
+struct ib_sa_inform {
+	union ib_gid	gid;
+	__be16		lid_range_begin;
+	__be16		lid_range_end;
+	u8		is_generic;
+	u8		subscribe;
+	__be16		type;
+	union {
+		struct {
+			__be16	trap_num;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	producer_type;
+		} generic;
+		struct {
+			__be16	device_id;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	vendor_id;
+		} vendor;
+	} trap;
+};
+
+struct ib_sa_notice {
+	u8		is_generic;
+	u8		type;
+	union {
+		struct {
+			__be32	producer_type;
+			__be16	trap_num;
+		} generic;
+		struct {
+			__be32	vendor_id;
+			__be16	device_id;
+		} vendor;
+	} trap;
+	__be16		issuer_lid;
+	__be16		notice_count;
+	u8		notice_toggle;
+	/*
+	 * Align data 16 bits off 64 bit field to match InformInfo definition.
+	 * Data contained within this field will then align properly.
+	 * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1.
+	 */
+	u8		reserved[5];
+	u8		data_details[54];
+	union ib_gid	issuer_gid;
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_GID_IN_SERVICE		= 64
+ * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE	= 65
+ * IB_SA_SM_TRAP_CREATE_MC_GROUP	= 66
+ * IB_SA_SM_TRAP_DELETE_MC_GROUP	= 67
+ */
+struct ib_sa_notice_data_gid {
+	u8	reserved[6];
+	u8	gid[16];
+	u8	padding[32];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_PORT_CHANGE_STATE	= 128
+ */
+struct ib_sa_notice_data_port_change {
+	__be16	lid;
+	u8	padding[52];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_LINK_INTEGRITY			= 129
+ * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN	= 130
+ * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131
+ */
+struct ib_sa_notice_data_port_error {
+	u8	reserved[2];
+	__be16	lid;
+	u8	port_num;
+	u8	padding[49];
+};
+
 struct ib_sa_client {
 	atomic_t users;
 	struct completion comp;
@@ -382,4 +503,54 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
 			 struct ib_sa_path_rec *rec,
 			 struct ib_ah_attr *ah_attr);
 
+struct ib_inform_info {
+	void		*context;
+	int		(*callback)(int status,
+				    struct ib_inform_info *info,
+				    struct ib_sa_notice *notice);
+	u16		trap_number;
+};
+
+/**
+ * ib_sa_register_inform_info - Registers to receive notice events.
+ * @client: SA client associated with the registration.
+ * @device: Device associated with the registration.
+ * @port_num: Port on the specified device to associate with the registration.
+ * @trap_number: InformInfo trap number to register for.
+ * @gfp_mask: GFP mask for memory allocations.
+ * @callback: User callback invoked once the registration completes and to
+ *   report noticed events.
+ * @context: User specified context stored with the ib_inform_info structure.
+ *
+ * This call initiates a registration request with the SA for the specified
+ * trap number.  If the operation is started successfully, it returns an
+ * ib_inform_info structure that is used to track the registration.  Users
+ * must free this structure by calling ib_sa_unregister_inform_info, even if
+ * the operation later fails (that is, the callback status is non-zero).
+ *
+ * If the registration fails, the callback status will be non-zero.  If the
+ * registration succeeds, the callback status will be zero and the notice
+ * parameter will be NULL.  A non-NULL notice parameter indicates that a
+ * trap or notice is being reported to the user.
+ *
+ * A status of -ENETRESET indicates an error that requires reregistration.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context);
+
+/**
+ * ib_sa_unregister_inform_info - Releases an InformInfo registration.
+ * @info: InformInfo registration tracking structure.
+ *
+ * This call blocks until the registration request is destroyed.  It may
+ * not be called from within the registration callback.
+ */
+void ib_sa_unregister_inform_info(struct ib_inform_info *info);
+
 #endif /* IB_SA_H */


From sean.hefty at intel.com  Mon Jul  2 14:02:08 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 14:02:08 -0700
Subject: [ofa-general] [PATCH 2/2] ib/sa: Add local SA path record caching
In-Reply-To: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com>
Message-ID: <000701c7bcec$40c933a0$3c98070a@amr.corp.intel.com>

Query and store path records locally to decrease path record query time
and to reduce flooding of the SA during the start-up of large clustered jobs.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Now, this version is a thing of beauty.
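
For readers skimming the thread, here is a rough userspace sketch of what
"least used" selection among several cached paths to one destination could
look like.  It is an illustration only: struct cached_path, its id field and
pick_least_used() are made-up names, not code from this patch.  The only
things borrowed from the patch are the idea of a singly linked list of
records per destination and a per-record lookups counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cached path record (not the patch's type). */
struct cached_path {
	struct cached_path *next;	/* next path to the same destination */
	unsigned long lookups;		/* times this path has been handed out */
	int id;				/* illustrative label only */
};

/*
 * Return the path with the fewest lookups and charge the lookup to it.
 * Ties go to the entry nearest the head.  Returns NULL for an empty list.
 */
static struct cached_path *pick_least_used(struct cached_path *head)
{
	struct cached_path *cur, *best = head;

	for (cur = head; cur; cur = cur->next)
		if (cur->lookups < best->lookups)
			best = cur;

	if (best)
		best->lookups++;
	return best;
}
```

Successive calls spread across the available paths as their counters even
out, which is the behaviour one would expect from a least-used policy.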

 drivers/infiniband/core/Makefile    |    2 
 drivers/infiniband/core/local_sa.c  | 1275 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/multicast.c |   50 -
 drivers/infiniband/core/sa.h        |   23 +
 drivers/infiniband/core/sa_query.c  |  107 ++-
 include/rdma/ib_local_sa.h          |   83 ++
 include/rdma/ib_sa.h                |    3 
 7 files changed, 1467 insertions(+), 76 deletions(-)

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 7c5b5ed..f646040 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
-ib_sa-y :=			sa_query.o multicast.o notice.o
+ib_sa-y :=			sa_query.o multicast.o notice.o local_sa.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c
new file mode 100644
index 0000000..6c073a3
--- /dev/null
+++ b/drivers/infiniband/core/local_sa.c
@@ -0,0 +1,1275 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/dma-mapping.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/miscdevice.h>
+#include <linux/random.h>
+
+#include <rdma/ib_cache.h>
+#include <rdma/ib_sa.h>
+#include "sa.h"
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("InfiniBand subnet administration caching");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+	SA_DB_MAX_PATHS_PER_DEST = 0x7F,
+	SA_DB_MIN_RETRY_TIMER	 = 4000,  /*   4 sec */
+	SA_DB_MAX_RETRY_TIMER	 = 256000 /* 256 sec */
+};
+
+static int set_paths_per_dest(const char *val, struct kernel_param *kp);
+static unsigned long paths_per_dest = 0;
+module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong,
+		  &paths_per_dest, 0644);
+MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve "
+				 "to each destination (DGID).  Set to 0 "
+				 "to disable cache.");
+
+static int set_subscribe_inform_info(const char *val, struct kernel_param *kp);
+static int subscribe_inform_info = 1;
+module_param_call(subscribe_inform_info, set_subscribe_inform_info,
+		  param_get_bool, &subscribe_inform_info, 0644);
+MODULE_PARM_DESC(subscribe_inform_info,
+		 "Subscribe for SA InformInfo/Notice events.");
+
+static int do_refresh(const char *val, struct kernel_param *kp);
+module_param_call(refresh, do_refresh, NULL, NULL, 0200);
+
+static unsigned long retry_timer = SA_DB_MIN_RETRY_TIMER;
+
+enum sa_db_lookup_method {
+	SA_DB_LOOKUP_LEAST_USED,
+	SA_DB_LOOKUP_RANDOM
+};
+
+static int set_lookup_method(const char *val, struct kernel_param *kp);
+static int get_lookup_method(char *buf, struct kernel_param *kp);
+static unsigned long lookup_method;
+module_param_call(lookup_method, set_lookup_method, get_lookup_method,
+		  &lookup_method, 0644);
+MODULE_PARM_DESC(lookup_method, "Method used to return path records when "
+				"multiple paths exist to a given destination.");
+
+static void sa_db_add_dev(struct ib_device *device);
+static void sa_db_remove_dev(struct ib_device *device);
+
+static struct ib_client sa_db_client = {
+	.name   = "local_sa",
+	.add    = sa_db_add_dev,
+	.remove = sa_db_remove_dev
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(lock);
+static rwlock_t rwlock;
+static struct workqueue_struct *sa_wq;
+static struct ib_sa_client sa_client;
+
+enum sa_db_state {
+	SA_DB_IDLE,
+	SA_DB_REFRESH,
+	SA_DB_DESTROY
+};
+
+struct sa_db_port {
+	struct sa_db_device	*dev;
+	struct ib_mad_agent	*agent;
+	/* Limit number of outstanding MADs to SA to reduce SA flooding */
+	struct ib_mad_send_buf	*msg;
+	u16			sm_lid;
+	u8			sm_sl;
+	struct ib_inform_info	*in_info;
+	struct ib_inform_info	*out_info;
+	struct rb_root		paths;
+	struct list_head	update_list;
+	unsigned long		update_id;
+	enum sa_db_state	state;
+	struct work_struct	work;
+	union ib_gid		gid;
+	int			port_num;
+};
+
+struct sa_db_device {
+	struct list_head	list;
+	struct ib_device	*device;
+	struct ib_event_handler event_handler;
+	int			start_port;
+	int			port_count;
+	struct sa_db_port	port[0];
+};
+
+struct ib_sa_iterator {
+	struct ib_sa_iterator	*next;
+};
+
+struct ib_sa_attr_iter {
+	struct ib_sa_iterator	*iter;
+	unsigned long		flags;
+};
+
+struct ib_sa_attr_list {
+	struct ib_sa_iterator	iter;
+	struct ib_sa_iterator	*tail;
+	unsigned long		update_id;
+	union ib_gid		gid;
+	struct rb_node		node;
+};
+
+struct ib_path_rec_info {
+	struct ib_sa_iterator	iter; /* keep first */
+	struct ib_sa_path_rec	rec;
+	unsigned long		lookups;
+};
+
+struct ib_sa_mad_iter {
+	struct ib_mad_recv_wc	*recv_wc;
+	struct ib_mad_recv_buf	*recv_buf;
+	int			attr_size;
+	int			attr_offset;
+	int			data_offset;
+	int			data_left;
+	void			*attr;
+	u8			attr_data[0];
+};
+
+enum sa_update_type {
+	SA_UPDATE_FULL,
+	SA_UPDATE_ADD,
+	SA_UPDATE_REMOVE
+};
+
+struct update_info {
+	struct list_head	list;
+	union ib_gid		gid;
+	enum sa_update_type	type;
+};
+
+struct sa_path_request {
+	struct work_struct	work;
+	struct ib_sa_client	*client;
+	void			(*callback)(int, struct ib_sa_path_rec *, void *);
+	void			*context;
+	struct ib_sa_path_rec	path_rec;
+};
+
+static void process_updates(struct sa_db_port *port);
+
+static void free_attr_list(struct ib_sa_attr_list *attr_list)
+{
+	struct ib_sa_iterator *cur;
+
+	for (cur = attr_list->iter.next; cur; cur = attr_list->iter.next) {
+		attr_list->iter.next = cur->next;
+		kfree(cur);
+	}
+	attr_list->tail = &attr_list->iter;
+}
+
+static void remove_attr(struct rb_root *root, struct ib_sa_attr_list *attr_list)
+{
+	rb_erase(&attr_list->node, root);
+	free_attr_list(attr_list);
+	kfree(attr_list);
+}
+
+static void remove_all_attrs(struct rb_root *root)
+{
+	struct rb_node *node, *next_node;
+	struct ib_sa_attr_list *attr_list;
+
+	write_lock_irq(&rwlock);
+	for (node = rb_first(root); node; node = next_node) {
+		next_node = rb_next(node);
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		remove_attr(root, attr_list);
+	}
+	write_unlock_irq(&rwlock);
+}
+
+static void remove_old_attrs(struct rb_root *root, unsigned long update_id)
+{
+	struct rb_node *node, *next_node;
+	struct ib_sa_attr_list *attr_list;
+
+	write_lock_irq(&rwlock);
+	for (node = rb_first(root); node; node = next_node) {
+		next_node = rb_next(node);
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		if (attr_list->update_id != update_id)
+			remove_attr(root, attr_list);
+	}
+	write_unlock_irq(&rwlock);
+}
+
+static struct ib_sa_attr_list *insert_attr_list(struct rb_root *root,
+						struct ib_sa_attr_list *attr_list)
+{
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct ib_sa_attr_list *cur_attr_list;
+	int cmp;
+
+	while (*link) {
+		parent = *link;
+		cur_attr_list = rb_entry(parent, struct ib_sa_attr_list, node);
+		cmp = memcmp(&cur_attr_list->gid, &attr_list->gid,
+			     sizeof attr_list->gid);
+		if (cmp < 0)
+			link = &(*link)->rb_left;
+		else if (cmp > 0)
+			link = &(*link)->rb_right;
+		else
+			return cur_attr_list;
+	}
+	rb_link_node(&attr_list->node, parent, link);
+	rb_insert_color(&attr_list->node, root);
+	return NULL;
+}
+
+static struct ib_sa_attr_list *find_attr_list(struct rb_root *root, u8 *gid)
+{
+	struct rb_node *node = root->rb_node;
+	struct ib_sa_attr_list *attr_list;
+	int cmp;
+
+	while (node) {
+		attr_list = rb_entry(node, struct ib_sa_attr_list, node);
+		cmp = memcmp(&attr_list->gid, gid, sizeof attr_list->gid);
+		if (cmp < 0)
+			node = node->rb_left;
+		else if (cmp > 0)
+			node = node->rb_right;
+		else
+			return attr_list;
+	}
+	return NULL;
+}
+
+static int insert_attr(struct rb_root *root, unsigned long update_id, void *key,
+		       struct ib_sa_iterator *iter)
+{
+	struct ib_sa_attr_list *attr_list;
+	void *err;
+
+	write_lock_irq(&rwlock);
+	attr_list = find_attr_list(root, key);
+	if (!attr_list) {
+		write_unlock_irq(&rwlock);
+		attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL);
+		if (!attr_list)
+			return -ENOMEM;
+
+		attr_list->iter.next = NULL;
+		attr_list->tail = &attr_list->iter;
+		attr_list->update_id = update_id;
+		memcpy(attr_list->gid.raw, key, sizeof attr_list->gid);
+
+		write_lock_irq(&rwlock);
+		err = insert_attr_list(root, attr_list);
+		if (err) {
+			write_unlock_irq(&rwlock);
+			kfree(attr_list);
+			return -EEXIST;
+		}
+	} else if (attr_list->update_id != update_id) {
+		free_attr_list(attr_list);
+		attr_list->update_id = update_id;
+	}
+
+	attr_list->tail->next = iter;
+	iter->next = NULL;
+	attr_list->tail = iter;
+	write_unlock_irq(&rwlock);
+	return 0;
+}
+
+static struct ib_sa_mad_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_sa_mad_iter *iter;
+	struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad;
+	int attr_size, attr_offset;
+
+	attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8;
+	attr_size = 64;		/* path record length */
+	if (attr_offset < attr_size)
+		return ERR_PTR(-EINVAL);
+
+	iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL);
+	if (!iter)
+		return ERR_PTR(-ENOMEM);
+
+	iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR;
+	iter->recv_wc = mad_recv_wc;
+	iter->recv_buf = &mad_recv_wc->recv_buf;
+	iter->attr_offset = attr_offset;
+	iter->attr_size = attr_size;
+	return iter;
+}
+
+static void ib_sa_iter_free(struct ib_sa_mad_iter *iter)
+{
+	kfree(iter);
+}
+
+static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter)
+{
+	struct ib_sa_mad *mad;
+	int left, offset = 0;
+
+	while (iter->data_left >= iter->attr_offset) {
+		while (iter->data_offset < IB_MGMT_SA_DATA) {
+			mad = (struct ib_sa_mad *) iter->recv_buf->mad;
+
+			left = IB_MGMT_SA_DATA - iter->data_offset;
+			if (left < iter->attr_size) {
+				/* copy first piece of the attribute */
+				iter->attr = &iter->attr_data;
+				memcpy(iter->attr,
+				       &mad->data[iter->data_offset], left);
+				offset = left;
+				break;
+			} else if (offset) {
+				/* copy the second piece of the attribute */
+				memcpy(iter->attr + offset, &mad->data[0],
+				       iter->attr_size - offset);
+				iter->data_offset = iter->attr_size - offset;
+				offset = 0;
+			} else {
+				iter->attr = &mad->data[iter->data_offset];
+				iter->data_offset += iter->attr_size;
+			}
+
+			iter->data_left -= iter->attr_offset;
+			goto out;
+		}
+		iter->data_offset = 0;
+		iter->recv_buf = list_entry(iter->recv_buf->list.next,
+					    struct ib_mad_recv_buf, list);
+	}
+	iter->attr = NULL;
+out:
+	return iter->attr;
+}
+
+/*
+ * Copy path records from a received response and insert them into our cache.
+ * Path records in the MADs are in network order, packed, and may
+ * span multiple MAD buffers, just to make our life hard.
+ */
+static void update_path_db(struct sa_db_port *port,
+			   struct ib_mad_recv_wc *mad_recv_wc,
+			   enum sa_update_type type)
+{
+	struct ib_sa_mad_iter *iter;
+	struct ib_path_rec_info *path_info;
+	void *attr;
+	int ret;
+
+	iter = ib_sa_iter_create(mad_recv_wc);
+	if (IS_ERR(iter))
+		return;
+
+	port->update_id += (type == SA_UPDATE_FULL);
+
+	while ((attr = ib_sa_iter_next(iter)) &&
+	       (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) {
+
+		ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC);
+
+		ret = insert_attr(&port->paths, port->update_id,
+				  path_info->rec.dgid.raw, &path_info->iter);
+		if (ret) {
+			kfree(path_info);
+			break;
+		}
+	}
+	ib_sa_iter_free(iter);
+
+	if (type == SA_UPDATE_FULL)
+		remove_old_attrs(&port->paths, port->update_id);
+}
+
+static struct ib_mad_send_buf *get_sa_msg(struct sa_db_port *port,
+					  struct update_info *update)
+{
+	struct ib_ah_attr ah_attr;
+	struct ib_mad_send_buf *msg;
+
+	msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR,
+				 IB_MGMT_SA_DATA, GFP_KERNEL);
+	if (IS_ERR(msg))
+		return NULL;
+
+	memset(&ah_attr, 0, sizeof ah_attr);
+	ah_attr.dlid = port->sm_lid;
+	ah_attr.sl = port->sm_sl;
+	ah_attr.port_num = port->port_num;
+
+	msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr);
+	if (IS_ERR(msg->ah)) {
+		ib_free_send_mad(msg);
+		return NULL;
+	}
+
+	msg->timeout_ms = retry_timer;
+	msg->retries = 0;
+	msg->context[0] = port;
+	msg->context[1] = update;
+	return msg;
+}
+
+static __be64 form_tid(u32 hi_tid)
+{
+	static atomic_t tid;
+	return cpu_to_be64((((u64) hi_tid) << 32) |
+			   ((u32) atomic_inc_return(&tid)));
+}
+
+static void format_path_req(struct sa_db_port *port,
+			    struct update_info *update,
+			    struct ib_mad_send_buf *msg)
+{
+	struct ib_sa_mad *mad = msg->mad;
+	struct ib_sa_path_rec path_rec;
+
+	mad->mad_hdr.base_version  = IB_MGMT_BASE_VERSION;
+	mad->mad_hdr.mgmt_class	   = IB_MGMT_CLASS_SUBN_ADM;
+	mad->mad_hdr.class_version = IB_SA_CLASS_VERSION;
+	mad->mad_hdr.method	   = IB_SA_METHOD_GET_TABLE;
+	mad->mad_hdr.attr_id	   = cpu_to_be16(IB_SA_ATTR_PATH_REC);
+	mad->mad_hdr.tid	   = form_tid(msg->mad_agent->hi_tid);
+
+	mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH;
+
+	memset(&path_rec, 0, sizeof path_rec);
+	path_rec.sgid = port->gid;
+	path_rec.numb_path = (u8) paths_per_dest;
+
+	if (update->type == SA_UPDATE_ADD) {
+		mad->sa_hdr.comp_mask |= IB_SA_PATH_REC_DGID;
+		memcpy(&path_rec.dgid, &update->gid, sizeof path_rec.dgid);
+	}
+
+	ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC);
+}
+
+static int send_query(struct sa_db_port *port,
+		      struct update_info *update)
+{
+	int ret;
+
+	port->msg = get_sa_msg(port, update);
+	if (!port->msg)
+		return -ENOMEM;
+
+	format_path_req(port, update, port->msg);
+
+	ret = ib_post_send_mad(port->msg, NULL);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	ib_destroy_ah(port->msg->ah);
+	ib_free_send_mad(port->msg);
+	return ret;
+}
+
+static void add_update(struct sa_db_port *port, u8 *gid,
+		       enum sa_update_type type)
+{
+	struct update_info *update;
+
+	update = kmalloc(sizeof *update, GFP_KERNEL);
+	if (update) {
+		if (gid)
+			memcpy(&update->gid, gid, sizeof update->gid);
+		update->type = type;
+		list_add(&update->list, &port->update_list);
+	}
+
+	if (port->state == SA_DB_IDLE) {
+		port->state = SA_DB_REFRESH;
+		process_updates(port);
+	}
+}
+
+static void clean_update_list(struct sa_db_port *port)
+{
+	struct update_info *update;
+
+	while (!list_empty(&port->update_list)) {
+		update = list_entry(port->update_list.next,
+				    struct update_info, list);
+		list_del(&update->list);
+		kfree(update);
+	}
+}
+
+static int notice_handler(int status, struct ib_inform_info *info,
+			  struct ib_sa_notice *notice)
+{
+	struct sa_db_port *port = info->context;
+	struct ib_sa_notice_data_gid *gid_data;
+	struct ib_inform_info **pinfo;
+	enum sa_update_type type;
+
+	if (info->trap_number == IB_SA_SM_TRAP_GID_IN_SERVICE) {
+		pinfo = &port->in_info;
+		type = SA_UPDATE_ADD;
+	} else {
+		pinfo = &port->out_info;
+		type = SA_UPDATE_REMOVE;
+	}
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY || !*pinfo) {
+		mutex_unlock(&lock);
+		return 0;
+	}
+
+	if (notice) {
+		gid_data = (struct ib_sa_notice_data_gid *)
+			   &notice->data_details;
+		add_update(port, gid_data->gid, type);
+		mutex_unlock(&lock);
+	} else if (status == -ENETRESET) {
+		*pinfo = NULL;
+		mutex_unlock(&lock);
+	} else {
+		if (status)
+			*pinfo = ERR_PTR(-EINVAL);
+		port->state = SA_DB_IDLE;
+		clean_update_list(port);
+		mutex_unlock(&lock);
+		queue_work(sa_wq, &port->work);
+	}
+
+	return status;
+}
+
+static int reg_in_info(struct sa_db_port *port)
+{
+	int ret = 0;
+
+	port->in_info = ib_sa_register_inform_info(&sa_client,
+						   port->dev->device,
+						   port->port_num,
+						   IB_SA_SM_TRAP_GID_IN_SERVICE,
+						   GFP_KERNEL, notice_handler,
+						   port);
+	if (IS_ERR(port->in_info))
+		ret = PTR_ERR(port->in_info);
+
+	return ret;
+}
+
+static int reg_out_info(struct sa_db_port *port)
+{
+	int ret = 0;
+
+	port->out_info = ib_sa_register_inform_info(&sa_client,
+						    port->dev->device,
+						    port->port_num,
+						    IB_SA_SM_TRAP_GID_OUT_OF_SERVICE,
+						    GFP_KERNEL, notice_handler,
+						    port);
+	if (IS_ERR(port->out_info))
+		ret = PTR_ERR(port->out_info);
+
+	return ret;
+}
+
+static void unsubscribe_port(struct sa_db_port *port)
+{
+	if (port->in_info && !IS_ERR(port->in_info))
+		ib_sa_unregister_inform_info(port->in_info);
+
+	if (port->out_info && !IS_ERR(port->out_info))
+		ib_sa_unregister_inform_info(port->out_info);
+
+	port->out_info = NULL;
+	port->in_info = NULL;
+
+}
+
+static void cleanup_port(struct sa_db_port *port)
+{
+	unsubscribe_port(port);
+
+	clean_update_list(port);
+	remove_all_attrs(&port->paths);
+}
+
+static int update_port_info(struct sa_db_port *port)
+{
+	struct ib_port_attr port_attr;
+	int ret;
+
+	ret = ib_query_port(port->dev->device, port->port_num, &port_attr);
+	if (ret)
+		return ret;
+
+	if (port_attr.state != IB_PORT_ACTIVE)
+		return -ENODATA;
+
+	ret = ib_get_cached_gid(port->dev->device, port->port_num,
+				0, &port->gid);
+	if (ret)
+		return ret;
+
+	port->sm_lid = port_attr.sm_lid;
+	port->sm_sl = port_attr.sm_sl;
+	return 0;
+}
+
+static void process_updates(struct sa_db_port *port)
+{
+	struct update_info *update;
+	struct ib_sa_attr_list *attr_list;
+	int ret;
+
+	if (!paths_per_dest || update_port_info(port)) {
+		cleanup_port(port);
+		goto out;
+	}
+
+	/* Event registration is an optimization, so ignore failures. */
+	if (subscribe_inform_info) {
+		if (!port->out_info) {
+			ret = reg_out_info(port);
+			if (!ret)
+				return;
+		}
+
+		if (!port->in_info) {
+			ret = reg_in_info(port);
+			if (!ret)
+				return;
+		}
+	} else
+		unsubscribe_port(port);
+
+	while (!list_empty(&port->update_list)) {
+		update = list_entry(port->update_list.next,
+				    struct update_info, list);
+
+		if (update->type == SA_UPDATE_REMOVE) {
+			write_lock_irq(&rwlock);
+			attr_list = find_attr_list(&port->paths,
+						   update->gid.raw);
+			if (attr_list)
+				remove_attr(&port->paths, attr_list);
+			write_unlock_irq(&rwlock);
+		} else {
+			ret = send_query(port, update);
+			if (!ret)
+				return;
+
+		}
+		list_del(&update->list);
+		kfree(update);
+	}
+out:
+	port->state = SA_DB_IDLE;
+}
+
+static void refresh_port_db(struct sa_db_port *port)
+{
+	if (port->state == SA_DB_DESTROY)
+		return;
+
+	if (port->state == SA_DB_REFRESH) {
+		clean_update_list(port);
+		ib_cancel_mad(port->agent, port->msg);
+	}
+
+	add_update(port, NULL, SA_UPDATE_FULL);
+}
+
+static void refresh_dev_db(struct sa_db_device *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->port_count; i++)
+		refresh_port_db(&dev->port[i]);
+}
+
+static void refresh_db(void)
+{
+	struct sa_db_device *dev;
+
+	list_for_each_entry(dev, &dev_list, list)
+		refresh_dev_db(dev);
+}
+
+static int do_refresh(const char *val, struct kernel_param *kp)
+{
+	mutex_lock(&lock);
+	refresh_db();
+	mutex_unlock(&lock);
+	return 0;
+}
+
+static int get_lookup_method(char *buf, struct kernel_param *kp)
+{
+	return sprintf(buf,
+		       "%c %d round robin\n"
+		       "%c %d random",
+		       (lookup_method == SA_DB_LOOKUP_LEAST_USED) ? '*' : ' ',
+		       SA_DB_LOOKUP_LEAST_USED,
+		       (lookup_method == SA_DB_LOOKUP_RANDOM) ? '*' : ' ',
+		       SA_DB_LOOKUP_RANDOM);
+}
+
+static int set_lookup_method(const char *val, struct kernel_param *kp)
+{
+	unsigned long method;
+	int ret = 0;
+
+	method = simple_strtoul(val, NULL, 0);
+
+	switch (method) {
+	case SA_DB_LOOKUP_LEAST_USED:
+	case SA_DB_LOOKUP_RANDOM:
+		lookup_method = method;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int set_paths_per_dest(const char *val, struct kernel_param *kp)
+{
+	int ret;
+
+	mutex_lock(&lock);
+	ret = param_set_ulong(val, kp);
+	if (ret)
+		goto out;
+
+	if (paths_per_dest > SA_DB_MAX_PATHS_PER_DEST)
+		paths_per_dest = SA_DB_MAX_PATHS_PER_DEST;
+	refresh_db();
+out:
+	mutex_unlock(&lock);
+	return ret;
+}
+
+static int set_subscribe_inform_info(const char *val, struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_bool(val, kp);
+	if (ret)
+		return ret;
+
+	return do_refresh(val, kp);
+}
+
+static void port_work_handler(struct work_struct *work)
+{
+	struct sa_db_port *port;
+
+	port = container_of(work, typeof(*port), work);
+	mutex_lock(&lock);
+	refresh_port_db(port);
+	mutex_unlock(&lock);
+}
+
+static void handle_event(struct ib_event_handler *event_handler,
+			 struct ib_event *event)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+
+	dev = container_of(event_handler, typeof(*dev), event_handler);
+	port = &dev->port[event->element.port_num - dev->start_port];
+
+	switch (event->event) {
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+	case IB_EVENT_PKEY_CHANGE:
+	case IB_EVENT_PORT_ACTIVE:
+		queue_work(sa_wq, &port->work);
+		break;
+	default:
+		break;
+	}
+}
+
+static void ib_free_path_iter(struct ib_sa_attr_iter *iter)
+{
+	read_unlock_irqrestore(&rwlock, iter->flags);
+}
+
+static int ib_create_path_iter(struct ib_device *device, u8 port_num,
+			       union ib_gid *dgid, struct ib_sa_attr_iter *iter)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+	struct ib_sa_attr_list *list;
+
+	dev = ib_get_client_data(device, &sa_db_client);
+	if (!dev)
+		return -ENODEV;
+
+	port = &dev->port[port_num - dev->start_port];
+
+	read_lock_irqsave(&rwlock, iter->flags);
+	list = find_attr_list(&port->paths, dgid->raw);
+	if (!list) {
+		ib_free_path_iter(iter);
+		return -ENODATA;
+	}
+
+	iter->iter = &list->iter;
+	return 0;
+}
+
+static struct ib_sa_path_rec *ib_get_next_path(struct ib_sa_attr_iter *iter)
+{
+	struct ib_path_rec_info *next_path;
+
+	iter->iter = iter->iter->next;
+	if (iter->iter) {
+		next_path = container_of(iter->iter, struct ib_path_rec_info, iter);
+		return &next_path->rec;
+	} else
+		return NULL;
+}
+
+static int cmp_rec(struct ib_sa_path_rec *src,
+		   struct ib_sa_path_rec *dst, ib_sa_comp_mask comp_mask)
+{
+	/* DGID check already done */
+	if (comp_mask & IB_SA_PATH_REC_SGID &&
+	    memcmp(&src->sgid, &dst->sgid, sizeof src->sgid))
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_DLID && src->dlid != dst->dlid)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_SLID && src->slid != dst->slid)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_RAW_TRAFFIC &&
+	    src->raw_traffic != dst->raw_traffic)
+		return -EINVAL;
+
+	if (comp_mask & IB_SA_PATH_REC_FLOW_LABEL &&
+	    src->flow_label != dst->flow_label)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_HOP_LIMIT &&
+	    src->hop_limit != dst->hop_limit)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS &&
+	    src->traffic_class != dst->traffic_class)
+		return -EINVAL;
+	if (comp_mask & IB_SA_PATH_REC_REVERSIBLE &&
+	    dst->reversible && !src->reversible)
+		return -EINVAL;
+	/* Numb path check already done */
+	if (comp_mask & IB_SA_PATH_REC_PKEY && src->pkey != dst->pkey)
+		return -EINVAL;
+
+	if (comp_mask & IB_SA_PATH_REC_SL && src->sl != dst->sl)
+		return -EINVAL;
+
+	if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_MTU_SELECTOR,
+				 IB_SA_PATH_REC_MTU, dst->mtu_selector,
+				 src->mtu, dst->mtu))
+		return -EINVAL;
+	if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_RATE_SELECTOR,
+				 IB_SA_PATH_REC_RATE, dst->rate_selector,
+				 src->rate, dst->rate))
+		return -EINVAL;
+	if (ib_sa_check_selector(comp_mask,
+				 IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR,
+				 IB_SA_PATH_REC_PACKET_LIFE_TIME,
+				 dst->packet_life_time_selector,
+				 src->packet_life_time, dst->packet_life_time))
+		return -EINVAL;
+
+	return 0;
+}
+
+static struct ib_sa_path_rec *get_random_path(struct ib_sa_attr_iter *iter,
+					      struct ib_sa_path_rec *req_path,
+					      ib_sa_comp_mask comp_mask)
+{
+	struct ib_sa_path_rec *path, *rand_path = NULL;
+	int num, count = 0;
+
+	for (path = ib_get_next_path(iter); path;
+	     path = ib_get_next_path(iter)) {
+		if (!cmp_rec(path, req_path, comp_mask)) {
+			get_random_bytes(&num, sizeof num);
+			if ((num % ++count) == 0)
+				rand_path = path;
+		}
+	}
+
+	return rand_path;
+}
+
+static struct ib_sa_path_rec *get_next_path(struct ib_sa_attr_iter *iter,
+					    struct ib_sa_path_rec *req_path,
+					    ib_sa_comp_mask comp_mask)
+{
+	struct ib_path_rec_info *cur_path, *next_path = NULL;
+	struct ib_sa_path_rec *path;
+	unsigned long lookups = ~0;
+
+	for (path = ib_get_next_path(iter); path;
+	     path = ib_get_next_path(iter)) {
+		if (!cmp_rec(path, req_path, comp_mask)) {
+
+			cur_path = container_of(iter->iter, struct ib_path_rec_info,
+						iter);
+			if (cur_path->lookups < lookups) {
+				lookups = cur_path->lookups;
+				next_path = cur_path;
+			}
+		}
+	}
+
+	if (next_path) {
+		next_path->lookups++;
+		return &next_path->rec;
+	} else
+		return NULL;
+}
+
+static void report_path(struct work_struct *work)
+{
+	struct sa_path_request *req;
+
+	req = container_of(work, struct sa_path_request, work);
+	req->callback(0, &req->path_rec, req->context);
+	ib_sa_client_put(req->client);
+	kfree(req);
+}
+
+/**
+ * ib_sa_path_rec_get - Start a Path get query
+ * @client:SA client
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:Path Record to send in query
+ * @comp_mask:component mask to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when query completes, times out or is
+ * canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * Send a Path Record Get query to the SA to look up a path.  The
+ * callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query.  The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_path_rec_get() is negative, it is an
+ * error code.  Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_path_rec_get(struct ib_sa_client *client,
+		       struct ib_device *device, u8 port_num,
+		       struct ib_sa_path_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, gfp_t gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_path_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **sa_query)
+{
+	struct sa_path_request *req;
+	struct ib_sa_attr_iter iter;
+	struct ib_sa_path_rec *path_rec;
+	int ret;
+
+	if (!paths_per_dest)
+		goto query_sa;
+
+	if (!(comp_mask & IB_SA_PATH_REC_DGID) ||
+	    !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1)
+		goto query_sa;
+
+	req = kmalloc(sizeof *req, gfp_mask);
+	if (!req)
+		goto query_sa;
+
+	ret = ib_create_path_iter(device, port_num, &rec->dgid, &iter);
+	if (ret)
+		goto free_req;
+
+	if (lookup_method == SA_DB_LOOKUP_RANDOM)
+		path_rec = get_random_path(&iter, rec, comp_mask);
+	else
+		path_rec = get_next_path(&iter, rec, comp_mask);
+
+	if (!path_rec)
+		goto free_iter;
+
+	memcpy(&req->path_rec, path_rec, sizeof *path_rec);
+	ib_free_path_iter(&iter);
+
+	INIT_WORK(&req->work, report_path);
+	req->client = client;
+	req->callback = callback;
+	req->context = context;
+
+	ib_sa_client_get(client);
+	queue_work(sa_wq, &req->work);
+	*sa_query = ERR_PTR(-EEXIST);
+	return 0;
+
+free_iter:
+	ib_free_path_iter(&iter);
+free_req:
+	kfree(req);
+query_sa:
+	return ib_sa_path_rec_query(client, device, port_num, rec, comp_mask,
+				    timeout_ms, gfp_mask, callback, context,
+				    sa_query);
+}
+EXPORT_SYMBOL(ib_sa_path_rec_get);
+
+static void recv_handler(struct ib_mad_agent *mad_agent,
+			 struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct sa_db_port *port;
+	struct update_info *update;
+	struct ib_mad_send_buf *msg;
+	enum sa_update_type type;
+
+	msg = (struct ib_mad_send_buf *) (unsigned long) mad_recv_wc->wc->wr_id;
+	port = msg->context[0];
+	update = msg->context[1];
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY ||
+	    update != list_entry(port->update_list.next,
+				 struct update_info, list)) {
+		mutex_unlock(&lock);
+	} else {
+		type = update->type;
+		mutex_unlock(&lock);
+		update_path_db(mad_agent->context, mad_recv_wc, type);
+	}
+
+	ib_free_recv_mad(mad_recv_wc);
+}
+
+static void send_handler(struct ib_mad_agent *agent,
+			 struct ib_mad_send_wc *mad_send_wc)
+{
+	struct ib_mad_send_buf *msg;
+	struct sa_db_port *port;
+	struct update_info *update;
+	int ret;
+
+	msg = mad_send_wc->send_buf;
+	port = msg->context[0];
+	update = msg->context[1];
+
+	mutex_lock(&lock);
+	if (port->state == SA_DB_DESTROY)
+		goto unlock;
+
+	if (update == list_entry(port->update_list.next,
+				 struct update_info, list)) {
+
+		if (mad_send_wc->status == IB_WC_RESP_TIMEOUT_ERR &&
+		    msg->timeout_ms < SA_DB_MAX_RETRY_TIMER) {
+
+			msg->timeout_ms <<= 1;
+			ret = ib_post_send_mad(msg, NULL);
+			if (!ret) {
+				mutex_unlock(&lock);
+				return;
+			}
+		}
+		list_del(&update->list);
+		kfree(update);
+	}
+	process_updates(port);
+unlock:
+	mutex_unlock(&lock);
+
+	ib_destroy_ah(msg->ah);
+	ib_free_send_mad(msg);
+}
+
+static int init_port(struct sa_db_device *dev, int port_num)
+{
+	struct sa_db_port *port;
+	int ret = 0;
+
+	port = &dev->port[port_num - dev->start_port];
+	port->dev = dev;
+	port->port_num = port_num;
+	INIT_WORK(&port->work, port_work_handler);
+	port->paths = RB_ROOT;
+	INIT_LIST_HEAD(&port->update_list);
+
+	port->agent = ib_register_mad_agent(dev->device, port_num, IB_QPT_GSI,
+					    NULL, IB_MGMT_RMPP_VERSION,
+					    send_handler, recv_handler, port);
+	if (IS_ERR(port->agent))
+		ret = PTR_ERR(port->agent);
+
+	return ret;
+}
+
+static void destroy_port(struct sa_db_port *port)
+{
+	mutex_lock(&lock);
+	port->state = SA_DB_DESTROY;
+	mutex_unlock(&lock);
+
+	ib_unregister_mad_agent(port->agent);
+	cleanup_port(port);
+	flush_workqueue(sa_wq);
+}
+
+static void sa_db_add_dev(struct ib_device *device)
+{
+	struct sa_db_device *dev;
+	struct sa_db_port *port;
+	int s, e, i, ret;
+
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
+		s = e = 0;
+	} else {
+		s = 1;
+		e = device->phys_port_cnt;
+	}
+
+	dev = kzalloc(sizeof *dev + (e - s + 1) * sizeof *port, GFP_KERNEL);
+	if (!dev)
+		return;
+
+	dev->start_port = s;
+	dev->port_count = e - s + 1;
+	dev->device = device;
+	for (i = 0; i < dev->port_count; i++) {
+		ret = init_port(dev, s + i);
+		if (ret)
+			goto err;
+	}
+
+	ib_set_client_data(device, &sa_db_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event);
+
+	mutex_lock(&lock);
+	list_add_tail(&dev->list, &dev_list);
+	refresh_dev_db(dev);
+	mutex_unlock(&lock);
+
+	ib_register_event_handler(&dev->event_handler);
+	return;
+err:
+	while (i--)
+		destroy_port(&dev->port[i]);
+	kfree(dev);
+}
+
+static void sa_db_remove_dev(struct ib_device *device)
+{
+	struct sa_db_device *dev;
+	int i;
+
+	dev = ib_get_client_data(device, &sa_db_client);
+	if (!dev)
+		return;
+
+	ib_unregister_event_handler(&dev->event_handler);
+	flush_workqueue(sa_wq);
+
+	for (i = 0; i < dev->port_count; i++)
+		destroy_port(&dev->port[i]);
+
+	mutex_lock(&lock);
+	list_del(&dev->list);
+	mutex_unlock(&lock);
+
+	kfree(dev);
+}
+
+int sa_db_init(void)
+{
+	int ret;
+
+	rwlock_init(&rwlock);
+	sa_wq = create_singlethread_workqueue("local_sa");
+	if (!sa_wq)
+		return -ENOMEM;
+
+	ib_sa_register_client(&sa_client);
+	ret = ib_register_client(&sa_db_client);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(sa_wq);
+	return ret;
+}
+
+void sa_db_cleanup(void)
+{
+	ib_unregister_client(&sa_db_client);
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(sa_wq);
+}
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 1e13ab4..f49eb75 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -238,34 +238,6 @@ static u8 get_leave_state(struct mcast_group *group)
 	return leave_state & group->rec.join_state;
 }
 
-static int check_selector(ib_sa_comp_mask comp_mask,
-			  ib_sa_comp_mask selector_mask,
-			  ib_sa_comp_mask value_mask,
-			  u8 selector, u8 src_value, u8 dst_value)
-{
-	int err;
-
-	if (!(comp_mask & selector_mask) || !(comp_mask & value_mask))
-		return 0;
-
-	switch (selector) {
-	case IB_SA_GT:
-		err = (src_value <= dst_value);
-		break;
-	case IB_SA_LT:
-		err = (src_value >= dst_value);
-		break;
-	case IB_SA_EQ:
-		err = (src_value != dst_value);
-		break;
-	default:
-		err = 0;
-		break;
-	}
-
-	return err;
-}
-
 static int cmp_rec(struct ib_sa_mcmember_rec *src,
 		   struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask)
 {
@@ -278,24 +250,24 @@ static int cmp_rec(struct ib_sa_mcmember_rec *src,
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid)
 		return -EINVAL;
-	if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR,
-			   IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector,
-			   src->mtu, dst->mtu))
+	if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR,
+				 IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector,
+				 src->mtu, dst->mtu))
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS &&
 	    src->traffic_class != dst->traffic_class)
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && src->pkey != dst->pkey)
 		return -EINVAL;
-	if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR,
-			   IB_SA_MCMEMBER_REC_RATE, dst->rate_selector,
-			   src->rate, dst->rate))
+	if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR,
+				 IB_SA_MCMEMBER_REC_RATE, dst->rate_selector,
+				 src->rate, dst->rate))
 		return -EINVAL;
-	if (check_selector(comp_mask,
-			   IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR,
-			   IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME,
-			   dst->packet_life_time_selector,
-			   src->packet_life_time, dst->packet_life_time))
+	if (ib_sa_check_selector(comp_mask,
+				 IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR,
+				 IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME,
+				 dst->packet_life_time_selector,
+				 src->packet_life_time, dst->packet_life_time))
 		return -EINVAL;
 	if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl)
 		return -EINVAL;
diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h
index b8eac66..0f19dde 100644
--- a/drivers/infiniband/core/sa.h
+++ b/drivers/infiniband/core/sa.h
@@ -48,6 +48,29 @@ static inline void ib_sa_client_put(struct ib_sa_client *client)
 		complete(&client->comp);
 }
 
+int ib_sa_check_selector(ib_sa_comp_mask comp_mask,
+			 ib_sa_comp_mask selector_mask,
+			 ib_sa_comp_mask value_mask,
+			 u8 selector, u8 src_value, u8 dst_value);
+
+int ib_sa_pack_attr(void *dst, void *src, int attr_id);
+
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id);
+
+int ib_sa_path_rec_query(struct ib_sa_client *client,
+			 struct ib_device *device, u8 port_num,
+			 struct ib_sa_path_rec *rec,
+			 ib_sa_comp_mask comp_mask,
+			 int timeout_ms, gfp_t gfp_mask,
+			 void (*callback)(int status,
+					  struct ib_sa_path_rec *resp,
+					  void *context),
+			 void *context,
+			 struct ib_sa_query **sa_query);
+
+int sa_db_init(void);
+void sa_db_cleanup(void);
+
 int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 			     struct ib_device *device, u8 port_num,
 			     u8 method,
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 23d1081..3634486 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -465,6 +465,58 @@ static const struct ib_field notice_table[] = {
 	  .size_bits    = 128 },
 };
 
+int ib_sa_check_selector(ib_sa_comp_mask comp_mask,
+			 ib_sa_comp_mask selector_mask,
+			 ib_sa_comp_mask value_mask,
+			 u8 selector, u8 src_value, u8 dst_value)
+{
+	int err;
+
+	if (!(comp_mask & selector_mask) || !(comp_mask & value_mask))
+		return 0;
+
+	switch (selector) {
+	case IB_SA_GT:
+		err = (src_value <= dst_value);
+		break;
+	case IB_SA_LT:
+		err = (src_value >= dst_value);
+		break;
+	case IB_SA_EQ:
+		err = (src_value != dst_value);
+		break;
+	default:
+		err = 0;
+		break;
+	}
+
+	return err;
+}
+
+int ib_sa_pack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id)
+{
+	switch (attr_id) {
+	case IB_SA_ATTR_PATH_REC:
+		ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -734,41 +786,16 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query)
 	kfree(container_of(sa_query, struct ib_sa_path_query, sa_query));
 }
 
-/**
- * ib_sa_path_rec_get - Start a Path get query
- * @client:SA client
- * @device:device to send query on
- * @port_num: port number to send query on
- * @rec:Path Record to send in query
- * @comp_mask:component mask to send in query
- * @timeout_ms:time to wait for response
- * @gfp_mask:GFP mask to use for internal allocations
- * @callback:function called when query completes, times out or is
- * canceled
- * @context:opaque user context passed to callback
- * @sa_query:query context, used to cancel query
- *
- * Send a Path Record Get query to the SA to look up a path.  The
- * callback function will be called when the query completes (or
- * fails); status is 0 for a successful response, -EINTR if the query
- * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error
- * occurred sending the query.  The resp parameter of the callback is
- * only valid if status is 0.
- *
- * If the return value of ib_sa_path_rec_get() is negative, it is an
- * error code.  Otherwise it is a query ID that can be used to cancel
- * the query.
- */
-int ib_sa_path_rec_get(struct ib_sa_client *client,
-		       struct ib_device *device, u8 port_num,
-		       struct ib_sa_path_rec *rec,
-		       ib_sa_comp_mask comp_mask,
-		       int timeout_ms, gfp_t gfp_mask,
-		       void (*callback)(int status,
-					struct ib_sa_path_rec *resp,
-					void *context),
-		       void *context,
-		       struct ib_sa_query **sa_query)
+int ib_sa_path_rec_query(struct ib_sa_client *client,
+			 struct ib_device *device, u8 port_num,
+			 struct ib_sa_path_rec *rec,
+			 ib_sa_comp_mask comp_mask,
+			 int timeout_ms, gfp_t gfp_mask,
+			 void (*callback)(int status,
+					  struct ib_sa_path_rec *resp,
+					  void *context),
+			 void *context,
+			 struct ib_sa_query **sa_query)
 {
 	struct ib_sa_path_query *query;
 	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
@@ -825,7 +852,6 @@ err1:
 	kfree(query);
 	return ret;
 }
-EXPORT_SYMBOL(ib_sa_path_rec_get);
 
 static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query,
 				    int status,
@@ -1398,7 +1424,15 @@ static int __init ib_sa_init(void)
 		goto err3;
 	}
 
+	ret = sa_db_init();
+	if (ret) {
+		printk(KERN_ERR "Couldn't initialize local SA\n");
+		goto err4;
+	}
+
 	return 0;
+err4:
+	notice_cleanup();
 err3:
 	mcast_cleanup();
 err2:
@@ -1409,6 +1443,7 @@ err1:
 
 static void __exit ib_sa_cleanup(void)
 {
+	sa_db_cleanup();
 	mcast_cleanup();
 	notice_cleanup();
 	ib_unregister_client(&sa_client);
diff --git a/include/rdma/ib_local_sa.h b/include/rdma/ib_local_sa.h
new file mode 100644
index 0000000..e62d8b0
--- /dev/null
+++ b/include/rdma/ib_local_sa.h
@@ -0,0 +1,83 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef IB_LOCAL_SA_H
+#define IB_LOCAL_SA_H
+
+#include <rdma/ib_sa.h>
+
+/**
+ * ib_get_path_rec - Query the local SA database for path information.
+ * @device: The local device to query.
+ * @port_num: The port of the local device being queried.
+ * @sgid: The source GID of the path record.
+ * @dgid: The destination GID of the path record.
+ * @pkey: The protection key of the path record.
+ * @rec: A reference to a path record structure that will receive a copy of
+ *   the response.
+ *
+ * Returns a copy of a path record meeting the specified criteria to the
+ * location referenced by %rec.  A return value < 0 indicates that an error
+ * occurred processing the request, or no path record was found.
+ */
+int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid,
+		    union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec);
+
+/**
+ * ib_create_path_iter - Create an iterator that may be used to walk through
+ *   a list of path records.
+ * @device: The local device to retrieve path records for.
+ * @port_num: The port of the local device.
+ * @dgid: The destination GID of the path record.
+ *
+ * This call allocates an iterator that is used to walk through a list of
+ * cached path records.  All path records accessed by the iterator will have the
+ * specified DGID.  Users should not hold the iterator for an extended period of
+ * time, and must free it by calling ib_free_attr_iter().
+ */
+struct ib_sa_attr_iter *ib_create_path_iter(struct ib_device *device,
+					    u8 port_num, union ib_gid *dgid);
+
+/**
+ * ib_free_attr_iter - Release an attribute iterator.
+ * @iter: The iterator to free.
+ */
+void ib_free_attr_iter(struct ib_sa_attr_iter *iter);
+
+/**
+ * ib_get_next_attr - Retrieve the next attribute referenced by an iterator.
+ * @iter: A reference to an iterator that points to the next attribute to
+ *   retrieve.
+ */
+void *ib_get_next_attr(struct ib_sa_attr_iter *iter);
+
+#endif /* IB_LOCAL_SA_H */
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 83d8157..ae52904 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -553,4 +553,7 @@ ib_sa_register_inform_info(struct ib_sa_client *client,
  */
 void ib_sa_unregister_inform_info(struct ib_inform_info *info);
 
+int ib_sa_pack_attr(void *dst, void *src, int attr_id);
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id);
+
 #endif /* IB_SA_H */


From rdreier at cisco.com  Mon Jul  2 14:05:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 14:05:18 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070702195927.GB31169@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 2 Jul 2007 22:59:28 +0300")
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com> <20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com> <20070630222419.GE7554@mellanox.co.il>
	<adar6nq92zd.fsf@cisco.com> <20070702195927.GB31169@mellanox.co.il>
Message-ID: <adaabue8qk1.fsf@cisco.com>

 > Could you please clarify how do you envision this done?
 > Do we just create our own filesystem?
 > 
 > Reason I ask, we'll need something like this for SRC domain too ...

I don't have a really clear idea.  "Look at spufs" is about as far as
I got.

 - R.


From sean.hefty at intel.com  Mon Jul  2 15:47:21 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 15:47:21 -0700
Subject: [ofa-general] [PATCH] ib/cm: fix handling of duplicate SIDR REQs
In-Reply-To: <Pine.LNX.4.64.0707021215130.1582@zuben>
Message-ID: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>

Fix the handling of duplicate SIDR REQs to avoid sending a reject if
one is detected.  Duplicates should simply be discarded.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
I went with moving where the state was set, as it seemed a little cleaner
to me.  The REQ_RCVD state implies that we can send a SIDR REP by calling
ib_send_cm_sidr_rep(), which is not the case.  The REQ_RCVD state also
indicates that the cm_id_priv is located in the remote_sidr_table, but the
insertion failed, so we should not try to remove the item later.

 drivers/infiniband/core/cm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index c7007c4..9135a8c 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2794,7 +2794,6 @@ static int cm_sidr_req_handler(struct cm_work *work)
 				work->mad_recv_wc->recv_buf.grh,
 				&cm_id_priv->av);
 	cm_id_priv->id.remote_id = sidr_req_msg->request_id;
-	cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
 	cm_id_priv->tid = sidr_req_msg->hdr.tid;
 	atomic_inc(&cm_id_priv->work_count);
 
@@ -2804,6 +2803,7 @@ static int cm_sidr_req_handler(struct cm_work *work)
 		spin_unlock_irq(&cm.lock);
 		goto out; /* Duplicate message. */
 	}
+	cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
 					sidr_req_msg->service_id,
 					sidr_req_msg->private_data);
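
[Editor's sketch, not part of the patch.] The ordering argument above can
be modelled in userspace C. Everything here (toy_id, toy_table,
toy_insert, toy_handle_req) is a hypothetical stand-in for cm_id_priv,
the remote_sidr_table and the handler; it only illustrates why the
"received" state should be set after the insertion succeeds, so that a
dropped duplicate never looks like it sits in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CM state and the per-request id. */
enum toy_state { TOY_IDLE, TOY_SIDR_REQ_RCVD };

struct toy_id {
	unsigned int remote_id;
	enum toy_state state;
};

#define TOY_TABLE_SIZE 8
static struct toy_id *toy_table[TOY_TABLE_SIZE];

/* Returns NULL on success, or the existing entry if remote_id is a duplicate. */
static struct toy_id *toy_insert(struct toy_id *id)
{
	size_t i, free_slot = TOY_TABLE_SIZE;

	for (i = 0; i < TOY_TABLE_SIZE; i++) {
		if (toy_table[i] && toy_table[i]->remote_id == id->remote_id)
			return toy_table[i];
		if (!toy_table[i] && free_slot == TOY_TABLE_SIZE)
			free_slot = i;
	}
	assert(free_slot < TOY_TABLE_SIZE);	/* sketch: table never fills up */
	toy_table[free_slot] = id;
	return NULL;
}

/* Returns 1 if the request was accepted, 0 if it was dropped as a duplicate. */
static int toy_handle_req(struct toy_id *id)
{
	if (toy_insert(id))
		return 0;	/* duplicate: state is left untouched */
	id->state = TOY_SIDR_REQ_RCVD;	/* set only once the id is in the table */
	return 1;
}
```

With this ordering, cleanup code can treat TOY_SIDR_REQ_RCVD as "this id
is in the table", which is the invariant the note above relies on.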


From sean.hefty at intel.com  Mon Jul  2 15:51:31 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 15:51:31 -0700
Subject: [ofa-general] [PATCH] ib/cm: send no match if a SIDR REQ does not
	match a listen
In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
Message-ID: <000901c7bcfb$8842a270$3c98070a@amr.corp.intel.com>

If a SIDR REQ does not match a listen, we should reply with status
value 1 (service ID not supported), rather than dropping through to
the default case of status 2 (rejected by service provider).

This also fixes a bug where the cm_id_priv is removed from the
remote_sidr_table twice.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/cm.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 9135a8c..9820c67 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2808,9 +2808,8 @@ static int cm_sidr_req_handler(struct cm_work *work)
 					sidr_req_msg->service_id,
 					sidr_req_msg->private_data);
 	if (!cur_cm_id_priv) {
-		rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table);
 		spin_unlock_irq(&cm.lock);
-		/* todo: reply with no match */
+		cm_reject_sidr_req(cm_id_priv, IB_SIDR_UNSUPPORTED);
 		goto out; /* No match. */
 	}
 	atomic_inc(&cur_cm_id_priv->refcount);


From akpm at linux-foundation.org  Mon Jul  2 15:56:33 2007
From: akpm at linux-foundation.org (Andrew Morton)
Date: Mon, 2 Jul 2007 15:56:33 -0700
Subject: [ofa-general] Re: idr_get_new_above() limitation?
In-Reply-To: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
Message-ID: <20070702155633.720b5667.akpm@linux-foundation.org>

On Mon, 2 Jul 2007 19:19:26 +0200
Hoang-Nam Nguyen <hnguyen at linux.vnet.ibm.com> wrote:

> For ehca device driver we're intending to utilize 
> idr_get_new_above() and have written a test case, which I'm attaching
> at the end. Basically it tries to get an idr token above a lower boundary
> by calling idr_get_new_above() and then uses idr_find() to check if 
> the returned token can be found. 
> Here is our observation with 2.6.22-rc7 on ppc64:
> 
> Use lower boundary 0x3ffffffc
> [root at xyz idr_bug]# insmod idr_test_mod.ko start=1073741820
> insmod: error inserting 'idr_test_mod.ko': -1 Unknown symbol in module
> [root at xyz idr_bug]# dmesg -c
> i=3ffffffc token=3ffffffc t=000000003ffffffc
> i=3ffffffd token=3ffffffd t=000000003ffffffd
> i=3ffffffe token=3ffffffe t=000000003ffffffe
> i=3fffffff token=3fffffff t=000000003fffffff
> i=40000000 token=40000000 t=0000000000000000
> Invalid object 0000000000000000. Expected 40000000
> 
> That means token 0x40000000 seems to be the "upper boundary" of idr_find().
> However the behaviour is not consistent in that it was returned by
> idr_get_new_above().
> 
> Looking at void *idr_find(struct idr *idp, int id)
> {
> 	int n;
> 	struct idr_layer *p;
> 
> 	n = idp->layers * IDR_BITS;
> 	p = idp->top;
> 
> 	/* Mask off upper bits we don't use for the search. */
> 	id &= MAX_ID_MASK;
> 
> 	if (id >= (1 << n))
> 		return NULL;
> 
> 	while (n > 0 && p) {
> 		n -= IDR_BITS;
> 		p = p->ary[(id >> n) & IDR_MASK];
> 	}
> 	return((void *)p);
> }
> we found that the if-condition has failed:
>   layers = 5
>   IDR_BITS = 6
>   n = 30
>   (id >= (1 << n)) = (0x40000000 >= 0x40000000) = 1
> 
> Since MAX_ID_MASK=0x7fffffff, I'm wondering if 0x40000000 is the actual
> upper boundary. Any hints or suggestions are appreciated.

Looks like a bug to me.  Really an IDR tree on 32-bit should go all
the way up to 0xffffffff.  Certainly up to 0x7fffffff.  And the fact
that idr_find() disagrees with idr_get_new_above() is a big hint
that the code is getting it wrong.
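
[Editor's sketch.] The range check quoted above can be reproduced in
userspace with the values reported in the thread (IDR_BITS = 6,
MAX_ID_MASK = 0x7fffffff, layers = 5). idr_find_rejects() is a
hypothetical helper, not a kernel function, and the guard for n >= 31 is
an addition of this sketch to keep the shift well defined.

```c
#include <assert.h>

#define IDR_BITS	6		/* value reported in the thread */
#define MAX_ID_MASK	0x7fffffffu	/* value reported in the thread */

/*
 * Returns 1 if the range check at the top of idr_find() would reject
 * 'id' for a tree with the given number of layers, 0 otherwise.
 */
static int idr_find_rejects(unsigned int id, int layers)
{
	int n = layers * IDR_BITS;

	id &= MAX_ID_MASK;
	if (n >= 31)		/* sketch-only guard: every masked id fits */
		return 0;
	return id >= (1u << n);
}
```

With layers = 5 the limit is 1 << 30 = 0x40000000, so 0x3fffffff passes
the check and 0x40000000 is rejected, matching the dmesg output above.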


From jim.houston at ccur.com  Mon Jul  2 17:31:40 2007
From: jim.houston at ccur.com (Jim Houston)
Date: Mon, 02 Jul 2007 20:31:40 -0400
Subject: [ofa-general] Re: idr_get_new_above() limitation?
In-Reply-To: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
Message-ID: <1183422700.3130.27.camel@localhost.localdomain>

On Mon, 2007-07-02 at 19:19 +0200, Hoang-Nam Nguyen wrote:

> i=3fffffff token=3fffffff t=000000003fffffff
> i=40000000 token=40000000 t=0000000000000000
> Invalid object 0000000000000000. Expected 40000000
> 
> That means token 0x40000000 seems to be the "upper boundary" of idr_find().
> However the behaviour is not consistent in that it was returned by
> idr_get_new_above().

Hi Nam,

Yes this is a bug.  Thanks for the great test module.

The problem is in idr_get_new_above_int() in the loop which
adds new layers to the top of the radix tree.  It is failing
the "layers < (MAX_LEVEL - 1)" test.  It doesn't allocate the
new layer but still calls sub_alloc() which relies on having
the new layer properly constructed.  I believe that it is
allocating the slot which corresponds to id = 0.

I believe this is an off by one error in calculating the
MAX_LEVEL value.  I will do a more careful review and post 
a fix in the next day or so.  I have been in Ottawa for OLS.
I'm flying home tomorrow.

Jim Houston - Concurrent Computer Corp.


From rdreier at cisco.com  Mon Jul  2 20:41:48 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 20:41:48 -0700
Subject: [ofa-general] [PATCH 1 of 2] mlx4:  Add new Mellanox device IDs
In-Reply-To: <200707021736.18855.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 2 Jul 2007 17:36:18 +0300")
References: <200707021736.18855.jackm@dev.mellanox.co.il>
Message-ID: <adawsxi6tmr.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Mon Jul  2 20:46:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 20:46:51 -0700
Subject: [ofa-general] [PATCH 2 of 2] libmlx4: Add new Mellanox device IDs
In-Reply-To: <200707021737.34303.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 2 Jul 2007 17:37:34 +0300")
References: <200707021737.34303.jackm@dev.mellanox.co.il>
Message-ID: <adasl866tec.fsf@cisco.com>

thanks, I decided that there was no point in having these defines so I
just did it like the kernel:

commit 040743fb06cf2abf9f302ee6f5870fd3fe944868
Author: Roland Dreier <rolandd at cisco.com>
Date:   Mon Jul 2 20:45:40 2007 -0700

    Add new device IDs for PCIe gen2 HCAs
    
    Also just use hex device IDs plus comments instead of creating defines
    that are only used once.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/src/mlx4.c b/src/mlx4.c
index 3684b50..b2e2ba9 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -53,29 +53,19 @@
 #define PCI_VENDOR_ID_MELLANOX			0x15b3
 #endif
 
-#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_SDR
-#define PCI_DEVICE_ID_MELLANOX_HERMON_SDR	0x6340
-#endif
-
-#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_DDR
-#define PCI_DEVICE_ID_MELLANOX_HERMON_DDR	0x634a
-#endif
-
-#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_QDR
-#define PCI_DEVICE_ID_MELLANOX_HERMON_QDR	0x6354
-#endif
-
 #define HCA(v, d) \
 	{ .vendor = PCI_VENDOR_ID_##v,			\
-	  .device = PCI_DEVICE_ID_MELLANOX_##d }
+	  .device = d }
 
 struct {
 	unsigned		vendor;
 	unsigned		device;
 } hca_table[] = {
-	HCA(MELLANOX, HERMON_SDR),
-	HCA(MELLANOX, HERMON_DDR),
-	HCA(MELLANOX, HERMON_QDR),
+	HCA(MELLANOX, 0x6340),	/* MT25408 "Hermon" SDR */
+	HCA(MELLANOX, 0x634a),	/* MT25408 "Hermon" DDR */
+	HCA(MELLANOX, 0x6354),	/* MT25408 "Hermon" QDR */
+	HCA(MELLANOX, 0x6732),	/* MT25408 "Hermon" DDR PCIe gen2 */
+	HCA(MELLANOX, 0x673c),	/* MT25408 "Hermon" QDR PCIe gen2 */
 };
 
 static struct ibv_context_ops mlx4_ctx_ops = {
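
[Editor's sketch.] For readers unfamiliar with the table, this is roughly
how a vendor/device table built with the HCA() macro above can be
searched. The table contents are copied from the commit;
hca_is_supported() is a hypothetical helper for illustration and is not
claimed to be what libmlx4 does.

```c
#include <assert.h>
#include <stddef.h>

#define PCI_VENDOR_ID_MELLANOX	0x15b3

#define HCA(v, d) \
	{ .vendor = PCI_VENDOR_ID_##v,	\
	  .device = d }

static const struct {
	unsigned	vendor;
	unsigned	device;
} hca_table[] = {
	HCA(MELLANOX, 0x6340),	/* MT25408 "Hermon" SDR */
	HCA(MELLANOX, 0x634a),	/* MT25408 "Hermon" DDR */
	HCA(MELLANOX, 0x6354),	/* MT25408 "Hermon" QDR */
	HCA(MELLANOX, 0x6732),	/* MT25408 "Hermon" DDR PCIe gen2 */
	HCA(MELLANOX, 0x673c),	/* MT25408 "Hermon" QDR PCIe gen2 */
};

/* Hypothetical helper: returns 1 if the vendor/device pair is listed. */
static int hca_is_supported(unsigned vendor, unsigned device)
{
	size_t i;

	for (i = 0; i < sizeof hca_table / sizeof hca_table[0]; i++)
		if (hca_table[i].vendor == vendor &&
		    hca_table[i].device == device)
			return 1;
	return 0;
}
```

Because each ID appears only once, a hex literal with a comment carries
the same information as a one-off define, which is the point of the
commit.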


From rdreier at cisco.com  Mon Jul  2 20:50:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 20:50:52 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adaodiu6t7n.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get a fix for a crash in IPoIB and new device IDs for mlx4:

Jack Morgenstein (1):
      mlx4_core: Add new Mellanox device IDs

Ralph Campbell (1):
      IPoIB/cm: Partial error clean up unmaps wrong address

 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    4 ++--
 drivers/net/mlx4/main.c                 |    2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 5ffc464..ea74d1e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -148,8 +148,8 @@ partial_error:
 
 	ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE);
 
-	for (; i >= 0; --i)
-		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
+	for (; i > 0; --i)
+		ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE);
 
 	dev_kfree_skb_any(skb);
 	return NULL;
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 41eafeb..c3da2a2 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -911,6 +911,8 @@ static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
 	{ PCI_VDEVICE(MELLANOX, 0x6354) }, /* MT25408 "Hermon" QDR */
+	{ PCI_VDEVICE(MELLANOX, 0x6732) }, /* MT25408 "Hermon" DDR PCIe gen2 */
+	{ PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIe gen2 */
 	{ 0, }
 };
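
[Editor's sketch.] The IPoIB/cm off-by-one above can be modelled in
userspace. This assumes, as the indexing in the hunk suggests but the
hunk does not show, that fragment i is stored in mapping[i + 1] and that
partial_error is reached with i equal to the index of the fragment that
failed. All names (toy_maps, toy_unmap, cleanup_old, cleanup_new) are
hypothetical.

```c
#include <assert.h>

#define MAX_FRAGS 4

/* mapped[k] is 1 while slot k holds a live mapping; slot 0 is the head. */
struct toy_maps {
	int mapped[MAX_FRAGS + 2];
	int bad_unmaps;		/* unmap calls on slots that were never mapped */
};

static void toy_unmap(struct toy_maps *m, int slot)
{
	if (m->mapped[slot])
		m->mapped[slot] = 0;
	else
		m->bad_unmaps++;
}

/* Map the head and fragments 0..fail_at-1, then "fail" at index fail_at. */
static int toy_map_until_failure(struct toy_maps *m, int fail_at)
{
	int i;

	m->mapped[0] = 1;
	for (i = 0; i < fail_at; i++)
		m->mapped[i + 1] = 1;
	return i;	/* == fail_at: slot i + 1 was never mapped */
}

/* Cleanup loop as it was before the patch. */
static void cleanup_old(struct toy_maps *m, int i)
{
	toy_unmap(m, 0);
	for (; i >= 0; --i)
		toy_unmap(m, i + 1);
}

/* Cleanup loop as it is after the patch. */
static void cleanup_new(struct toy_maps *m, int i)
{
	toy_unmap(m, 0);
	for (; i > 0; --i)
		toy_unmap(m, i);
}
```

Under that assumption the old loop starts one slot too high and unmaps an
address that was never mapped, while the new loop releases exactly the
slots that were set up.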
 

From rdreier at cisco.com  Mon Jul  2 20:56:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 20:56:58 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ
	work requests
In-Reply-To: <200706211201.58440.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 12:01:58 +0300")
References: <200706211201.58440.jackm@dev.mellanox.co.il>
Message-ID: <adak5ti6sxh.fsf@cisco.com>

I trust you guys on this, but have you thought about whether blueflame
makes sense for RDMA read requests?  After all, an RDMA read requires
the responder to send potentially a large amount of data to complete,
and even for small requests I would think that latency-sensitive apps
would avoid it.  Is there an MPI implementation or other app that you
know of where this really helps?

 - R.


From yonic at voltaire.com  Mon Jul  2 22:53:20 2007
From: yonic at voltaire.com (Yonathan Cohen)
Date: Tue, 3 Jul 2007 08:53:20 +0300
Subject: [ofa-general] req_notify_cq is NULL
Message-ID: <39C75744D164D948A170E9792AF8E7CA4F236B@exil.voltaire.com>

Hello, 
I am creating a listener like so : 
	cma_id = rdma_create_id(cma_handler, my_context, RDMA_PS_TCP);

And then call bind : 
	memset(&addr, 0, sizeof addr);
	addr.sin_port = htons(port);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = INADDR_ANY;
	rdma_bind_addr(cma_id, (struct sockaddr *)&addr);

And listen : 
	rdma_listen(cma_id, 0);

But when the event handler (cma_handler) is called back, the
"struct rdma_cm_id *" has its API function "req_notify_cq" (i.e.
rdma_cm_id->device->req_notify_cq) set to NULL, although other API
functions such as create_cq and create_qp are set (I'm not sure whether
to a valid pointer).

I added a printk to mthca_register_device() (in mthca_provider.c),
which at insmod logs that "req_notify_cq" is in fact set to an address.
I'm using a Mellanox HCA: "Mellanox Technologies MT23108 InfiniHost",
so it's not mem-free and req_notify_cq is set to mthca_tavor_arm_cq.
But still, when the RDMA_CM_EVENT_CONNECT_REQUEST is received, this
function is NULL.

Please help.

__________________________________________________________
Cohen Yonatan   |  +972-9-9718607 (o)
Software. Eng,  Storage group
Voltaire - The Grid Backbone
 www.voltaire.com

  
From yonic at voltaire.com  Mon Jul  2 22:56:13 2007
From: yonic at voltaire.com (Yonathan Cohen)
Date: Tue, 3 Jul 2007 08:56:13 +0300
Subject: [ofa-general] RE: [ewg] req_notify_cq is NULL
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA4F236B@exil.voltaire.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA4F236D@exil.voltaire.com>

> -----Original Message-----
> From: ewg-bounces at lists.openfabrics.org 
> [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Yonathan Cohen
> Sent: Tuesday, July 03, 2007 8:53 AM
> To: OpenFabrics EWG; general at lists.openfabrics.org
> Subject: [ewg] req_notify_cq is NULL
> 
> Hello,
> I am creating a listener like so : 
> 	cma_id = rdma_create_id(cma_handler, my_context, RDMA_PS_TCP);
> 
> And then call bind : 
> 	memset(&addr, 0, sizeof addr);
> 	addr.sin_port = htons(port);
> 	addr.sin_family = AF_INET;
> 	addr.sin_addr.s_addr = INADDR_ANY;
> 	rdma_bind_addr(cma_id, (struct sockaddr *)&addr);
> 
> And listen : 
> 	rdma_listen(cma_id, 0);
> 
> But when the event_handler ( cma_handler )  is called-back the "struct
> rdma_cm_id* "
> Has its api func "req_notify_cq" ( i.e.
> rdma_cm_id->device->req_notify->cq ) set to NULL.
> Although, other api funcs like create_cq and create_qp are 
> set with ( im not sure with a valid pointer ) 
> 
> I added a printk to mthca_register_device() ( in 
> mthca_provider.c ) which at insmod logs that "req_notify_cq" 
> is in fact set with an address.
> Im using a mellanox HCA :  "Mellanox Technologies MT23108 InfiniHost"
> So its not memfree and req_notify_cq is set with mthca_tavor_arm_cq.
> But still when the RMDA_CM_EVENT_CONNECT_REQUEST is received 
> this func is NULL.
> 
> Please help.
> 
> __________________________________________________________
> Cohen Yonatan   |  +972-9-9718607 (o)
> Software. Eng,  Storage group
> Voltaire - The Grid Backbone
>  www.voltaire.com
> 
>   
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 

Btw, I'm using OFED 1.2 GA.

__________________________________________________________
Cohen Yonatan   |  +972-9-9718607 (o)
Software. Eng,  Storage group
Voltaire - The Grid Backbone
 www.voltaire.com

  
From mst at dev.mellanox.co.il  Mon Jul  2 23:00:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 09:00:49 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>
Message-ID: <20070703060049.GF1147@mellanox.co.il>


> we should move to UC

For HW that supports UC with SRQ, yes.

-- 
MST


From mst at dev.mellanox.co.il  Mon Jul  2 23:10:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 09:10:29 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <46895A18.2000100@ichips.intel.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
Message-ID: <20070703061029.GG1147@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> >So we must send something that will force remote side to respond. One such
> >message is LAP with current primary path used as proposed alternate path.
> >Remote will respond with APR with AP status 5 if the connection is there, 
> >and
> >status 1 if it is not.
> 
> I didn't follow this.  Is this just an out of band keep alive message? 

Yes. Exactly.

> Why not use DREQ to indicate that the connection went away under normal 
> circumstances,

Yes, clearly we do this.
Keepalives cover the failure cases: remote is down, or has rebooted,
or all DREQs were lost, etc ...

> and a send failure in an abnormal termination case?

What do you mean by "send failure"? Completion with error?
We only get these with RC, not with UC.

-- 
MST


From ogerlitz at voltaire.com  Tue Jul  3 01:50:52 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 3 Jul 2007 11:50:52 +0300 (IDT)
Subject: [ofa-general] consumer data buffer ownership for inline sends
Message-ID: <Pine.LNX.4.64.0707031144130.15147@zuben>

Hi Roland,

Looking at mthca_arbel_post_send / mthca_tavor_post_send in libmthca,
we see that the inline code copies the data into the library's WQE buffer.

Does this mean that for inline sends, once ibv_post_send returns,
the consumer owns the data buffer associated with this send again?

Can this be stated as the official policy of libibverbs?

Or.


From ogerlitz at voltaire.com  Tue Jul  3 01:56:07 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 11:56:07 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703061029.GG1147@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
Message-ID: <468A0F27.3020909@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Sean Hefty <mshefty at ichips.intel.com>:
>> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
>>
>>> So we must send something that will force remote side to respond. One such
>>> message is LAP with current primary path used as proposed alternate path.
>>> Remote will respond with APR with AP status 5 if the connection is there, 
>>> and
>>> status 1 if it is not.
>> I didn't follow this.  Is this just an out of band keep alive message? 
> 
> Yes. Exactly.

Michael,

You may know that for each neighbour, the Linux network stack sends a
--unicast-- ARP probe every m jiffies; if there is no ARP reply after
n jiffies, it sends a broadcast ARP.

The default values are m=30*HZ and n=30*HZ, but you can change them;
see net.ipv4.neigh.default.{gc_interval,gc_stale_time}.

My understanding is that this solves everything, with no need for keepalives.

Am I missing anything here?

Or.


From mst at dev.mellanox.co.il  Tue Jul  3 02:16:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 12:16:39 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A0F27.3020909@voltaire.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
Message-ID: <20070703091639.GJ1147@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> Michael S. Tsirkin wrote:
> >>Quoting Sean Hefty <mshefty at ichips.intel.com>:
> >>Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> >>
> >>>So we must send something that will force remote side to respond. One 
> >>>such
> >>>message is LAP with current primary path used as proposed alternate path.
> >>>Remote will respond with APR with AP status 5 if the connection is 
> >>>there, and
> >>>status 1 if it is not.
> >>I didn't follow this.  Is this just an out of band keep alive message? 
> >
> >Yes. Exactly.
> 
> Michael,
> 
> You may know that for each neighbour, the Linux network stack sends 
> every m jiffies a --unicast-- ARP probe, where after n jiffies there is 
> no ARP reply, it sends a broadcast ARP.
> 
> The default values are m=30*HZ and n=30*HZ, but you can change them,
> its net.ipv4.neigh.default{gc_interval,gc_stale_time}
> 
> My understanding it that it solves everything, no need for keep alives
> 
> Do I missing anything here?

How does this solve the problem?
If the remote side has lost the connection, unicast ARPs will get dropped
but broadcast ARPs will get answered to. We'd need to re-create the connection
if this happens - but is there a way to detect this?

-- 
MST


From ogerlitz at voltaire.com  Tue Jul  3 02:42:01 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 12:42:01 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703091639.GJ1147@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
Message-ID: <468A19E9.2090707@voltaire.com>

Michael S. Tsirkin wrote:

>>>> I didn't follow this.  Is this just an out of band keep alive message? 

>>> Yes. Exactly.

>> You may know that for each neighbour, the Linux network stack sends 
>> every m jiffies a --unicast-- ARP probe, where after n jiffies there is 
>> no ARP reply, it sends a broadcast ARP.

> How does this solve the problem?
> If the remote side has lost the connection, unicast ARPs will get dropped
> but broadcast ARPs will get answered to. We'd need to re-create the connection
> if this happens - but is there a way to detect this?

Yes, I know that there is a way to register for kernel-level neighbour
update events, so on each neighbour update IPoIB CM reconnects; plus
you can remove the fast-path memcmp we do today on the remote GUID, and
we are done :)

This is because it covers both cases in which the unicast ARP probe is
not replied to: either the --GID-- we have is not the correct one (e.g.
under an HA scheme) or the remote --QP-- is not what we think.

Or.


From ogerlitz at voltaire.com  Tue Jul  3 02:44:58 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 12:44:58 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703060049.GF1147@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>
	<20070703060049.GF1147@mellanox.co.il>
Message-ID: <468A1A9A.5060208@voltaire.com>

Michael S. Tsirkin wrote:
>> we should move to UC
> 
> For HW that supports UC with SRQ, yes.

Dror did not mention the HW; my understanding is that this aspect is 
fine. Now, assuming the need for a liveness protocol is behind us (and 
if not, it can be implemented as you suggested), the problem is narrowed 
down to having the FW support SRQ/UC.

Once this is in place, the IPoIB-CM/UC implementation can start; later, 
when the IBTA is done spec-ing it, it will no longer be non-compliant. 
Same as with the SRC: you don't wait for it to become a standard before 
doing the implementation.

Or.


From vlad at lists.openfabrics.org  Tue Jul  3 02:45:54 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue,  3 Jul 2007 02:45:54 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070703-0200 daily build status
Message-ID: <20070703094554.7DA48E6085E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From mst at dev.mellanox.co.il  Tue Jul  3 02:47:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 12:47:03 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A19E9.2090707@voltaire.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
Message-ID: <20070703094703.GA12153@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> Michael S. Tsirkin wrote:
> 
> >>>>I didn't follow this.  Is this just an out of band keep alive message? 
> 
> >>>Yes. Exactly.
> 
> >>You may know that for each neighbour, the Linux network stack sends 
> >>every m jiffies a --unicast-- ARP probe, where after n jiffies there is 
> >>no ARP reply, it sends a broadcast ARP.
> 
> >How does this solve the problem?
> >If the remote side has lost the connection, unicast ARPs will get dropped
> >but broadcast ARPs will get answered to. We'd need to re-create the 
> >connection
> >if this happens - but is there a way to detect this?
> 
> Yes, I know that there is a way to register for kernel level neighbour 
> update events, so on each neighbour update, ipoib cm reconnects, plus 
> you can remove the fast path memcmp we do today on the remote GUID, and 
> we're done :)
> 
> This is b/c it covers both the case that the unicast arp probe was not 
> replied either since the --GID-- we have is not the correct one (eg 
> under HA scheme) or that the remote --QP-- is not what we think.

In the typical case (remote side reboots) both the GID and the UD QPN stay the
same, so it seems there won't be any neighbour update, right?  If so, while
playing with neighbour update events might get us data path speed-up, it will
not solve the problem of detecting the connection is alive.
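
To make this concrete, here is a minimal sketch (an illustration only, not
ipoib code) of why a reboot can be invisible to the neighbour layer. It
assumes the 20-byte IPoIB link-layer address layout from RFC 4391 (one flags
byte, a 24-bit UD QPN, then the 16-byte GID). The helper names
ipoib_hwaddr_build and ipoib_hwaddr_changed are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20	/* 1 flags byte + 3 QPN bytes + 16 GID bytes */

/* Hypothetical helper: pack flags, UD QPN and GID into a link-layer address. */
static void ipoib_hwaddr_build(uint8_t out[IPOIB_HWADDR_LEN], uint8_t flags,
			       uint32_t ud_qpn, const uint8_t gid[16])
{
	assert(out != NULL && gid != NULL);
	out[0] = flags;
	out[1] = (ud_qpn >> 16) & 0xff;	/* QPN is 24 bits, big-endian */
	out[2] = (ud_qpn >> 8) & 0xff;
	out[3] = ud_qpn & 0xff;
	memcpy(out + 4, gid, 16);
}

/* Hypothetical helper: a neighbour update is only visible if the bytes differ. */
static int ipoib_hwaddr_changed(const uint8_t *old_addr, const uint8_t *cur_addr)
{
	return memcmp(old_addr, cur_addr, IPOIB_HWADDR_LEN) != 0;
}
```

If the remote side comes back with the same GID and the same UD QPN, the two
addresses compare equal, so nothing at this level tells us that the connected
QP behind them is gone.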


-- 
MST


From ogerlitz at voltaire.com  Tue Jul  3 02:55:48 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 12:55:48 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703094703.GA12153@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>	<20070703091639.GJ1147@mellanox.co.il>	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
Message-ID: <468A1D24.6060903@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:

>> Yes, I know that there is a way to register for kernel level neighbour 
>> update events, so on each neighbour update, ipoib cm reconnects, plus 
>> you can remove the fast path memcmp we do today on the remote GUID, and 
>> we're done :)

> In the typical case (remote side reboots) both the GID and the UD QPN stay the
> same, so it seems there won't be any neighbour update, right?  If so, while
> playing with neighbour update events might get us data path speed-up, it will
> not solve the problem of detecting the connection is alive.

I don't think we should give up here. First, there might be a way (an 
event), and if not, let's change the kernel :), to know that the 
neighbour subsystem issued a broadcast ARP for a neighbour. Second, let 
me think...

What did the people who wrote the RFC say about the need for, and 
implementation of, a liveness protocol?

Or.


From ogerlitz at voltaire.com  Tue Jul  3 03:29:47 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 13:29:47 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A1D24.6060903@voltaire.com>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>	<20070703091639.GJ1147@mellanox.co.il>	<468A19E9.2090707@voltaire.com>	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com>
Message-ID: <468A251B.70901@voltaire.com>

Or Gerlitz wrote:
> Michael S. Tsirkin wrote:

>> In the typical case (remote side reboots) both the GID and the UD QPN 
>> stay the same, so it seems there won't be any neighbour update, right?  If so, 
>> while playing with neighbour update events might get us data path speed-up, 
>> it will not solve the problem of detecting the connection is alive.

> Second, let me think...

OK, if IPoIB-CM were using bi-directional connections, the problem would 
be solved, since the remote side re-connects (to send the ARP reply) and 
either the CM or IPoIB-CM (the CM consumer) invalidates the existing 
connection.

Also with uni-directional connections, when the remote side re-connects 
to us, it can put its RX QPN in the private data (or 0 if it has none). 
The ipoib-cm CM callback can compare this QPN against what it knows 
about the remote and, if it's different, re-connect. This can be 
further simplified, but let's first take it high-level.
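
A minimal sketch of that comparison, under stated assumptions: struct
peer_state and peer_needs_reconnect are hypothetical names, not existing
ipoib-cm code, and carrying the RX QPN in the REQ private data is the
proposal above, not something the current spec defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QPN_MASK 0xffffffu	/* InfiniBand QPNs are 24 bits wide */

/* Hypothetical per-remote state kept by the sender side. */
struct peer_state {
	uint32_t known_rx_qpn;	/* RX QPN we currently send to */
};

/*
 * Called from the CM callback when the remote connects to us.
 * advertised_rx_qpn is the value it placed in the private data
 * (0 if it has no RX QP for us). A mismatch means our TX
 * connection points at a stale QP and should be re-created.
 */
static bool peer_needs_reconnect(const struct peer_state *peer,
				 uint32_t advertised_rx_qpn)
{
	assert(peer != NULL);
	return (peer->known_rx_qpn & QPN_MASK) !=
	       (advertised_rx_qpn & QPN_MASK);
}
```

After a remote reboot the advertised value would be 0 (or a fresh QPN), which
differs from the one we hold, so the check fires exactly in the case the
neighbour update events miss.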

Can you remind me what --the-- reasoning for uni-directional 
connections was?

Or.


From mst at dev.mellanox.co.il  Tue Jul  3 03:36:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 13:36:27 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A1D24.6060903@voltaire.com>
References: <20070702145328.GC17858@mellanox.co.il>
	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com>
Message-ID: <20070703103627.GB12153@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> Michael S. Tsirkin wrote:
> >>Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> 
> >>Yes, I know that there is a way to register for kernel level neighbour 
> >>update events, so on each neighbour update, ipoib cm reconnects, plus 
> >>you can remove the fast path memcmp we do today on the remote GUID, and 
> >>we're done :)
> 
> >In the typical case (remote side reboots) both the GID and the UD QPN stay 
> >the
> >same, so it seems there won't be any neighbour update, right?  If so, while
> >playing with neighbour update events might get us data path speed-up, it 
> >will
> >not solve the problem of detecting the connection is alive.
> 
> I don't think we should give up here, first there might be a way (event) 
> and if not lets change the kernel :) to know that the neighbouring 
> subsystem issued a broadcast arp on a neighbour.
> Second, let me think...

Frankly, I like the idea of using our own keepalive better: it will also
work if we have e.g. multiple connections per neighbour.

> What did the people who wrote the RFC say about the need / 
> implementation of liveness protocol?

That it's a general IB problem and should be addressed at IB level.
Which it seems to be - with CM.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul  3 04:00:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 14:00:03 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A251B.70901@voltaire.com>
References: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>
	<20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com> <468A251B.70901@voltaire.com>
Message-ID: <20070703110003.GC12153@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> Or Gerlitz wrote:
> >Michael S. Tsirkin wrote:
> 
> >>In the typical case (remote side reboots) both the GID and the UD QPN 
> >>stay the same, so it seems there won't be any neighbour update, right?  
> >>If so, while playing with neighbour update events might get us data path 
> >>speed-up, it will not solve the problem of detecting the connection is 
> >>alive.
> 
> >Second, let me think...

I don't see why you are trying to get rid of keepalives.
With RC we currently have an arbitrary ACK timeout, and this
is no different, and quite easy to implement.

> OK, if IPoIB-CM was using bi-directional connection, problem is solved, 
> since the remote side re-connects (to send the ARP reply) and either the 
> CM or IPoIB-CM the CM consumer invalidates the existing connection.

Why should this invalidate the existing connection?
IMO killing a connection simply because remote connected wouldn't
be spec compliant: spec allows multiple connections to a single
host, and it's easy to imagine a setup where this will be
useful e.g. for performance reasons
(I actually have such a project on my todo list).

> Also with uni-directional connections, when the remote side re-connects 
> to us, it can put in the private data its RX QPN (or 0 if there's no 
> such). The ipoib-cm CM callback can compare this QPN against what it 
> knows on the remote and if its different, re-connect. This can be 
> further simplified, but lets first take it high-level.

What if remote already has a connection to us?
Anyway, this is clearly outside the existing spec.

> Can you remind me what was --the-- reasoning for uni directional 
> connections?

Lots of reasons. Simplicity of implementation. Solution to tricky dead/livelock
scenarios with crossing connection requests.  Fault containment.  Ability to
extend to multiple connections per host in the future.

It just looks like a good idea.

-- 
MST


From ogerlitz at voltaire.com  Tue Jul  3 04:07:14 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 14:07:14 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703110003.GC12153@mellanox.co.il>
References: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>	<20070703091639.GJ1147@mellanox.co.il>	<468A19E9.2090707@voltaire.com>	<20070703094703.GA12153@mellanox.co.il>	<468A1D24.6060903@voltaire.com>
	<468A251B.70901@voltaire.com>
	<20070703110003.GC12153@mellanox.co.il>
Message-ID: <468A2DE2.9040702@voltaire.com>

Michael S. Tsirkin wrote:

> I don't see why you are trying to get rid of keepalives.
> With RC we currently have an arbitrary ACK timeout, and this
> is no different, and quite easy to implement.

Because we agree (?) that RC is bad for IPoIB-CM, and I want to find a 
way for a UC-based implementation to avoid a dedicated keep-alive 
protocol.

As for all your other comments, I need to think more and will get back 
to them later this week.

Or.


From mst at dev.mellanox.co.il  Tue Jul  3 04:16:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 14:16:33 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A2DE2.9040702@voltaire.com>
References: <46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com> <468A251B.70901@voltaire.com>
	<20070703110003.GC12153@mellanox.co.il>
	<468A2DE2.9040702@voltaire.com>
Message-ID: <20070703111633.GE12153@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> Michael S. Tsirkin wrote:
> 
> >I don't see why you are trying to get rid of keepalives.
> >With RC we currently have an arbitrary ACK timeout, and this
> >is no different, and quite easy to implement.
> 
> Since we agree (?) that RC is bad for IPoIB-CM and I want to find a way 
> for a UC based implementation to avoid implementing a dedicated keep 
> alive protocol.
> 
> As for all your other comments, I need to think more, will get back to 
> it later this week.

Not sure it's worth the effort: just scanning the list of active connections
once in a while and sending a LAP message seems easy enough.
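
Roughly, the periodic scan could look like the sketch below. This is only an
illustration: struct conn, keepalive_scan, keepalive_reply and the threshold
of 3 missed probes are assumptions made for the example, and send_probe is a
hypothetical hook standing in for whatever actually sends the LAP.

```c
#include <assert.h>
#include <stddef.h>

#define KEEPALIVE_MAX_MISSED 3	/* assumed threshold, not from the spec */

struct conn {
	int alive;		/* 1 while the connection is considered usable */
	unsigned int missed;	/* probes sent since the last reply */
};

/* Hypothetical hook that sends one out-of-band probe (e.g. a LAP). */
typedef void (*probe_fn)(struct conn *c);

/* One periodic pass; returns how many connections were declared dead. */
static size_t keepalive_scan(struct conn *conns, size_t n, probe_fn send_probe)
{
	size_t dead = 0;

	assert(conns != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct conn *c = &conns[i];

		if (!c->alive)
			continue;
		if (c->missed >= KEEPALIVE_MAX_MISSED) {
			c->alive = 0;	/* caller would tear down and re-connect */
			dead++;
			continue;
		}
		c->missed++;
		if (send_probe)
			send_probe(c);
	}
	return dead;
}

/* Call when a reply to the probe (e.g. an APR) arrives for this connection. */
static void keepalive_reply(struct conn *c)
{
	c->missed = 0;
}
```

The scan interval times the threshold plays the same role as the arbitrary
ACK timeout we have with RC today.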

-- 
MST


From ogerlitz at voltaire.com  Tue Jul  3 04:41:59 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 14:41:59 +0300
Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support
In-Reply-To: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com>
References: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com>
Message-ID: <468A3607.8090008@voltaire.com>

Sean Hefty wrote:

> +static void inform_event_handler(struct ib_event_handler *handler,
> +				struct ib_event *event)
> +{
> +	struct inform_device *dev;
> +
> +	dev = container_of(handler, struct inform_device, event_handler);
> +
> +	switch (event->event) {
> +	case IB_EVENT_PORT_ERR:
> +	case IB_EVENT_LID_CHANGE:
> +	case IB_EVENT_SM_CHANGE:
> +	case IB_EVENT_CLIENT_REREGISTER:
> +		inform_groups_lost(&dev->port[event->element.port_num -
> +					      dev->start_port]);

I think you want to act here only if event->element.port_num is the port 
this inform_device is associated with (similar to IPoIB). The same goes 
for mcast_event_handler.

Or.


From ogerlitz at voltaire.com  Tue Jul  3 05:26:01 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 03 Jul 2007 15:26:01 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix handling of duplicate SIDR REQs
In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
References: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
Message-ID: <468A4059.8060809@voltaire.com>

Sean Hefty wrote:
> Fix handling to duplicate SIDR REQs to avoid sending a reject if
> one is detected.  Duplicates should simply be discarded.

Hi Sean,

Thanks for the fast (as usual...) patches, I am not sure I will be able 
to test it today, will let you know by tomorrow.

Or.


From tziporet at mellanox.co.il  Tue Jul  3 08:03:50 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 3 Jul 2007 18:03:50 +0300
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>


Meeting minutes are available also on OFA Wiki:
https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007

Abbreviated minutes / summary
    * OFED 1.2.1 support release - we plan a support release at the
      beginning of August.
    * OFED 1.3 - decided that it's most important to close the schedule
      and focus on the most important features based on this schedule
          o Based on discussion in the meeting it seems that the best
            target for OFED 1.3 is November 07

    * Most important features (from representatives who participated in
      the meeting)
          o Voltaire: ConnectX stable release
          o IBM - IPOIB CM without SRQ
          o Qlogic: Package convenient for distros; ConnectX stable
          o iWARP: Chelsio: Get to GA level and NFSoRDMA integration.
            NetEffect: Get the drivers into OFED
          o Mellanox: ConnectX stable release; new package; QoS

Action Items:
1. Other EWG members (Cisco, Intel, Labs) - send most important
   features for 1.3
2. Tziporet - set a meeting with Redhat & Novell to close the new
   package definition
3. Tziporet - publish OFED 1.3 schedule
4. MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 -
   are there any specific requests toward SC07?

Detailed Minutes
    * OFED 1.2.1 release:
          o Companies that are most interested in such a release (e.g.
            IBM, Chelsio) will do most of the testing for their HW.
          o Not all companies are committed to QA this release, so we
            will mention this limitation in the release notes.
          o There are weekly builds of the OFED 1.2 branch. Any other
            build should be requested from Vlad.

    * OFED 1.2.c:
          o All agree it's important to have this code stream, and why
            it cannot be the same as 1.2, and that we cannot wait for
            1.3.
          o There are companies that are currently using this code
            stream, and this will prevent them from participating in
            QA of 1.2.1

    * OFED 1.3:
          o There was a discussion on whether we wish to have the
            release in November 07 or January 08 (all agreed that
            December is not a good month)
          o Decision was to reduce features and have a release this
            year => November
          o There were no participants from the labs or MPI, thus we
            lack information on important features that should be
            ready for SC07


Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380


From tziporet at dev.mellanox.co.il  Tue Jul  3 08:28:11 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 03 Jul 2007 18:28:11 +0300
Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ
	work requests
In-Reply-To: <adak5ti6sxh.fsf@cisco.com>
References: <200706211201.58440.jackm@dev.mellanox.co.il>
	<adak5ti6sxh.fsf@cisco.com>
Message-ID: <468A6B0B.4070703@mellanox.co.il>

Roland Dreier wrote:
> I trust you guys on this, but have you thought about whether blueflame
> makes sense for RDMA read requests?  After all, an RDMA read requires
> the responder to send potentially a large amount of data to complete,
> and even for small requests I would think that latency-sensitive apps
> would avoid it.  Is there an MPI implementation or other app that you
> know of where this really helps?
>
>  
You can run the ib_read_lat test and see the latency improvement for 
each message size.
And we have customers for whom this improvement is important.

Tziporet


From panda at cse.ohio-state.edu  Tue Jul  3 08:30:26 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Tue, 3 Jul 2007 11:30:26 -0400 (EDT)
Subject: [ofa-general] Re: [ewg] OFED July 2,
	meeting summary on next OFED plans
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> from
	"Tziporet Koren" at Jul 03, 2007 06:03:50 PM
Message-ID: <200707031530.l63FUQtj027641@xi.cse.ohio-state.edu>

Tziporet, 

I was travelling last week and yesterday, so I could neither send a
reply before the conference call nor attend it.

> Action Items:
> 4.	MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 -
> is there any specific requests toward SC07

We plan to have MVAPICH 1.0 and MVAPICH2 1.0 for OFED 1.3. As Shaun
indicated during yesterday's call, we are working on MVAPICH2 1.0 with
a set of new features and plan to release it in the near future. This can
definitely be included in OFED 1.3. We have also started working on
MVAPICH 1.0. Depending on the feature freeze date for OFED 1.3, we can
finalize the feature list for MVAPICH 1.0.

Thanks, 

DK


From mshefty at ichips.intel.com  Tue Jul  3 10:02:17 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jul 2007 10:02:17 -0700
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703103627.GB12153@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>	<20070703091639.GJ1147@mellanox.co.il>	<468A19E9.2090707@voltaire.com>	<20070703094703.GA12153@mellanox.co.il>	<468A1D24.6060903@voltaire.com>
	<20070703103627.GB12153@mellanox.co.il>
Message-ID: <468A8119.5070104@ichips.intel.com>

> That it's a general IB problem and should be addressed at IB level.
> Which it seems to be - with CM.

I understand the simplicity that using LAP for an out-of-band keep alive 
message can give you, but that's not the intent of the message.  (You 
could also use REQ/REJ or SIDR REQ/SIDR REP messages for this carrying 
the right private data...)

If we don't want to require apps to send in-band keep alive messages, 
then I think we should explore all potential out-of-band solutions.  For 
example, event registration could be used to detect that a remote node 
has gone down.  We could use per node keep alive messages, rather than 
per connection messages.  We could add a new out-of-band keep alive 
message.  Or we could clearly define that LAP is the preferred way for 
all connections to do keep alives.

- Sean


From ardavis at ichips.intel.com  Tue Jul  3 10:06:34 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 03 Jul 2007 10:06:34 -0700
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
Message-ID: <468A821A.10704@ichips.intel.com>

Tziporet Koren wrote:

>
> Meeting minutes are available also on OFA Wiki: 
> _https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007_
>
> *Abbreviated minutes / summary*
>
>     * OFED 1.2.1 support release - we plan a support release on
>       beginning of August.
>     * OFED 1.3 - decided that its most important to close the schedule
>       and focus on most important features based on this schedule
>           o Based on discussion in the meeting it seems that the best
>             target for OFED 1.3 is November 07
>
>     * Most important features (from representative who participated in
>       the meeting)
>           o Voltaire: ConnectX stable release
>           o IBM - IPOIB CM without SRQ
>           o Qlogic: Package convenient for distros; ConnectX stable
>           o iWARP: Chelsio: Get to GA level and NFSoRDMA integration.
>             NetEffect: Get the drivers into OFED
>           o Mellanox: ConnectX stable release; new package; QoS
>
Intel: uDAPL 2.0 with IB extensions,  installation/packaging, rdma_cm 
counters, performance manager


From mst at dev.mellanox.co.il  Tue Jul  3 10:23:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 20:23:12 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A8119.5070104@ichips.intel.com>
References: <20070702195314.GA31169@mellanox.co.il>
	<46895A18.2000100@ichips.intel.com>
	<20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com>
	<20070703103627.GB12153@mellanox.co.il>
	<468A8119.5070104@ichips.intel.com>
Message-ID: <20070703172312.GE22937@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> >That it's a general IB problem and should be addressed at IB level.
> >Which it seems to be - with CM.
> 
> I understand the simplicity that using LAP for an out-of-band keep alive 
> message can give you, but that's not the intent of the message.

I guess so - but even if the responder happens to do a modify QP as a result,
and erroneously responds with APR, that's not too bad.

> (You 
> could also use REQ/REJ or SIDR REQ/SIDR REP messages for this carrying 
> the right private data...)

Hmm, I don't see how REQ gives you data on existing connection. Further,
this would need a spec extension to define private data format then?
LAP trick works out of the box ...

> If we don't want to require apps to send in-band keep alive messages, 
> then I think we should explore all potential out-of-band solutions.

I actually think a single working solution is enough.
No need to explore all of them :).

> For 
> example, event registration could be used to detect that a remote node 
> has gone down.
> We could use per node keep alive messages, rather than 
> per connection messages.

No, these won't address cases such as DREQ timeout after remote
decides to close connection, without reboot.

> We could add a new out-of-band keep alive message.
> Or clearly define that LAP is the preferred way for all 
> connections to do keep alives.

Sure, someone might need to talk at IBTA about these clarifications.

-- 
MST


From sean.hefty at intel.com  Tue Jul  3 10:29:22 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 3 Jul 2007 10:29:22 -0700
Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support
In-Reply-To: <468A3607.8090008@voltaire.com>
Message-ID: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com>

>> +static void inform_event_handler(struct ib_event_handler *handler,
>> +				struct ib_event *event)
>> +{
>> +	struct inform_device *dev;
>> +
>> +	dev = container_of(handler, struct inform_device, event_handler);
>> +
>> +	switch (event->event) {
>> +	case IB_EVENT_PORT_ERR:
>> +	case IB_EVENT_LID_CHANGE:
>> +	case IB_EVENT_SM_CHANGE:
>> +	case IB_EVENT_CLIENT_REREGISTER:
>> +		inform_groups_lost(&dev->port[event->element.port_num -
>> +					      dev->start_port]);
>
>I think you want to act here only if event->element.port_num is the port
>this inform_device is associated with (similar to IPoIB), also the same
>for mcast_event_handler.

IPoIB registers its event handler per port, so requires the extra check.  Both
the multicast and inform info modules register their event handlers per device,
so the check isn't necessary.
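
As a small illustration of the difference (a sketch with made-up types, not
the code from the patch): a per-device handler owns every port of the device,
so it only has to map the port number onto its array, while a per-port
handler has to filter out events for other ports first. struct device_ctx,
device_port_for_event and port_event_is_mine are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct port_state {
	int groups_lost;
};

/* Hypothetical per-device context with one slot per port. */
struct device_ctx {
	unsigned int start_port;	/* first valid port number */
	unsigned int num_ports;
	struct port_state port[2];
};

/* Per-device registration: any port event on this device is ours. */
static struct port_state *device_port_for_event(struct device_ctx *dev,
						unsigned int port_num)
{
	assert(dev != NULL);
	assert(port_num >= dev->start_port &&
	       port_num - dev->start_port < dev->num_ports);
	return &dev->port[port_num - dev->start_port];
}

/* Per-port registration: events for sibling ports must be ignored. */
static int port_event_is_mine(unsigned int my_port, unsigned int event_port)
{
	return my_port == event_port;
}
```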

- Sean


From bob.kossey at hp.com  Tue Jul  3 10:29:19 2007
From: bob.kossey at hp.com (Bob Kossey)
Date: Tue, 03 Jul 2007 13:29:19 -0400
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
Message-ID: <468A876F.2030208@hp.com>

Tziporet Koren wrote:

>
>     * Most important features (from representative who participated in
>       the meeting)
>           o Voltaire: ConnectX stable release
>           o IBM - IPOIB CM without SRQ
>           o Qlogic: Package convenient for distros; ConnectX stable
>           o iWARP: Chelsio: Get to GA level and NFSoRDMA integration.
>             NetEffect: Get the drivers into OFED
>           o Mellanox: ConnectX stable release; new package; QoS
>

HP: Full ConnectX support, installation/packaging changes,
Perfmon independent of OpenSM.

Bob


From halr at voltaire.com  Tue Jul  3 10:33:19 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jul 2007 13:33:19 -0400
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <468A876F.2030208@hp.com>
References: <468A876F.2030208@hp.com>
Message-ID: <1183483998.4377.254391.camel@hal.voltaire.com>

On Tue, 2007-07-03 at 13:29, Bob Kossey wrote:
> Tziporet Koren wrote:
> 
> >
> >     * Most important features (from representative who participated in
> >       the meeting)
> >           o Voltaire: ConnectX stable release
> >           o IBM - IPOIB CM without SRQ
> >           o Qlogic: Package convenient for distros; ConnectX stable
> >           o iWARP: Chelsio: Get to GA level and NFSoRDMA integration.
> >             NetEffect: Get the drivers into OFED
> >           o Mellanox: ConnectX stable release; new package; QoS
> >
> 
> HP: Full ConnectX support, installation/packaging changes,
> Perfmon independent of OpenSM.

What do you mean by "Perfmon independent of OpenSM" ?

-- Hal

> Bob
> 


From bob.kossey at hp.com  Tue Jul  3 10:53:39 2007
From: bob.kossey at hp.com (Bob Kossey)
Date: Tue, 03 Jul 2007 13:53:39 -0400
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <1183483998.4377.254391.camel@hal.voltaire.com>
References: <468A876F.2030208@hp.com>
	<1183483998.4377.254391.camel@hal.voltaire.com>
Message-ID: <468A8D23.2020007@hp.com>

Hal Rosenstock wrote:
>
>>
>> HP: Full ConnectX support, installation/packaging changes,
>> Perfmon independent of OpenSM.
>>     
>
> What do you mean by "Perfmon independent of OpenSM" ?
>
> -- Hal
>
>   
I interpreted from the exchange below that there is a dependency between
PerfMgr and OpenSM.  If that is not accurate, or if it will be 
eliminated, great.

On Thu, 2007-06-28 at 03:24, Eitan Zahavi wrote:
> > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > In the last months it is the second time I hear people 
> > > complaining the 
> > > current monitoring solution in OFA is  integrated with OpenSM.
> > 
> > I must have missed this both times (didn't see this in Mark's 
> > post) and the statement itself is somewhat inaccurate as well.
>
> Private talks - I hope they will speak up for themselves now...

Please encourage them to do so.
 
> > > These people do not use OpenSM but do use OFED.
> > 
> > I'm not sure I'm following what you mean here.
> > 
> > If you mean that some people want to run PerfMgr without the 
> > SM/SA aspects (so that they can run a vendor based SM), that 
> > is the next thing we are adding to the implementation.
> Exactly. OK when is that coming?

Should be part of OFED 1.3.


From norman.woo at oracle.com  Tue Jul  3 10:55:41 2007
From: norman.woo at oracle.com (Norman Woo)
Date: Tue, 03 Jul 2007 10:55:41 -0700
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
Message-ID: <468A8D9D.2040709@oracle.com>

The proposed feature list for OFED 1.3 sent out on 6/26/2007 included 
Asynch I/O for SDP. Is this feature now being dropped from the 1.3 
release?  Oracle has spent considerable effort to support SDP in our 
products and Oracle proposes that Asynch I/O for SDP be included in 
OFED 1.3.  What is required to include this feature in OFED 1.3?

Regards,
Norman

Tziporet Koren wrote:
>
> Meeting minutes are available also on OFA Wiki: 
> https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007
>
> *Abbreviated minutes / summary*
>
>     * OFED 1.2.1 support release - we plan a support release at the
>       beginning of August.
>     * OFED 1.3 - decided that it's most important to close the schedule
>       and focus on the most important features based on this schedule
>           o Based on discussion in the meeting it seems that the best
>             target for OFED 1.3 is November 07
>
>     * Most important features (from representative who participated in
>       the meeting)
>           o Voltaire: ConnectX stable release
>           o IBM - IPOIB CM without SRQ
>           o Qlogic: Package convenient for distros; ConnectX stable
>           o iWARP: Chelsio: Get to GA level and NFSoRDMA integration.
>             NetEffect: Get the drivers into OFED
>           o Mellanox: ConnectX stable release; new package; QoS
>
> *Action Items:*
>
>    1. Other EWG members (Cisco, Intel, Labs) - send most important
>       features for 1.3
>    2. Tziporet - set a meeting with Redhat & Novell to close the new
>       package definition
>    3. Tziporet - publish OFED 1.3 schedule
>    4. MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 -
>       are there any specific requests toward SC07?
>
> *Detailed Minutes*
>
>     * OFED 1.2.1 release:
>           o Companies that are mostly interested in such release (e.g.
>             IBM, Chelsio) will do most of testing for their HW.
>           o Not all companies are committed to QA this release, so in
>             the release notes we will mention this limitation.
>           o There are weekly builds of OFED 1.2 branch. Any other
>             build should be requested from Vlad.
>
>     * OFED 1.2.c:
>           o All agree it's important to have this code stream, and why
>             it cannot be the same as 1.2, and that we cannot wait for
>             1.3.
>           o There are companies that are currently using this code
>             stream and this will prevent them from participating in QA
>             of 1.2.1
>
>     * OFED 1.3:
>           o There was a discussion if we wish to have the release on
>             November 07 or January 08 (all agreed that December is not
>             a good month)
>           o Decision was to reduce features and have a release this
>             year => November
>           o There were no participants from the labs or MPI thus we
>             lack information on important features that should be
>             ready for SC07
>
>
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
>


From mshefty at ichips.intel.com  Tue Jul  3 11:14:23 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jul 2007 11:14:23 -0700
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703172312.GE22937@mellanox.co.il>
References: <20070702195314.GA31169@mellanox.co.il>	<46895A18.2000100@ichips.intel.com>	<20070703061029.GG1147@mellanox.co.il>	<468A0F27.3020909@voltaire.com>	<20070703091639.GJ1147@mellanox.co.il>	<468A19E9.2090707@voltaire.com>	<20070703094703.GA12153@mellanox.co.il>	<468A1D24.6060903@voltaire.com>	<20070703103627.GB12153@mellanox.co.il>	<468A8119.5070104@ichips.intel.com>
	<20070703172312.GE22937@mellanox.co.il>
Message-ID: <468A91FF.3040804@ichips.intel.com>

> Hmm, I don't see how REQ gives you data on existing connection. Further,
> this would need a spec extension to define private data format then?
> LAP trick works out of the box ...

LAP keep-alives require the apps to implement the keep-alive timers and 
detection, but send the messages out-of-band.  Why not send the 
messages in-band?  Would it make more sense to implement the entire 
keep-alive solution in the CM?

> I actually think a single working solution is enough.
> No need to explore all of them :).

I'm not saying implement all of them, just make sure that we have the 
best solution.  I can't think of one that I like better than using LAP, 
but it feels like the CM protocol / MADs are being hijacked.  For 
example, if there's only one path between two nodes, LAP doesn't really 
make any sense, but it ends up being used.  Should we instead look at 
adding new CM messages for just this purpose?

>> For 
>> example, event registration could be used to detect that a remote node 
>> has gone down.
>> We could use per node keep alive messages, rather than 
>> per connection messages.
> 
> No, these won't address cases such as DREQ timeout after remote
> decides to close connection, without reboot.

Per node keep alive messages could.  It depends on what data is carried 
in the message (e.g. all currently connected QPs to the node in 
question).  I mentioned this because it may be more efficient under some 
circumstances.

- Sean
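
A minimal sketch of the per-node keep-alive idea above, assuming the
message carries the set of QPs the sender still considers connected to
the receiver. The payload layout and every name here (NodeKeepAlive,
stale_connections, local_view) are hypothetical and exist only for
illustration; this is not a defined CM or MAD format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeKeepAlive:
    """Hypothetical per-node keep-alive payload (not a defined MAD format)."""
    sender: str                # identifier of the node that sent the message
    connected_qpns: frozenset  # QP numbers the sender believes are connected to us


def stale_connections(local_view, msg):
    """Return local QPNs connected to msg.sender that the sender no longer lists.

    local_view maps a local QP number to (remote node, remote QP number).
    """
    return sorted(
        local_qpn
        for local_qpn, (remote, remote_qpn) in local_view.items()
        if remote == msg.sender and remote_qpn not in msg.connected_qpns
    )


# nodeB closed its QP 72 (for example, the DREQ was lost) but is still up.
local_view = {10: ("nodeB", 71), 11: ("nodeB", 72), 12: ("nodeC", 90)}
msg = NodeKeepAlive(sender="nodeB", connected_qpns=frozenset({71}))
print(stale_connections(local_view, msg))  # local QP 11 is reported as stale
```

One message per node covers every connection to that node, which is
where the possible efficiency gain over per-connection messages would
come from.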


From tziporet at dev.mellanox.co.il  Tue Jul  3 11:36:11 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 03 Jul 2007 21:36:11 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A1A9A.5060208@voltaire.com>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>	<20070703060049.GF1147@mellanox.co.il>
	<468A1A9A.5060208@voltaire.com>
Message-ID: <468A971B.5090507@mellanox.co.il>

Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>>> we should move to UC
>>
>> For HW that supports UC with SRQ, yes.
>
> Dror did not mention the HW, my understanding is that this aspect is 
> fine... now, assuming the need for liveness protocol is behind us, and 
> if not, it can be implemented as you suggested. the problem is  
> narrowed to have the FW support SRQ/UC.
We still don't have a solid plan for this in the FW

Tziporet


From mst at dev.mellanox.co.il  Tue Jul  3 11:37:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 21:37:03 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A91FF.3040804@ichips.intel.com>
References: <20070703061029.GG1147@mellanox.co.il>
	<468A0F27.3020909@voltaire.com>
	<20070703091639.GJ1147@mellanox.co.il>
	<468A19E9.2090707@voltaire.com>
	<20070703094703.GA12153@mellanox.co.il>
	<468A1D24.6060903@voltaire.com>
	<20070703103627.GB12153@mellanox.co.il>
	<468A8119.5070104@ichips.intel.com>
	<20070703172312.GE22937@mellanox.co.il>
	<468A91FF.3040804@ichips.intel.com>
Message-ID: <20070703183703.GG22937@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> >Hmm, I don't see how REQ gives you data on existing connection. Further,
> >this would need a spec extension to define private data format then?
> >LAP trick works out of the box ...
> 
> LAP keep-alives require the apps to implement the keep-alive timers and 
> detection, but send the messages out-of-band.  Why not send the 
> messages in-band?

Sure, this can be done. But that'd need ULP support, in this case IPoIB protocol
extension.  Further, if remote is up, it's nice to get a CM message saying
"connection was lost" directly rather than just a timeout.
What real advantages are there for doing this "in-band" as you say?

> Would it make more sense to implement the entire 
> keep-alive solution in the CM?

I think it doesn't matter much. Let's keep it where it's needed:
if more UC applications surface, we can rethink this decision,
and factor the code out.

> >I actually think a single working solution is enough.
> >No need to explore all of them :).
> 
> I'm not saying implement all of them, just make sure that we have the 
> best solution.  I can't think of one that I like better than using LAP, 
> but it feels like the CM protocol / MADs are being hijacked.  For 
> example, if there's only one path between two nodes, LAP doesn't really 
> make any sense, but it ends up being used.  Should we instead look at 
> adding new CM messages for just this purpose?

Sure, I agree, this would be nice. But I expect it will take a while
to get the standardization rolling. So I think we'll start with the LAP hack
and add support for the new CM message when/if it's there.

> >>For 
> >>example, event registration could be used to detect that a remote node 
> >>has gone down.
> >>We could use per node keep alive messages, rather than 
> >>per connection messages.
> >
> >No, these won't address cases such as DREQ timeout after remote
> >decides to close connection, without reboot.
> 
> Per node keep alive messages could.  It depends on what data is carried 
> in the message (e.g. all currently connected QPs to the node in 
> question).  I mentioned this because it may be more efficient under some 
> circumstances.

Yes. And with multiple connections per node, all the more so.
The CM message format does not seem like a good fit for this, though:
maybe some new kind of MAD?

-- 
MST


From halr at voltaire.com  Tue Jul  3 11:59:01 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Jul 2007 14:59:01 -0400
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <468A8D23.2020007@hp.com>
References: <468A876F.2030208@hp.com>
	<1183483998.4377.254391.camel@hal.voltaire.com>
	<468A8D23.2020007@hp.com>
Message-ID: <1183489140.4377.260402.camel@hal.voltaire.com>

On Tue, 2007-07-03 at 13:53, Bob Kossey wrote:
> Hal Rosenstock wrote:
> >
> >>
> >> HP: Full ConnectX support, installation/packaging changes,
> >> Perfmon independent of OpenSM.
> >>     
> >
> > What do you mean by "Perfmon independent of OpenSM" ?
> >
> > -- Hal
> >
> >   
> I interpreted from the exchange below that there is a dependency between
> PerfMgr and OpenSM.  If that is not accurate, or if it will be 
> eliminated, great.

PerfMgr will support running without the SM/SA function in OpenSM,
alongside a "third party" SM (meaning any standard (vendor) SM that is
IBA compliant, although the vendors/parties interested in this will
need to test to confirm that aspect). PerfMgr will, however, be part of
the OpenSM package. I hope this clarifies the current intent.

-- Hal

> On Thu, 2007-06-28 at 03:24, Eitan Zahavi wrote:
> > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > > In the last months it is the second time I hear people 
> > > > complaining the 
> > > > current monitoring solution in OFA is  integrated with OpenSM.
> > > 
> > > I must have missed this both times (didn't see this in Mark's 
> > > post) and the statement itself is somewhat inaccurate as well.
> >
> > Private talks - I hope they will speak up for themselves now...
>
> Please encourage them to do so.
>  
> > > > These people do not use OpenSM but do use OFED.
> > > 
> > > I'm not sure I'm following what you mean here.
> > > 
> > > If you mean that some people want to run PerfMgr without the 
> > > SM/SA aspects (so that they can run a vendor based SM), that 
> > > is the next thing we are adding to the implementation.
> > Exactly. OK when is that coming?
>
> Should be part of OFED 1.3.


From rdreier at cisco.com  Tue Jul  3 12:05:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 03 Jul 2007 12:05:52 -0700
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
	(Tziporet Koren's message of "Tue, 3 Jul 2007 18:03:50 +0300")
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
Message-ID: <adair915mun.fsf@cisco.com>

 > NFSoRDMA integration.

I would like to see a status report on NFS/RDMA from the people who
want it in OFED.  As I understand it there are many core kernel
changes required for this -- switchable transports and also mount
option changes?

As far as I can tell from the outside, the NFS/RDMA effort seems to
have stalled -- whenever I talk to core NFS developers like Chuck
Lever or Trond Myklebust, they say that they are just waiting for the
NFS/RDMA developers to submit their changes for review.  And I haven't
seen any patches for a kernel newer than 2.6.18, so things look quite
out-of-date.

Without visible progress towards getting NFS/RDMA into mergeable form
soon, I think putting it into OFED 1.3 as anything other than a
technology preview that may be dropped from future releases would be a
very risky thing to do.  Otherwise OFED risks getting stuck
maintaining the whole NFS/RDMA stack, since the development effort
outside of OFED really looks to me like it is fizzling out.

 - R.


From tziporet at dev.mellanox.co.il  Tue Jul  3 12:20:00 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 03 Jul 2007 22:20:00 +0300
Subject: [ewg] Re: [ofa-general] OFED July 2,	meeting summary on next
	OFED plans
In-Reply-To: <468A8D9D.2040709@oracle.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
	<468A8D9D.2040709@oracle.com>
Message-ID: <468AA160.9020609@mellanox.co.il>

Norman Woo wrote:
> The proposed feature list for OFED 1.3 sent out on 6/26/2007 included 
> Asynch I/O for SDP. Is this feature now being dropped from the 1.3 
> release?  Oracle has spent considerable effort to support SDP in our 
> products and Oracle proposes that Asynch I/O for SDP be included in 
> OFED 1.3.  What is required to include this feature in OFED 1.3?
>
SDP AIO is still on the list.
It's just that none of you participated in the meeting yesterday, and I 
only gathered input from people who were in the meeting.

I will publish the full feature list once all companies have returned 
their input.

Tziporet


From sean.hefty at intel.com  Tue Jul  3 12:29:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 3 Jul 2007 12:29:54 -0700
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703183703.GG22937@mellanox.co.il>
Message-ID: <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com>

>What real advantages are there for doing this "in-band" as you say?

Doing this in-band keeps the entire keep-alive protocol within the ULP.  It can
set the keep-alive message size and retry times.  LAP messages are fixed at 256
bytes, add additional traffic on QP 1, and retries are limited by the CM
protocol.  (Of course, new CM messages would have these same limits, so it's not
clear to me that creating new CM messages is a win.  New CM messages would
allow the CM itself to respond directly to keep-alives though.)

A couple of disadvantages are that broken connections take longer to detect if the
remote node is able to respond to the LAP, and the connection must be able to
send and receive.  (The latter calls for a general solution being out-of-band.)

>Sure, I agree, this would be nice. But I expect it will take a while
>to get the standardization rolling. So I think we'll start with the LAP hack
>and add support for the new CM message when/if it's there.

Okay - is there any real drawback to using LAP other than it 'feels' like a
mis-use of the CM protocol?

- Sean


From mst at dev.mellanox.co.il  Tue Jul  3 12:49:42 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 3 Jul 2007 22:49:42 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com>
References: <20070703183703.GG22937@mellanox.co.il>
	<000101c7bda8$88e91300$3c98070a@amr.corp.intel.com>
Message-ID: <20070703194942.GI22937@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [ofa-general] Re: Re: IPoIB-CM UC mode
> 
> >What real advantages are there for doing this "in-band" as you say?
> 
> Doing this in-band keeps the entire keep-alive protocol within the ULP.  It can
> set the keep-alive message size and retry times.
> LAP messages are fixed at 256
> bytes, add additional traffic on QP 1, and retries are limited by the CM
> protocol.

BTW, I think we might want to avoid retries altogether: if LAP
timed out, we can just re-create the connection.

> (Of course, new CM messages would have these same limits, so it's not
> clear to me that creating new CM messages is a win.  New CM messages would
> allow the CM itself to respond directly to keep-alives though.)

OTOH, using QP1 makes it easier to separate rare keepalives
from fast-path data packet receive path.

-- 
MST
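
A minimal sketch of the timer logic this implies, assuming the
application drives the keep-alive itself and re-creates the connection
once a probe goes unanswered. The class and method names are
hypothetical, and the caller passes in the current time, so nothing
here touches the real CM or LAP API. With max_retries=0 the first
unanswered probe already triggers a reconnect, which is the "avoid
retries altogether" policy.

```python
class KeepAliveMonitor:
    """Tracks one connection; the caller supplies the time, so no real timers are needed."""

    def __init__(self, interval, timeout, max_retries=0):
        self.interval = interval        # idle seconds between probes
        self.timeout = timeout          # seconds to wait for a reply to one probe
        self.max_retries = max_retries  # 0 = give up after the first unanswered probe
        self.next_probe = interval
        self.deadline = None            # set while a probe is outstanding
        self.retries = 0

    def on_reply(self, now):
        """The peer answered: clear the outstanding probe and schedule the next one."""
        self.deadline = None
        self.retries = 0
        self.next_probe = now + self.interval

    def poll(self, now):
        """Return 'idle', 'probe' (send a keep-alive now) or 'reconnect'."""
        if self.deadline is None:
            if now < self.next_probe:
                return "idle"
            self.deadline = now + self.timeout
            return "probe"
        if now < self.deadline:
            return "idle"
        if self.retries < self.max_retries:
            self.retries += 1
            self.deadline = now + self.timeout
            return "probe"
        return "reconnect"


m = KeepAliveMonitor(interval=10, timeout=2)
print(m.poll(10))   # probe
m.on_reply(11)
print(m.poll(21))   # probe
print(m.poll(23))   # reconnect: the probe sent at t=21 was never answered
```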


From swise at opengridcomputing.com  Tue Jul  3 13:35:13 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 03 Jul 2007 15:35:13 -0500
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <adair915mun.fsf@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
	<adair915mun.fsf@cisco.com>
Message-ID: <468AB301.4020807@opengridcomputing.com>

Tom, can you update us on NFS-RDMA?


Roland Dreier wrote:
>  > NFSoRDMA integration.
> 
> I would like to see a status report on NFS/RDMA from the people who
> want it in OFED.  As I understand it there are many core kernel
> changes required for this -- switchable transports and also mount
> option changes?
> 
> As far as I can tell from the outside, the NFS/RDMA effort seems to
> have stalled -- whenever I talk to core NFS developers like Chuck
> Lever or Trond Myklebust, they say that they are just waiting for the
> NFS/RDMA developers to submit their changes for review.  And I haven't
> seen any patches for a kernel newer than 2.6.18, so things look quite
> out-of-date.
> 
> Without visible progress towards getting NFS/RDMA into mergeable form
> soon, I think putting it into OFED 1.3 as anything other than a
> technology preview that may be dropped from future releases would be a
> very risky thing to do.  Otherwise OFED risks getting stuck
> maintaining the whole NFS/RDMA stack, since the development effort
> outside of OFED really looks to me like it is fizzling out.
> 
>  - R.


From tom at opengridcomputing.com  Tue Jul  3 13:54:29 2007
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 03 Jul 2007 15:54:29 -0500
Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans
In-Reply-To: <468AB301.4020807@opengridcomputing.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
	<adair915mun.fsf@cisco.com>  <468AB301.4020807@opengridcomputing.com>
Message-ID: <1183496069.5757.33.camel@trinity.ogc.int>

Roland:

On Tue, 2007-07-03 at 15:35 -0500, Steve Wise wrote:
> Tom, can you update us on NFS-RDMA?
> 
> 
> Roland Dreier wrote:
> >  > NFSoRDMA integration.
> > 
> > I would like to see a status report on NFS/RDMA from the people who
> > want it in OFED.  As I understand it there are many core kernel
> > changes required for this -- switchable transports and also mount
> > option changes?

You are correct about the scope of the changes, although many of them
are already in the kernel. Chuck Lever just posted the mount changes and
I have posted a second round of the NFS-RDMA patches. You can see these
on nfs at lists.sourceforge.net. I would like to get them upstream in
2.6.23, but that's probably optimistic. 

> > 
> > As far as I can tell from the outside, the NFS/RDMA effort seems to
> > have stalled -- whenever I talk to core NFS developers like Chuck
> > Lever or Trond Myklebust, they say that they are just waiting for the
> > NFS/RDMA developers to submit their changes for review.  And I haven't
> > seen any patches for a kernel newer than 2.6.18, so things look quite
> > out-of-date.

I'm not sure when you talked to those guys, but as I mentioned, this is
round-two of the patch submission. There is also a git tree that has
these submitted patches available for download and testing. These are on
a 2.6.22-rc6 base and the git URL is
git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git

If you like, I can post the patchset here as well.

> > 
> > Without visible progress towards getting NFS/RDMA into mergeable form
> > soon, I think putting it into OFED 1.3 as anything other than a
> > technology preview that may be dropped from future releases would be a
> > very risky thing to do.  Otherwise OFED risks getting stuck
> > maintaining the whole NFS/RDMA stack, since the development effort
> > outside of OFED really looks to me like it is fizzling out.
> > 

Perhaps the activity is not where you're used to looking. Both Trond and
Neal reviewed the previous patchset and provided feedback that I
addressed in the most recent patchset. That said, I'm sure there will be
quite a bit more before it's mergeable. 

> >  - R.
> 


From mst at dev.mellanox.co.il  Tue Jul  3 15:09:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 4 Jul 2007 01:09:03 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adaabue8qk1.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
	<20070702195927.GB31169@mellanox.co.il> <adaabue8qk1.fsf@cisco.com>
Message-ID: <20070703220903.GJ22937@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC] sharing userspace IB objects
> 
>  > Could you please clarify how do you envision this done?
>  > Do we just create our own filesystem?
>  > 
>  > Reason I ask, we'll need something like this for SRC domain too ...
> 
> I don't have a really clear idea.  "Look at spufs" is about as far as
> I got.

So, I guess we could create our own filesystem and then
let processes create files there to represent the src domain.

But, how to pass this file to the create_qp verb in kernel?
It needs to be done on the common context fd since we are using a regular CQ, a
UAR etc from the context.


-- 
MST


From elsen_david at hotmail.com  Tue Jul  3 15:42:35 2007
From: elsen_david at hotmail.com (david elsen)
Date: Tue, 03 Jul 2007 15:42:35 -0700
Subject: [ofa-general] Open Fabrics iWARP Driver for Chesio T3 card
In-Reply-To: <4683BDDC.5010309@opengridcomputing.com>
Message-ID: <BAY118-F26C10E221943D860E1BEBA9F0C0@phx.gbl>

Thanks a lot for the good information.


>From: Steve Wise <swise at opengridcomputing.com>
>To: SEGERS Koen <Koen.SEGERS at VRT.BE>
>CC: david elsen <elsen_david at hotmail.com>,  general at lists.openfabrics.org
>Subject: Re: [ofa-general] Open Fabrics iWARP Driver for Chesio T3 card
>Date: Thu, 28 Jun 2007 08:55:40 -0500
>
>SEGERS Koen wrote:
>>What is the benefit of using the iWARP driver? Do you offload the traffic 
>>coming from the cluster directly to the Chelsio card (RDMA directly to 
>>Chelsio)?
>>
>
>iWARP is a suite of standard protocols that implement RDMA over a TCP or 
>SCTP connection.  The  devices that support iWARP usually implement all of 
>these protocols (including TCP/IP/ethernet) in hardware.  The device 
>drivers for these devices plug into the Linux/OFA RDMA core and support the 
>Linux/OFA RDMA verbs which are mostly common between both IB and iWARP.
>
>So think of it as an RDMA transport that uses standard Ethernet and IP 
>technology.  There is no wire-level interoperability between IB and iWARP: 
>They are different L1-L4 protocol stacks below the RDMA API.  But _above_ 
>the RDMA API, you can have a single application use the Linux RDMA Verbs 
>interface and deploy that same application over both IB networks and IW 
>networks.
>
>Application/Middle-ware examples include MPI, iSCSI/iSER, and NFS-RDMA.
>
>>Would it be beneficial to have the iWARP driver installed on nodes that 
>>communicate with clients over IP and with other servers (of its cluster) 
>>over IB? We are now using SDP as an intercluster protocol, but in the 
>>future we are probably going to VERBS for it.
>>
>
>I'm not sure how you would utilize it in your setup.  But I don't 
>understand your cluster architecture to say for sure whether it might help 
>you or not.
>
>You might contact the iWARP providers directly to help understand if their 
>solutions can help you.  Also, there are other technologies that these 
>devices typically support that might be helpful for you.
>
>>Can we read the documentation on a website somewhere?
>>
>
>The iWARP Protocols are IETF IDs and RFCs that can be found at
>
>http://www.ietf.org/html.charters/rddp-charter.html
>
>There is other information on RDMA over TCP/IP at
>
>http://www.rdmaconsortium.org/home
>
>Hope this helps.
>
>Steve.
>



From mshefty at ichips.intel.com  Tue Jul  3 15:45:47 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 03 Jul 2007 15:45:47 -0700
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <20070703194942.GI22937@mellanox.co.il>
References: <20070703183703.GG22937@mellanox.co.il>	<000101c7bda8$88e91300$3c98070a@amr.corp.intel.com>
	<20070703194942.GI22937@mellanox.co.il>
Message-ID: <468AD19B.50004@ichips.intel.com>

>>> What real advantages are there for doing this "in-band" as you say?
>> Doing this in-band keeps the entire keep-alive protocol within the ULP.  It can
>> set the keep-alive message size and retry times.
>> LAP messages are fixed at 256
>> bytes, add additional traffic on QP 1, and retries are limited by the CM
>> protocol.
> 
> BTW, I think we might want to avoid retries altogether: if LAP
> timed out, we can just re-create the connection.

The CM currently retries LAP messages based on the value of the REQ max 
CM retries, but I don't see why this couldn't change.

>> (Of course, new CM messages would have these same limits, so it's not
>> clear to me that creating new CM messages is a win.  New CM messages would
>> allow the CM itself to respond directly to keep-alives though.)
> 
> OTOH, using QP1 makes it easier to separate rare keepalives
> from fast-path data packet receive path.

I was thinking more along the lines of whether to use the CM LAP message 
or create a new CM message for handling keep-alive.  The best argument I 
can come up with for creating a new message is that it 'seems' 
cleaner...  Anyway, I agree that using LAP would be the best approach 
for now.

- Sean
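
A rough worked example of how the retry count feeds into detection
time, assuming the usual InfiniBand timeout encoding of 4.096
microseconds times a power of two. The function names and the sample
values (10 second probe interval, exponent 20, 0 or 3 retries) are
hypothetical and not taken from the CM code.

```python
def ib_timeout_seconds(exponent):
    """IB-style timeout encoding: 4.096 microseconds * 2**exponent."""
    return 4.096e-6 * (2 ** exponent)


def worst_case_detection(interval, per_try_timeout, max_retries):
    """Longest time from a connection breaking to the sender giving up.

    Worst case: the break happens right after a successful probe, so a full
    idle interval passes before the next probe, which then times out
    (max_retries + 1) times.
    """
    return interval + (max_retries + 1) * per_try_timeout


per_try = ib_timeout_seconds(20)  # roughly 4.3 seconds
print(worst_case_detection(10, per_try, max_retries=0))  # roughly 14.3 seconds
print(worst_case_detection(10, per_try, max_retries=3))  # roughly 27.2 seconds
```

Dropping retries shortens the worst case by one per-try timeout for
each retry removed, at the cost of reconnecting on a single lost LAP.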


From jsquyres at cisco.com  Tue Jul  3 21:41:37 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 4 Jul 2007 06:41:37 +0200
Subject: [ofa-general] Re: [ewg] OFED July 2,
	meeting summary on next OFED plans
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>
Message-ID: <E7112EB8-2C3C-4A69-9F5D-75C32CD3BAB6@cisco.com>

On Jul 3, 2007, at 5:03 PM, Tziporet Koren wrote:

> MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 - is  
> there any specific requests toward SC07

The OMPI community is working on its plan for our next release (the  
OMPI v1.3 series).  We're roughly targeting SC/year-end, but an exact  
timetable has not yet been set.

I think that we'll probably do "the usual" -- take the latest stable  
drop of OMPI as we approach OFED v1.3.

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Tue Jul  3 21:44:20 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 4 Jul 2007 06:44:20 +0200
Subject: [ofa-general] Feedback on mpi-selector / mpi-selector-menu
Message-ID: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com>

Just curious: does anyone have any feedback on the mpi-selector- 
menu / mpi-selector app that is included in OFED v1.2?  I showed it  
to a few users who were very happy with it, but then again, I'm  
somewhat biased.  :-)

Do HP MPI / Intel MPI plan to integrate with these tools?  I am  
pretty sure that I sent instructions to both groups, but if I didn't,  
or if those instructions got lost, let me know and I can re-send (or  
you can just read the man pages).

Any feedback from out in the wild would be appreciated.  Thanks.

-- 
Jeff Squyres
Cisco Systems


From ogerlitz at voltaire.com  Tue Jul  3 22:54:35 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Jul 2007 08:54:35 +0300
Subject: [ofa-general] Re: Re: IPoIB-CM UC mode
In-Reply-To: <468A971B.5090507@mellanox.co.il>
References: <20070702145328.GC17858@mellanox.co.il>	<15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com>	<20070702195314.GA31169@mellanox.co.il>	<15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com>	<20070703060049.GF1147@mellanox.co.il>
	<468A1A9A.5060208@voltaire.com> <468A971B.5090507@mellanox.co.il>
Message-ID: <468B361B.9090808@voltaire.com>

Tziporet Koren wrote:
> Or Gerlitz wrote:

>> Dror did not mention the HW, my understanding is that this aspect is 
>> fine... now, assuming the need for liveness protocol is behind us, and 
>> if not, it can be implemented as you suggested. the problem is  
>> narrowed to have the FW support SRQ/UC.

> We still don't have a solid plan for this in the FW

The current implementation of IPoIB-CM uses RC, and Michael has
admitted that the RC ACKs actually serve only as keep-alives, no more.

-> Under a high packet rate there are tens (hundreds?!) of --thousands--
of keep-alive messages per second.

This is a --very-- poor architecture; the code must move to UC.

The only actual barrier here is FW support for UC/SRQ, the IBTA 
signature can come later (and is on its way).

Or.


From ogerlitz at voltaire.com  Tue Jul  3 23:22:03 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Jul 2007 09:22:03 +0300
Subject: [ofa-general] Re: [PATCH] ib/cm: fix handling of duplicate SIDR REQs
In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
References: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
Message-ID: <468B3C8B.7040409@voltaire.com>

Sean Hefty wrote:
> Fix handling to duplicate SIDR REQs to avoid sending a reject if
> one is detected.  Duplicates should simply be discarded.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Sean,

I applied the patches on top of OFED 1.2 and tested with both
udaddy and the UD app; it works fine, thanks.

So will you push this for 2.6.23?

Or.


From ogerlitz at voltaire.com  Tue Jul  3 23:32:37 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Jul 2007 09:32:37 +0300
Subject: [ewg] Re: [ofa-general] OFED July 2,	meeting summary on next
	OFED plans
In-Reply-To: <1183496069.5757.33.camel@trinity.ogc.int>
References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com>	<adair915mun.fsf@cisco.com>
	<468AB301.4020807@opengridcomputing.com>
	<1183496069.5757.33.camel@trinity.ogc.int>
Message-ID: <468B3F05.4050000@voltaire.com>

Tom Tucker wrote:
> Perhaps the activity is not where you're used to looking. Both Trond and
> Neal reviewed the previous patchset and provided feedback that I
> addressed in the most recent patchset. That said, I'm sure there will be
> quite a bit more before it's mergeable. 

Indeed, every other IB-related kernel ULP that was submitted upstream was
sent for review to the "openib" (open-fabrics general) list AND another
mailing list (e.g. netdev, linux-scsi) AND lkml.

As was commented here in the past, these ULPs typically involve two 
disciplines, in this case, NFS and the RDMA stack.

You are expected to ask --both-- communities to review the code
before sending it to everyone (lkml) for another review, and only
then merge it.

Or.


From ogerlitz at voltaire.com  Wed Jul  4 00:14:07 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 04 Jul 2007 10:14:07 +0300
Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support
In-Reply-To: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com>
References: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com>
Message-ID: <468B48BF.40606@voltaire.com>

Sean Hefty wrote:
> IPoIB registers its event handler per port, so requires the extra check.  Both
> the multicast and inform info modules register their event handlers per device,
> so the check isn't necessary.

Got it, I did not fully understand the code on first read.

Or.


From vlad at lists.openfabrics.org  Wed Jul  4 02:45:32 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed,  4 Jul 2007 02:45:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070704-0200 daily build status
Message-ID: <20070704094533.40016E60830@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From glebn at voltaire.com  Wed Jul  4 05:11:16 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Wed, 4 Jul 2007 15:11:16 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adaved293fc.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
	<1183374715.4377.127455.camel@hal.voltaire.com>
	<4688F671.40408@dev.mellanox.co.il>
	<1183382948.4377.136789.camel@hal.voltaire.com>
	<adaved293fc.fsf@cisco.com>
Message-ID: <20070704121116.GZ17699@minantech.com>

On Mon, Jul 02, 2007 at 09:27:19AM -0700, Roland Dreier wrote:
> 
>  > > Correct. The number of messages in flight per EEC is 1 per IB spec.
>  > > The fact that IB requires SQ WQEs to complete in order, even if their 
>  > > destination is different EECs,
>  > 
>  > Where's this requirement in the spec (and could this be relaxed as it
>  > seems like it is overly "specified") ? Just wondering...
> 
> I don't think we want to relax the requirement that work requests
> complete in order.  It's hard enough to get applications correct
> without having to worry about out-of-order completions, and I think
> specifying all the corner cases would be a nightmare.  Eg do we allow
> successful completions after a completion with error?  and so on...
I don't think it will be a problem (for MPI at least) if work requests to
different destinations complete out of order. What does the spec say about
completion with error? Should the RD QP move to the error state?

--
			Gleb.


From hnguyen at linux.vnet.ibm.com  Wed Jul  4 07:11:29 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 4 Jul 2007 16:11:29 +0200
Subject: [ofa-general] Re: idr_get_new_above() limitation?
In-Reply-To: <1183422700.3130.27.camel@localhost.localdomain>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
	<1183422700.3130.27.camel@localhost.localdomain>
Message-ID: <200707041611.30056.hnguyen@linux.vnet.ibm.com>

On Tuesday 03 July 2007 02:31, Jim Houston wrote:
> The problem is in idr_get_new_above_int() in the loop which
> adds new layers to the top of the radix tree.  It is failing
> the "layers < (MAX_LEVEL - 1)" test.  It doesn't allocate the
> new layer but still calls sub_alloc() which relies on having
> the new layer properly constructed.  I believe that it is
> allocating the slot which corresponds to id = 0.
Hi Jim,
Thanks for your quick reply.
Yes, I noticed that while-condition too and tried a tiny change,
(layers < MAX_LEVEL), but idr_find() still failed, even though
6 layers were created and the object was added at the proper
location. After several debug cycles I think I have found the root
cause in the if-condition in idr_find():
void *idr_find(struct idr *idp, int id)
{
	int n;
	struct idr_layer *p;

	n = idp->layers * IDR_BITS;
	p = idp->top;

	/* Mask off upper bits we don't use for the search. */
	id &= MAX_ID_MASK;

	if (id >= (1 << n))
		return NULL;
...
}
Since idp->layers is now 6, n equals 36, i.e. beyond the 32-bit range
(a shift that C leaves undefined for a 32-bit int), and here
	(1 << n) = (1 << 36) = 0
which makes that if-condition true, i.e. idr_find() fails.
Replacing that if-line with
	if ((long)id >= (1L << n))
makes idr_find() work properly up to MAX_ID_MASK.
Since other places need the same change, e.g. idr_replace(), and
because you're creating a patch too, I'll wait for your comment
first. Let me know if you'd prefer that I send a patch.
Regards
Nam
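
The range check discussed above can be sketched in user space as follows.
This is only an illustration, not the kernel code: the function name is
hypothetical, and IDR_BITS is assumed to be 6 (consistent with 6 layers
giving n = 36). The comparison is done in a 64-bit type so that the shift
stays well-defined regardless of the width of int or long.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IDR_BITS 6	/* assumption: 6 layers * 6 bits = 36, as in the mail */

/* Returns 1 if id lies beyond what `layers` levels can address, else 0. */
int sketch_id_out_of_range(int id, int layers)
{
	int n = layers * SKETCH_IDR_BITS;

	assert(id >= 0 && layers >= 0);

	/* For n >= 63 every non-negative int is in range; this also
	 * avoids shifting a 64-bit value by too many bits. */
	if (n >= 63)
		return 0;

	/* 64-bit arithmetic: the shift by 36 is no longer done in a 32-bit int. */
	return (int64_t)id >= ((int64_t)1 << n);
}
```

With 6 layers the check accepts every non-negative int, which is the
behaviour the (long) cast is meant to restore on a 64-bit kernel.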


From dledford at redhat.com  Wed Jul  4 09:28:04 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 04 Jul 2007 12:28:04 -0400
Subject: [ewg] RE: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611929146A@EPEXCH2.qlogic.org>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
	<4FB1BCCAE6CAED44A1DC005B1DE06119291460@EPEXCH2.qlogic.org>
	<adamyymv7ah.fsf@cisco.com>
	<4FB1BCCAE6CAED44A1DC005B1DE0611929146A@EPEXCH2.qlogic.org>
Message-ID: <1183566484.16081.123.camel@firewall.xsintricity.com>

On Tue, 2007-06-26 at 14:46 -0500, Lakshmanan, Madhu wrote:
> > From: Roland Dreier [mailto:rdreier at cisco.com]
> > Subject: Re: [ewg] RE: [ofa-general] Toward next OFED release (1.3)
> > 
> >  > VNIC:
> >  >     - GA quality. Not a technology preview version anymore.
> >  >     - Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet
> >  > gateway) - in GA
> > 
> > I hope there will be some attempt to get these drivers merged upstream
> too.
> > 
> >  - R.
> 
> Agreed in principle.

I would suggest you should agree in practice.  I couldn't care less
about principle, and I'm heavily leaning towards yanking any
drivers/ulps that don't get merged upstream from our future updates.

>  We hope to address that issue soon.
> 
> Madhu
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From dledford at redhat.com  Wed Jul  4 09:28:38 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 04 Jul 2007 12:28:38 -0400
Subject: [ewg] RE: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303C16878@xmb-sjc-216.amer.cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
	<4FB1BCCAE6CAED44A1DC005B1DE06119291460@EPEXCH2.qlogic.org>
	<adamyymv7ah.fsf@cisco.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303C16878@xmb-sjc-216.amer.cisco.com>
Message-ID: <1183566518.16081.124.camel@firewall.xsintricity.com>

On Tue, 2007-06-26 at 12:49 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > I hope there will be some attempt to get these drivers merged 
> > upstream too.
> 
> How about SDP, are we ready to try to merge it upstream?

I hope so.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From dledford at redhat.com  Wed Jul  4 09:34:47 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 04 Jul 2007 12:34:47 -0400
Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release
In-Reply-To: <1183124231.28870.268894.camel@hal.voltaire.com>
References: <1183124231.28870.268894.camel@hal.voltaire.com>
Message-ID: <1183566887.16081.126.camel@firewall.xsintricity.com>

On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote:
> There is a new release of the management libraries which include the
> ANSIfied header files available in:
> 
> http://www.openfabrics.org/~halr/
> 
> md5sum
> a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz
> 288b865a0015ac3251cffa011a7633eb  libibumad-1.0.6.tar.gz
> 04a5b6dcd2ee930f44d5715ee013f78b  libibmad-1.0.6.tar.gz

Hey Hal, I noticed you have release tarballs there for the libs, and one
for the older named openib-diags.  What would it take to get a release
tarball for infiniband-diags and one for opensm?

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From kliteyn at dev.mellanox.co.il  Thu Jul  5 00:43:55 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 05 Jul 2007 10:43:55 +0300
Subject: [ofa-general] [PATCH] osm: bug in dumping opensm.fdbs
Message-ID: <468CA13B.2040900@dev.mellanox.co.il>

Hi Hal,

The adaptation of the opensm.fdbs dump function to the recent changes in the
min hop tables broke fat-tree routing (or any other future routing that may
not use the same min hop table creation functions).

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
  opensm/opensm/osm_ucast_mgr.c |   33 ++++++++++++++++++++++++---------
  1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 5bcb655..cab272e 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(

  /**********************************************************************
   **********************************************************************/
+
  static void
  __osm_ucast_mgr_dump_ucast_routes(
    IN cl_map_item_t *p_map_item,
@@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
    uint8_t                  best_port;
    uint16_t                 max_lid_ho;
    uint16_t                 lid_ho, base_lid;
+  boolean_t                direct_route_exists = FALSE;
    osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
    osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
    FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
@@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
      */
      if( p_port->p_node->sw )
      {
+      /* Target LID is switch.
+         Get its base lid and check hop count for this base LID only.*/
        base_lid = osm_node_get_base_lid(p_port->p_node, 0);
        base_lid = cl_ntoh16(base_lid);
        num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
      }
      else
      {
-      osm_physp_t *p_physp = p_port->p_physp;
-      if( !p_physp || !p_physp->p_remote_physp ||
-          !p_physp->p_remote_physp->p_node->sw )
-        num_hops = OSM_NO_PATH;
+      /* Target LID is not switch (CA or router).
+         Check if we have route to this target from current switch.*/
+      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
+      if (num_hops != OSM_NO_PATH)
+      {
+          direct_route_exists = TRUE;
+          base_lid = lid_ho;
+      }
        else
        {
-        base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
-        base_lid = cl_ntoh16(base_lid);
-        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
-                   0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
+        osm_physp_t *p_physp = p_port->p_physp;
+        if( !p_physp || !p_physp->p_remote_physp ||
+            !p_physp->p_remote_physp->p_node->sw )
+          num_hops = OSM_NO_PATH;
+        else
+        {
+          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
+          base_lid = cl_ntoh16(base_lid);
+          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
+                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
+        }
        }
      }

@@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
      }

      best_hops = osm_switch_get_least_hops( p_sw, base_lid );
-    if (!p_port->p_node->sw)
+    if (!p_port->p_node->sw && !direct_route_exists)
      {
        best_hops++;
        num_hops++;
-- 
1.5.1.4
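
The lookup order that the diff above introduces for non-switch targets can
be summarised in a standalone sketch. This is my reading of the diff, not
OpenSM code: the function, its parameters and SKETCH_NO_PATH are
hypothetical stand-ins, and the hop-table lookups are passed in as plain
values instead of calling osm_switch_get_hop_count().

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_PATH 0xFF	/* hypothetical stand-in for OSM_NO_PATH */

/*
 * Hop count from a switch to a non-switch target (CA or router).
 *  direct_hops:  hop-table entry for the target's own LID, or SKETCH_NO_PATH
 *  has_peer_sw:  the target's port is connected to a switch
 *  peer_is_self: that switch is the one being dumped
 *  hops_to_peer: hop-table entry for the peer switch's base LID
 */
uint8_t sketch_hops_to_ca(uint8_t direct_hops, bool has_peer_sw,
			  bool peer_is_self, uint8_t hops_to_peer)
{
	uint8_t hops;

	/* Added by the patch: use a direct entry for the target LID if present. */
	if (direct_hops != SKETCH_NO_PATH)
		return direct_hops;

	/* Previous behaviour: go via the switch the target is attached to. */
	if (!has_peer_sw)
		return SKETCH_NO_PATH;

	hops = peer_is_self ? 0 : hops_to_peer;
	if (hops == SKETCH_NO_PATH)
		return SKETCH_NO_PATH;

	/* One extra hop from the peer switch to the target itself. */
	return (uint8_t)(hops + 1);
}
```

The extra hop is added only on the fallback path, which mirrors the
!direct_route_exists condition in the last hunk.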


From vlad at lists.openfabrics.org  Thu Jul  5 02:44:48 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu,  5 Jul 2007 02:44:48 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070705-0200 daily build status
Message-ID: <20070705094448.86E09E60843@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From swise at opengridcomputing.com  Thu Jul  5 05:39:38 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Jul 2007 07:39:38 -0500
Subject: [ofa-general] Feedback on mpi-selector / mpi-selector-menu
In-Reply-To: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com>
References: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com>
Message-ID: <468CE68A.3050400@opengridcomputing.com>

It works fine for me.

I've used it specifically to add debug mvapich2 libs and easily switch 
between the debug and non-debug libs.

Steve.


Jeff Squyres wrote:
> Just curious: does anyone have any feedback on the mpi-selector-menu / 
> mpi-selector app that is included in OFED v1.2?  I showed it to a few 
> users who were very happy with it, but then again, I'm somewhat biased.  
> :-)
> 
> Do HP MPI / Intel MPI plan to integrate with these tools?  I am pretty 
> sure that I sent instructions to both groups, but if I didn't, or if 
> those instructions got lost, let me know and I can re-send (or you can 
> just read the man pages).
> 
> Any feedback from out in the wild would be appreciated.  Thanks.
> 


From halr at voltaire.com  Thu Jul  5 05:40:33 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Jul 2007 08:40:33 -0400
Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs
In-Reply-To: <468CA13B.2040900@dev.mellanox.co.il>
References: <468CA13B.2040900@dev.mellanox.co.il>
Message-ID: <1183639225.4377.435484.camel@hal.voltaire.com>

Hi Yevgeny,

On Thu, 2007-07-05 at 03:43, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> opensm.fdbs dump function adaptation to the recent changes in min hop tables
> broke fat-tree routing (or any other future routing that may not use the same
> min hop tables creation functions).
> 
> -- Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied to master only (by hand again, as patch rejected these
changes :-( ). Please double check.

-- Hal


From halr at voltaire.com  Thu Jul  5 05:57:31 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Jul 2007 08:57:31 -0400
Subject: [ofa-general] [PATCH] OpenSM handling of "Babbling" Ports
Message-ID: <1183640246.4377.436639.camel@hal.voltaire.com>

A "babbling" port is a port which causes traps to be generated frequently.
Either "this" port generates the traps directly, or the peer port detects
the issue and the SMA on switch port 0 generates the traps.
So far this has only been observed for trap 131, but it also applies
to traps 129 and 130, which are similar urgent traps.

Note that there appears to be a bug in Mellanox firmware, for Anafa-2 and
Tavor at a minimum, which causes the max trap rate not to be adhered to,
and a fix for this does not appear to be in sight in the short term.

Policy
When a babbling port is detected, OpenSM will disable the port or its
peer switch port (depending on which trap), which should terminate the
trap storm.

Detection
250 consecutive traps of this type will be used as the (initial)
threshold. The threshold is there so that a port is not detected as
babbling, and disabled, prematurely.

Recovery
The admin re-enables the port when it is OK again. (This usually involves
rebooting the node causing the trap to be indicated.)
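
The detection rule can be sketched as a small per-port counter. This is an
illustration only, not the code in the patch below: the struct, the function
names and the reset-on-broken-run behaviour are assumptions. Only the
250-trap threshold and the opt-in policy flag come from the description.

```c
#include <stdbool.h>
#include <stdint.h>

#define BABBLING_TRAP_THRESHOLD 250	/* initial threshold from the description */

/* Hypothetical per-port bookkeeping; not an OpenSM structure. */
struct port_trap_state {
	uint32_t consecutive_traps;
	bool     disabled;
};

/*
 * Record one urgent trap (129/130/131) for a port.
 * Returns true when the port should be disabled now.
 */
bool port_trap_received(struct port_trap_state *st, bool policy_enabled)
{
	if (st->disabled)
		return false;

	/* Saturate instead of wrapping when the policy is off. */
	if (st->consecutive_traps < UINT32_MAX)
		st->consecutive_traps++;

	if (policy_enabled && st->consecutive_traps >= BABBLING_TRAP_THRESHOLD) {
		st->disabled = true;
		return true;
	}
	return false;
}

/* Assumption: anything that breaks the run of traps resets the count. */
void port_trap_run_broken(struct port_trap_state *st)
{
	st->consecutive_traps = 0;
}
```

Keeping the policy flag as an argument mirrors the babbling_port_policy
option, which defaults to FALSE, so the counter alone never disables a port.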

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index bedd63f..1150703 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
   boolean_t                honor_guid2lid_file;
   boolean_t                daemon;
   boolean_t                sm_inactive;
+  boolean_t                babbling_port_policy;
   osm_qos_options_t        qos_options;
   osm_qos_options_t        qos_ca_options;
   osm_qos_options_t        qos_sw0_options;
@@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
 *
 *	sm_inactive
 *		OpenSM will start with SM in not active state.
+*
+*	babbling_port_policy
+*		OpenSM will enforce its "babbling" port policy.
 *	
 *	perfmgr
 *		Enable or disable the performance manager
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 726b665..87b71e5 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -472,6 +472,7 @@ osm_subn_set_default_opt(
   p_opt->honor_guid2lid_file = FALSE;
   p_opt->daemon = FALSE;
   p_opt->sm_inactive = FALSE;
+  p_opt->babbling_port_policy = FALSE;
 #ifdef ENABLE_OSM_PERF_MGR
   p_opt->perfmgr = FALSE;
   p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
@@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
         "sm_inactive",
         p_key, p_val, &p_opts->sm_inactive);
 
+      __osm_subn_opts_unpack_boolean(
+        "babbling_port_policy",
+        p_key, p_val, &p_opts->babbling_port_policy);
+
 #ifdef ENABLE_OSM_PERF_MGR
       __osm_subn_opts_unpack_boolean(
         "perfmgr",
@@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
     "# Daemon mode\n"
     "daemon %s\n\n"
     "# SM Inactive\n"
-    "sm_inactive %s\n\n",
+    "sm_inactive %s\n\n"
+    "# Babbling Port Policy\n"
+    "babbling_port_policy %s\n\n",
     p_opts->daemon ? "TRUE" : "FALSE",
-    p_opts->sm_inactive ? "TRUE" : "FALSE"
+    p_opts->sm_inactive ? "TRUE" : "FALSE",
+    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
     );
 
 #ifdef ENABLE_OSM_PERF_MGR
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 5900c51..fbb6dac 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
         }
         else
         {
+          /* When babbling port policy option is enabled and
+             Threshold for disabling a "babbling" port is exceeded */
+          if ( p_rcv->p_subn->opt.babbling_port_policy &&
+               num_received >= 250 )
+          {
+            uint8_t               payload[IB_SMP_DATA_SIZE];
+            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
+            const ib_port_info_t* p_old_pi;
+            osm_madw_context_t    context;
+
+            /* If trap 131, might want to disable peer port if available */
+            /* but peer port has been observed not to respond to SM requests */
+
+            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                     "__osm_trap_rcv_process_request: ERR 3810: "
+                     " Disabling physical port lid:0x%02X num:%u\n",
+                     cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
+                     p_ntci->data_details.ntc_129_131.port_num
+                     );
+
+            p_old_pi = &p_physp->port_info;
+            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
+
+            /* Set port to disabled/down */
+            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
+            ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
+
+            context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
+            context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
+            context.pi_context.set_method = TRUE;
+            context.pi_context.update_master_sm_base_lid = FALSE;
+            context.pi_context.light_sweep = FALSE;
+            context.pi_context.active_transition = FALSE;
+
+            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
+                                   osm_physp_get_dr_path_ptr( p_physp ),
+                                   payload,
+                                   sizeof(payload),
+                                   IB_MAD_ATTR_PORT_INFO,
+                                   cl_hton32(osm_physp_get_port_num( p_physp )),
+                                   CL_DISP_MSGID_NONE,
+                                  &context );
+
+            if( status == IB_SUCCESS )
+            {
+               goto Exit;
+            }
+            else
+            {
+               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                        "__osm_trap_rcv_process_request: ERR 3811: "
+                        "Request to set PortInfo failed\n" );
+            }
+          }
+
           osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
                    "__osm_trap_rcv_process_request: "
                    "Marking unhealthy physical port by lid:0x%02X num:%u\n",


From kliteyn at dev.mellanox.co.il  Thu Jul  5 06:54:55 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 05 Jul 2007 16:54:55 +0300
Subject: [ofa-general] [PATCH] osm: cosmetics - removing trailing blanks
Message-ID: <468CF82F.5030409@dev.mellanox.co.il>

Hi Hal,

Removing trailing white spaces in fat-tree

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
  opensm/opensm/osm_ucast_ftree.c |  340 +++++++++++++++++++-------------------
  1 files changed, 170 insertions(+), 170 deletions(-)
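
Note for reviewers: each hunk below differs only in trailing blanks, so the
removed and added lines look identical wherever whitespace is not shown. To
illustrate the transformation, here is a minimal sketch of a per-line cleanup
in C. It is not necessarily how this patch was produced, and
strip_trailing_blanks is a hypothetical helper, not an OpenSM function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of OpenSM): strip trailing spaces and tabs
 * from one line in place, keeping a final '\n' if the line has one.
 * Returns the number of characters removed. */
static size_t strip_trailing_blanks(char *line)
{
    size_t len, end, content_len;
    int has_newline;

    assert(line != NULL);

    len = strlen(line);
    has_newline = (len > 0 && line[len - 1] == '\n');
    content_len = has_newline ? len - 1 : len;
    end = content_len;

    /* walk back over blanks that precede the newline (or end of string) */
    while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'))
        end--;

    /* re-terminate the line right after the last non-blank character */
    if (has_newline)
        line[end] = '\n';
    line[end + (has_newline ? 1 : 0)] = '\0';

    return content_len - end;
}
```

Applied to a line such as "static void \n", the helper leaves "static void\n"
in the buffer and returns 1.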

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 1ead199..e91f3ed 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -63,12 +63,12 @@
   *    so no need to use FatTree routing.
   *  - Why maximum rank is 8:
   *    Each node (switch) is assigned a unique tuple.
- *    Switches are stored in two cl_qmaps - one is
+ *    Switches are stored in two cl_qmaps - one is
   *    ordered by guid, and the other by a key that is
   *    generated from tuple. Since cl_qmap supports only
   *    a 64-bit key, the maximal tuple lenght is 8 bytes.
   *    which means that maximal tree rank is 8.
- * Note that the above also implies that each switch
+ * Note that the above also implies that each switch
   * can have at max 255 up/down ports.
   */

@@ -132,7 +132,7 @@ typedef uint8_t * ftree_fwd_tbl_t;
   **
   ***************************************************/

-typedef struct ftree_port_t_
+typedef struct ftree_port_t_
  {
     cl_map_item_t  map_item;
     uint8_t        port_num;           /* port number on the current node */
@@ -170,7 +170,7 @@ typedef struct ftree_port_group_t_
   **
   ***************************************************/

-typedef struct ftree_sw_t_
+typedef struct ftree_sw_t_
  {
     cl_map_item_t          map_item;
     osm_switch_t         * p_osm_sw;
@@ -203,7 +203,7 @@ typedef struct ftree_hca_t_ {
   **
   ***************************************************/

-typedef struct ftree_fabric_t_
+typedef struct ftree_fabric_t_
  {
     osm_opensm_t  * p_osm;
     cl_qmap_t       hca_tbl;
@@ -226,11 +226,11 @@ typedef struct ftree_fabric_t_

  static int OSM_CDECL
  __osm_ftree_compare_switches_by_index(
-   IN  const void * p1,
+   IN  const void * p1,
     IN  const void * p2)
  {
-   ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1;
-   ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2;
+   ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1;
+   ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2;

     uint16_t i;
     for (i = 0; i < FTREE_TUPLE_LEN; i++)
@@ -247,13 +247,13 @@ __osm_ftree_compare_switches_by_index(

  static int OSM_CDECL
  __osm_ftree_compare_port_groups_by_remote_switch_index(
-   IN  const void * p1,
+   IN  const void * p1,
     IN  const void * p2)
  {
-   ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1;
-   ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2;
+   ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1;
+   ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2;

-   return __osm_ftree_compare_switches_by_index(
+   return __osm_ftree_compare_switches_by_index(
                    &((*pp_g1)->remote_hca_or_sw.remote_sw),
                    &((*pp_g2)->remote_hca_or_sw.remote_sw) );
  }
@@ -290,7 +290,7 @@ __osm_ftree_sw_greater_by_index(
   **
   ***************************************************/

-static void
+static void
  __osm_ftree_tuple_init(
     IN  ftree_tuple_t tuple)
  {
@@ -310,7 +310,7 @@ __osm_ftree_tuple_assigned(

  #define FTREE_TUPLE_BUFFERS_NUM 6

-static char *
+static char *
  __osm_ftree_tuple_to_str(
     IN  ftree_tuple_t tuple)
  {
@@ -340,7 +340,7 @@ __osm_ftree_tuple_to_str(

  /***************************************************/

-static inline ftree_tuple_key_t
+static inline ftree_tuple_key_t
  __osm_ftree_tuple_to_key(
     IN  ftree_tuple_t tuple)
  {
@@ -351,9 +351,9 @@ __osm_ftree_tuple_to_key(

  /***************************************************/

-static inline void
+static inline void
  __osm_ftree_tuple_from_key(
-   IN  ftree_tuple_t tuple,
+   IN  ftree_tuple_t tuple,
     IN  ftree_tuple_key_t key)
  {
     memcpy(tuple, &key, FTREE_TUPLE_LEN);
@@ -369,7 +369,7 @@ static ftree_sw_tbl_element_t *
  __osm_ftree_sw_tbl_element_create(
     IN  ftree_sw_t * p_sw)
  {
-   ftree_sw_tbl_element_t * p_element =
+   ftree_sw_tbl_element_t * p_element =
        (ftree_sw_tbl_element_t *) malloc(sizeof(ftree_sw_tbl_element_t));
     if (!p_element)
         return NULL;
@@ -397,8 +397,8 @@ __osm_ftree_sw_tbl_element_destroy(
   **
   ***************************************************/

-static ftree_port_t *
-__osm_ftree_port_create(
+static ftree_port_t *
+__osm_ftree_port_create(
     IN  uint8_t port_num,
     IN  uint8_t remote_port_num)
  {
@@ -415,7 +415,7 @@ __osm_ftree_port_create(

  /***************************************************/

-static void
+static void
  __osm_ftree_port_destroy(
     IN  ftree_port_t * p_port)
  {
@@ -429,8 +429,8 @@ __osm_ftree_port_destroy(
   **
   ***************************************************/

-static ftree_port_group_t *
-__osm_ftree_port_group_create(
+static ftree_port_group_t *
+__osm_ftree_port_group_create(
     IN  ib_net16_t    base_lid,
     IN  ib_net16_t    remote_base_lid,
     IN  ib_net64_t  * p_port_guid,
@@ -439,9 +439,9 @@ __osm_ftree_port_group_create(
     IN  uint8_t       remote_node_type,
     IN  void        * p_remote_hca_or_sw)
  {
-   ftree_port_group_t * p_group =
+   ftree_port_group_t * p_group =
              (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t));
-   if (p_group == NULL)
+   if (p_group == NULL)
        return NULL;
     memset(p_group, 0, sizeof(ftree_port_group_t));

@@ -473,7 +473,7 @@ __osm_ftree_port_group_create(

  /***************************************************/

-static void
+static void
  __osm_ftree_port_group_destroy(
     IN  ftree_port_group_t * p_group)
  {
@@ -497,7 +497,7 @@ __osm_ftree_port_group_destroy(

  /***************************************************/

-static void
+static void
  __osm_ftree_port_group_dump(
     IN  ftree_fabric_t *p_ftree,
     IN  ftree_port_group_t * p_group,
@@ -529,9 +529,9 @@ __osm_ftree_port_group_dump(

     osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_port_group_dump:"
-           "    Port Group of size %u, port(s): %s, direction: %s\n"
+           "    Port Group of size %u, port(s): %s, direction: %s\n"
             "                  Local <--> Remote GUID (LID):"
-           "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n",
+           "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n",
             size,
             buff,
             (direction == FTREE_DIRECTION_DOWN)? "DOWN" : "UP",
@@ -570,7 +570,7 @@ __osm_ftree_port_group_add_port(
   **
   ***************************************************/

-static ftree_sw_t *
+static ftree_sw_t *
  __osm_ftree_sw_create(
     IN  ftree_fabric_t * p_ftree,
     IN  osm_switch_t   * p_osm_sw)
@@ -583,7 +583,7 @@ __osm_ftree_sw_create(
        return NULL;

     p_sw = (ftree_sw_t *)malloc(sizeof(ftree_sw_t));
-   if (p_sw == NULL)
+   if (p_sw == NULL)
        return NULL;
     memset(p_sw, 0, sizeof(ftree_sw_t));

@@ -594,9 +594,9 @@ __osm_ftree_sw_create(
     p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0);

     ports_num = osm_node_get_num_physp(p_sw->p_osm_sw->p_node);
-   p_sw->down_port_groups =
+   p_sw->down_port_groups =
        (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *));
-   p_sw->up_port_groups =
+   p_sw->up_port_groups =
        (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *));
     if (!p_sw->down_port_groups || !p_sw->up_port_groups)
        return NULL;
@@ -612,7 +612,7 @@ __osm_ftree_sw_create(

  /***************************************************/

-static void
+static void
  __osm_ftree_sw_destroy(
     IN  ftree_fabric_t * p_ftree,
     IN  ftree_sw_t     * p_sw)
@@ -640,7 +640,7 @@ __osm_ftree_sw_destroy(

  /***************************************************/

-static void
+static void
  __osm_ftree_sw_dump(
     IN  ftree_fabric_t * p_ftree,
     IN  ftree_sw_t * p_sw)
@@ -658,7 +658,7 @@ __osm_ftree_sw_dump(
             "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
            __osm_ftree_tuple_to_str(p_sw->tuple),
            cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
-          p_sw->down_port_groups_num,
+          p_sw->down_port_groups_num,
            p_sw->up_port_groups_num);

     for( i = 0; i < p_sw->down_port_groups_num; i++ )
@@ -678,7 +678,7 @@ static boolean_t
  __osm_ftree_sw_ranked(
     IN  ftree_sw_t * p_sw)
  {
-   return (p_sw->rank != 0xFFFFFFFF);
+   return (p_sw->rank != 0xFFFFFFFF);
  }

  /***************************************************/
@@ -713,7 +713,7 @@ __osm_ftree_sw_get_port_group_by_remote_lid(

  /***************************************************/

-static void
+static void
  __osm_ftree_sw_add_port(
     IN  ftree_sw_t       * p_sw,
     IN  uint8_t            port_num,
@@ -727,7 +727,7 @@ __osm_ftree_sw_add_port(
     IN  void             * p_remote_hca_or_sw,
     IN  ftree_direction_t  direction)
  {
-   ftree_port_group_t * p_group =
+   ftree_port_group_t * p_group =
         __osm_ftree_sw_get_port_group_by_remote_lid(p_sw,remote_base_lid,direction);

     if (!p_group)
@@ -756,7 +756,7 @@ __osm_ftree_sw_add_port(
  static inline void
  __osm_ftree_sw_set_fwd_table_block(
      IN  ftree_sw_t * p_sw,
-    IN  uint16_t     lid_ho,
+    IN  uint16_t     lid_ho,
      IN  uint8_t      port_num)
  {
     p_sw->lft_buf[lid_ho] = port_num;
@@ -795,17 +795,17 @@ __osm_ftree_sw_set_hops(
   **
   ***************************************************/

-static ftree_hca_t *
+static ftree_hca_t *
  __osm_ftree_hca_create(
     IN  osm_node_t * p_osm_node)
  {
     ftree_hca_t * p_hca = (ftree_hca_t *)malloc(sizeof(ftree_hca_t));
-   if (p_hca == NULL)
+   if (p_hca == NULL)
        return NULL;
     memset(p_hca,0,sizeof(ftree_hca_t));

     p_hca->p_osm_node = p_osm_node;
-   p_hca->up_port_groups = (ftree_port_group_t **)
+   p_hca->up_port_groups = (ftree_port_group_t **)
          malloc(osm_node_get_num_physp(p_hca->p_osm_node) * sizeof (ftree_port_group_t *));
     if (!p_hca->up_port_groups)
        return NULL;
@@ -815,7 +815,7 @@ __osm_ftree_hca_create(

  /***************************************************/

-static void
+static void
  __osm_ftree_hca_destroy(
     IN  ftree_hca_t * p_hca)
  {
@@ -835,7 +835,7 @@ __osm_ftree_hca_destroy(

  /***************************************************/

-static void
+static void
  __osm_ftree_hca_dump(
     IN  ftree_fabric_t * p_ftree,
     IN  ftree_hca_t * p_hca)
@@ -851,10 +851,10 @@ __osm_ftree_hca_dump(
     osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_hca_dump: "
             "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
-          cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
+          cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
            p_hca->up_port_groups_num);

-   for( i = 0; i < p_hca->up_port_groups_num; i++ )
+   for( i = 0; i < p_hca->up_port_groups_num; i++ )
        __osm_ftree_port_group_dump(p_ftree,
                                    p_hca->up_port_groups[i],
                                    FTREE_DIRECTION_UP);
@@ -877,7 +877,7 @@ __osm_ftree_hca_get_port_group_by_remote_lid(

  /***************************************************/

-static void
+static void
  __osm_ftree_hca_add_port(
     IN  ftree_hca_t * p_hca,
     IN  uint8_t       port_num,
@@ -893,7 +893,7 @@ __osm_ftree_hca_add_port(
     ftree_port_group_t * p_group;

     /* this function is supposed to be called only for adding ports
-      in hca's that lead to switches */
+      in hca's that lead to switches */
     CL_ASSERT(remote_node_type == IB_NODE_TYPE_SWITCH);

     p_group = __osm_ftree_hca_get_port_group_by_remote_lid(p_hca,remote_base_lid);
@@ -920,12 +920,12 @@ __osm_ftree_hca_add_port(
   **
   ***************************************************/

-static ftree_fabric_t *
+static ftree_fabric_t *
  __osm_ftree_fabric_create()
  {
     cl_status_t status;
     ftree_fabric_t * p_ftree = (ftree_fabric_t *)malloc(sizeof(ftree_fabric_t));
-   if (p_ftree == NULL)
+   if (p_ftree == NULL)
        return NULL;

     memset(p_ftree,0,sizeof(ftree_fabric_t));
@@ -951,7 +951,7 @@ __osm_ftree_fabric_create()

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
  {
     ftree_hca_t * p_hca;
@@ -988,13 +988,13 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)

     /* remove all the elements of sw_by_tuple_tbl */

-   p_next_element =
+   p_next_element =
        (ftree_sw_tbl_element_t *)cl_qmap_head(&p_ftree->sw_by_tuple_tbl);
-   while( p_next_element !=
+   while( p_next_element !=
            (ftree_sw_tbl_element_t *)cl_qmap_end( &p_ftree->sw_by_tuple_tbl ) )
     {
        p_element = p_next_element;
-      p_next_element =
+      p_next_element =
           (ftree_sw_tbl_element_t *)cl_qmap_next(&p_element->map_item);
        __osm_ftree_sw_tbl_element_destroy(p_element);
     }
@@ -1012,7 +1012,7 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)
  {
     if (!p_ftree)
@@ -1024,7 +1024,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)
  {
     if (rank > p_ftree->tree_rank)
@@ -1033,7 +1033,7 @@ __osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)

  /***************************************************/

-static uint8_t
+static uint8_t
  __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree)
  {
     return p_ftree->tree_rank;
@@ -1041,7 +1041,7 @@ __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node)
  {
     ftree_hca_t * p_hca = __osm_ftree_hca_create(p_osm_node);
@@ -1055,7 +1055,7 @@ __osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw)
  {
     ftree_sw_t * p_sw = __osm_ftree_sw_create(p_ftree,p_osm_sw);
@@ -1073,9 +1073,9 @@ __osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_add_sw_by_tuple(
-   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_fabric_t * p_ftree,
     IN  ftree_sw_t * p_sw)
  {
     CL_ASSERT(__osm_ftree_tuple_assigned(p_sw->tuple));
@@ -1087,9 +1087,9 @@ __osm_ftree_fabric_add_sw_by_tuple(

  /***************************************************/

-static ftree_sw_t *
+static ftree_sw_t *
  __osm_ftree_fabric_get_sw_by_tuple(
-   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_fabric_t * p_ftree,
     IN  ftree_tuple_t tuple)
  {
     ftree_sw_tbl_element_t * p_element;
@@ -1108,7 +1108,7 @@ __osm_ftree_fabric_get_sw_by_tuple(

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree)
  {
     uint32_t i;
@@ -1154,7 +1154,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree)

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_dump_general_info(
     IN  ftree_fabric_t * p_ftree)
  {
@@ -1190,7 +1190,7 @@ __osm_ftree_fabric_dump_general_info(
        }
        if (i == 0)
           addition_str = " (root) ";
-      else
+      else
           if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
              addition_str = " (leaf) ";
           else
@@ -1237,10 +1237,10 @@ __osm_ftree_fabric_dump_general_info(

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_dump_hca_ordering(
     IN  ftree_fabric_t * p_ftree)
-{
+{
     ftree_hca_t        * p_hca;
     ftree_sw_t         * p_sw;
     ftree_port_group_t * p_group;
@@ -1251,10 +1251,10 @@ __osm_ftree_fabric_dump_hca_ordering(
     FILE * p_hca_ordering_file;
     char * filename = "opensm-ftree-ca-order.dump";

-   snprintf(path, sizeof(path), "%s/%s",
+   snprintf(path, sizeof(path), "%s/%s",
              p_ftree->p_osm->subn.opt.dump_files_dir, filename);
     p_hca_ordering_file = fopen(path, "w");
-   if (!p_hca_ordering_file)
+   if (!p_hca_ordering_file)
     {
        osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
                "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: "
@@ -1263,7 +1263,7 @@ __osm_ftree_fabric_dump_hca_ordering(
        OSM_LOG_EXIT(&p_ftree->p_osm->log);
        return;
     }
-
+
     /* for each leaf switch (in indexing order) */
     for(i = 0; i < p_ftree->leaf_switches_num; i++)
     {
@@ -1274,7 +1274,7 @@ __osm_ftree_fabric_dump_hca_ordering(
           p_group = p_sw->down_port_groups[j];
           p_hca = p_group->remote_hca_or_sw.remote_hca;

-         fprintf(p_hca_ordering_file,"0x%x\t%s\n",
+         fprintf(p_hca_ordering_file,"0x%x\t%s\n",
                   cl_ntoh16(p_group->remote_base_lid),
                   p_hca->p_osm_node->print_desc);
        }
@@ -1293,7 +1293,7 @@ __osm_ftree_fabric_dump_hca_ordering(

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_assign_tuple(
     IN   ftree_fabric_t * p_ftree,
     IN   ftree_sw_t * p_sw,
@@ -1305,7 +1305,7 @@ __osm_ftree_fabric_assign_tuple(

  /***************************************************/

-static void
+static void
  __osm_ftree_fabric_assign_first_tuple(
     IN   ftree_fabric_t * p_ftree,
     IN   ftree_sw_t * p_sw)
@@ -1353,7 +1353,7 @@ __osm_ftree_fabric_get_new_tuple(
     {
        temp_tuple[var_index] = i;
        p_sw = __osm_ftree_fabric_get_sw_by_tuple(p_ftree,temp_tuple);
-      if (p_sw == NULL) /* found free tuple */
+      if (p_sw == NULL) /* found free tuple */
           break;
     }

@@ -1444,7 +1444,7 @@ __osm_ftree_fabric_make_indexing(
             cl_ntoh16(p_sw->base_lid),
             cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)));

-   /*
+   /*
      * Now run BFS and assign indexes to all switches
      * Pseudo code of the algorithm is as follows:
      *
@@ -1482,7 +1482,7 @@ __osm_ftree_fabric_make_indexing(
           /* This is not the leaf switch, which means that all the
              ports that point down are taking us to another switches.
              No need to assign indexing to HCAs */
-         for( i = 0; i < p_sw->down_port_groups_num; i++ )
+         for( i = 0; i < p_sw->down_port_groups_num; i++ )
           {
              p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw;
              if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
@@ -1502,11 +1502,11 @@ __osm_ftree_fabric_make_indexing(
                                              new_tuple);

              /* add the newly discovered switch to the BFS queue */
-            cl_list_insert_tail(&bfs_list,
+            cl_list_insert_tail(&bfs_list,
                                  &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
           }
-         /* Done assigning indexes to all the remote switches
-            that are pointed by the downgoing ports.
+         /* Done assigning indexes to all the remote switches
+            that are pointed by the downgoing ports.
              Now sort port groups according to remote index. */
           qsort(p_sw->down_port_groups,                      /* array */
                 p_sw->down_port_groups_num,                  /* number of elements */
@@ -1521,7 +1521,7 @@ __osm_ftree_fabric_make_indexing(
        {
           /* This is not the root switch, which means that all the ports
              that are pointing up are taking us to another switches. */
-         for( i = 0; i < p_sw->up_port_groups_num; i++ )
+         for( i = 0; i < p_sw->up_port_groups_num; i++ )
           {
              p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw;
              if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
@@ -1538,18 +1538,18 @@ __osm_ftree_fabric_make_indexing(
                                              p_remote_sw,
                                              new_tuple);
              /* add the newly discovered switch to the BFS queue */
-            cl_list_insert_tail(&bfs_list,
+            cl_list_insert_tail(&bfs_list,
                                  &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
           }
-         /* Done assigning indexes to all the remote switches
-            that are pointed by the upgoing ports.
+         /* Done assigning indexes to all the remote switches
+            that are pointed by the upgoing ports.
              Now sort port groups according to remote index. */
           qsort(p_sw->up_port_groups,                        /* array */
                 p_sw->up_port_groups_num,                    /* number of elements */
                 sizeof(ftree_port_group_t *),                /* size of each element */
                 __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */
        }
-      /* Done assigning indexes to all the switches that are directly connected
+      /* Done assigning indexes to all the switches that are directly connected
           to the current switch - go to the next switch in the BFS queue */
     }
     cl_list_destroy(&bfs_list);
@@ -1594,7 +1594,7 @@ __osm_ftree_fabric_validate_topology(
     memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *));

     p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
-   while( res &&
+   while( res &&
            p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
     {
        p_sw = p_next_sw;
@@ -1602,7 +1602,7 @@ __osm_ftree_fabric_validate_topology(

        if (!reference_sw_arr[p_sw->rank])
        {
-         /* This is the first switch in the current level that
+         /* This is the first switch in the current level that
              we're checking - use it as a reference */
           reference_sw_arr[p_sw->rank] = p_sw;
        }
@@ -1726,19 +1726,19 @@ __osm_ftree_fabric_validate_topology(

  static void
  __osm_ftree_set_sw_fwd_table(
-   IN  cl_map_item_t* const p_map_item,
+   IN  cl_map_item_t* const p_map_item,
     IN  void *context)
  {
     ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item;
     ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;

-   /* calculate lft length rounded up to a multiple of 64 (block length) */
+   /* calculate lft length rounded up to a multiple of 64 (block length) */
     uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64);

     p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho;

-   memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf,
-          p_sw->lft_buf,
+   memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf,
+          p_sw->lft_buf,
            lft_len);
     osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw);
  }
@@ -1746,10 +1746,10 @@ __osm_ftree_set_sw_fwd_table(
  /***************************************************
   ***************************************************/

-/*
+/*
   * Function: assign-up-going-port-by-descending-down
   * Given   : a switch and a LID
- * Pseudo code:
+ * Pseudo code:
   *    foreach down-going-port-group (in indexing order)
   *        skip this group if the LFT(LID) port is part of this group
   *        find the least loaded port of the group (scan in indexing order)
@@ -1785,7 +1785,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
     CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1));

     /* if there is no down-going ports */
-   if (p_sw->down_port_groups_num == 0)
+   if (p_sw->down_port_groups_num == 0)
         return;

     /* foreach down-going port group (in indexing order) */
@@ -1793,7 +1793,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
     {
        p_group = p_sw->down_port_groups[i];

-      if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) )
+      if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) )
        {
           /* This port group has a port that was used when we entered this switch,
              which means that the current group points to the switch where we were
@@ -1807,7 +1807,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
        ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports);
        /* ToDo: no need to select a least loaded port for non-main path.
           Think about optimization. */
-      for (j = 0; j < ports_num; j++)
+      for (j = 0; j < ports_num; j++)
        {
            cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port);
            if (!p_min_port)
@@ -1821,16 +1821,16 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
               p_min_port = p_port;
            }
        }
-      /* At this point we have selected a port in this group with the
+      /* At this point we have selected a port in this group with the
           lowest load of upgoing routes.
           Set on the remote switch how to get to the target_lid -
           set LFT(target_lid) on the remote switch to the remote port */
        p_remote_sw = p_group->remote_hca_or_sw.remote_sw;

-      if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw,
+      if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw,
                                       cl_ntoh16(target_lid)) != OSM_NO_PATH )
        {
-         /* Loop in the fabric - we already routed the remote switch
+         /* Loop in the fabric - we already routed the remote switch
              on our way UP, and now we see it again on our way DOWN */
           osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                   "__osm_ftree_fabric_route_upgoing_by_going_down: "
@@ -1846,28 +1846,28 @@ __osm_ftree_fabric_route_upgoing_by_going_down(

        /* Four possible cases:
         *
-       *  1. is_real_lid == TRUE && is_main_path == TRUE:
+       *  1. is_real_lid == TRUE && is_main_path == TRUE:
         *      - going DOWN(TRUE,TRUE) through ALL the groups
         *         + promoting port counter
         *         + setting path in remote switch fwd tbl
         *         + setting hops in remote switch on all the ports of each group
-       *
-       *  2. is_real_lid == TRUE && is_main_path == FALSE:
+       *
+       *  2. is_real_lid == TRUE && is_main_path == FALSE:
         *      - going DOWN(TRUE,FALSE) through ALL the groups but only if
-       *        the remote (upper) switch hasn't been already configured
+       *        the remote (upper) switch hasn't been already configured
         *        for this target LID
         *         + NOT promoting port counter
         *         + setting path in remote switch fwd tbl if it hasn't been set yet
         *         + setting hops in remote switch on all the ports of each group
         *           if it hasn't been set yet
         *
-       *  3. is_real_lid == FALSE && is_main_path == TRUE:
+       *  3. is_real_lid == FALSE && is_main_path == TRUE:
         *      - going DOWN(FALSE,TRUE) through ALL the groups
         *         + promoting port counter
         *         + NOT setting path in remote switch fwd tbl
         *         + NOT setting hops in remote switch
         *
-       *  4. is_real_lid == FALSE && is_main_path == FALSE:
+       *  4. is_real_lid == FALSE && is_main_path == FALSE:
         *      - illegal state - we shouldn't get here
         */

@@ -1908,8 +1908,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(


        }
-
-      /* The number of upgoing routes is tracked in the
+
+      /* The number of upgoing routes is tracked in the
           p_port->counter_up counter of the port that belongs to
           the upper side of the link (on switch with lower rank).
           Counter is promoted only if we're routing LID on the main
@@ -1939,10 +1939,10 @@ __osm_ftree_fabric_route_upgoing_by_going_down(

  /***************************************************/

-/*
+/*
   * Function: assign-down-going-port-by-descending-up
   * Given   : a switch and a LID
- * Pseudo code:
+ * Pseudo code:
   *    find the least loaded port of all the upgoing groups (scan in indexing order)
   *    assign the LFT(LID) of remote switch to that port
   *    track that port usage
@@ -2011,7 +2011,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
              p_min_port = p_port;
           }
           else
-         {
+         {
              if ( p_port->counter_down < p_min_port->counter_down  )
              {
                 /* this port is less loaded - use it as min */
@@ -2022,7 +2022,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
        }
     }

-   /* At this point we have selected a group and port with the
+   /* At this point we have selected a group and port with the
        lowest load of downgoing routes.
        Set on the remote switch how to get to the target_lid -
        set LFT(target_lid) on the remote switch to the remote port */
@@ -2030,7 +2030,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(

     /* Four possible cases:
      *
-    *  1. is_real_lid == TRUE && is_main_path == TRUE:
+    *  1. is_real_lid == TRUE && is_main_path == TRUE:
      *      - going UP(TRUE,TRUE) on selected min_group and min_port
      *         + promoting port counter
      *         + setting path in remote switch fwd tbl
@@ -2040,23 +2040,23 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
      *         + setting path in remote switch fwd tbl if it hasn't been set yet
      *         + setting hops in remote switch on all the ports of each group
      *           if it hasn't been set yet
-    *
-    *  2. is_real_lid == TRUE && is_main_path == FALSE:
+    *
+    *  2. is_real_lid == TRUE && is_main_path == FALSE:
      *      - going UP(TRUE,FALSE) on ALL the groups, each time on port 0,
-    *        but only if the remote (upper) switch hasn't been already
+    *        but only if the remote (upper) switch hasn't been already
      *        configured for this target LID
      *         + NOT promoting port counter
      *         + setting path in remote switch fwd tbl if it hasn't been set yet
      *         + setting hops in remote switch on all the ports of each group
      *           if it hasn't been set yet
      *
-    *  3. is_real_lid == FALSE && is_main_path == TRUE:
+    *  3. is_real_lid == FALSE && is_main_path == TRUE:
      *      - going UP(FALSE,TRUE) ONLY on selected min_group and min_port
      *         + promoting port counter
      *         + NOT setting path in remote switch fwd tbl
      *         + NOT setting hops in remote switch
      *
-    *  4. is_real_lid == FALSE && is_main_path == FALSE:
+    *  4. is_real_lid == FALSE && is_main_path == FALSE:
      *      - illegal state - we shouldn't get here
      */

@@ -2073,7 +2073,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
                   __osm_ftree_tuple_to_str(p_sw->tuple),
                   __osm_ftree_tuple_to_str(p_remote_sw->tuple));
        }
-      /* The number of downgoing routes is tracked in the
+      /* The number of downgoing routes is tracked in the
           p_port->counter_down counter of the port that belongs to
           the lower side of the link (on switch with higher rank) */
        p_min_port->counter_down++;
@@ -2103,7 +2103,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
           }
        }

-      /* Recursion step:
+      /* Recursion step:
           Assign downgoing ports by stepping up, starting on REMOTE switch. */
        __osm_ftree_fabric_route_downgoing_by_going_up(
              p_ftree,
@@ -2121,18 +2121,18 @@ __osm_ftree_fabric_route_downgoing_by_going_up(

     /* What's left to do at this point:
      *
-    *  1. is_real_lid == TRUE && is_main_path == TRUE:
-    *      - going UP(TRUE,FALSE) on rest of the groups, each time on port 0,
-    *        but only if the remote (upper) switch hasn't been already
+    *  1. is_real_lid == TRUE && is_main_path == TRUE:
+    *      - going UP(TRUE,FALSE) on rest of the groups, each time on port 0,
+    *        but only if the remote (upper) switch hasn't been already
      *        configured for this target LID
      *         + NOT promoting port counter
      *         + setting path in remote switch fwd tbl if it hasn't been set yet
      *         + setting hops in remote switch on all the ports of each group
      *           if it hasn't been set yet
-    *
-    *  2. is_real_lid == TRUE && is_main_path == FALSE:
+    *
+    *  2. is_real_lid == TRUE && is_main_path == FALSE:
      *      - going UP(TRUE,FALSE) on ALL the groups, each time on port 0,
-    *        but only if the remote (upper) switch hasn't been already
+    *        but only if the remote (upper) switch hasn't been already
      *        configured for this target LID
      *         + NOT promoting port counter
      *         + setting path in remote switch fwd tbl if it hasn't been set yet
@@ -2170,7 +2170,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
                  __osm_ftree_tuple_to_str(p_sw->tuple),
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple));
        }
-
+
        cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port);
        __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
                                           cl_ntoh16(target_lid),
@@ -2191,7 +2191,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
                                   target_rank - p_remote_sw->rank);
        }

-      /* Recursion step:
+      /* Recursion step:
           Assign downgoing ports by stepping up, starting on REMOTE switch. */
        __osm_ftree_fabric_route_downgoing_by_going_up(
              p_ftree,
@@ -2207,8 +2207,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(

  /***************************************************/

-/*
- * Pseudo code:
+/*
+ * Pseudo code:
   *    foreach leaf switch (in indexing order)
   *       for each compute node (in indexing order)
   *          obtain the LID of the compute node
@@ -2303,8 +2303,8 @@ __osm_ftree_fabric_route_to_hcas(

  /***************************************************/

-/*
- * Pseudo code:
+/*
+ * Pseudo code:
   *    foreach switch in fabric
   *       obtain its LID
   *       set local LFT(LID) to port 0
@@ -2364,7 +2364,7 @@ __osm_ftree_fabric_route_to_switches(
  /***************************************************
   ***************************************************/

-static int
+static int
  __osm_ftree_fabric_populate_nodes(
     IN  ftree_fabric_t * p_ftree)
  {
@@ -2406,7 +2406,7 @@ __osm_ftree_fabric_populate_nodes(
  /***************************************************
   ***************************************************/

-static boolean_t
+static boolean_t
  __osm_ftree_sw_update_rank(
     IN  ftree_sw_t  * p_sw,
     IN  uint32_t      new_rank)
@@ -2422,7 +2422,7 @@ __osm_ftree_sw_update_rank(

  static void
  __osm_ftree_rank_switches_from_leafs(
-   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_fabric_t * p_ftree,
     IN  cl_list_t      * p_ranking_bfs_list)
  {
     ftree_sw_t   * p_sw;
@@ -2445,9 +2445,9 @@ __osm_ftree_rank_switches_from_leafs(
        for (i = 1; i < osm_node_get_num_physp(p_node); i++)
        {
           p_osm_port = osm_node_get_physp_ptr(p_node,i);
-         if (!osm_physp_is_valid(p_osm_port))
+         if (!osm_physp_is_valid(p_osm_port))
              continue;
-         if (!osm_link_is_healthy(p_osm_port))
+         if (!osm_link_is_healthy(p_osm_port))
              continue;

           p_remote_node = osm_node_get_remote_node(p_node,i,NULL);
@@ -2466,7 +2466,7 @@ __osm_ftree_rank_switches_from_leafs(

           /* if needed, rank the remote switch and add it to the BFS list */
           if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1))
-            cl_list_insert_tail(p_ranking_bfs_list,
+            cl_list_insert_tail(p_ranking_bfs_list,
                                  &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
        }
     }
@@ -2475,7 +2475,7 @@ __osm_ftree_rank_switches_from_leafs(

  /***************************************************/

-static int
+static int
  __osm_ftree_rank_leaf_switches(
     IN  ftree_fabric_t * p_ftree,
     IN  ftree_hca_t    * p_hca,
@@ -2493,9 +2493,9 @@ __osm_ftree_rank_leaf_switches(
     for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++)
     {
        p_osm_port = osm_node_get_physp_ptr(p_osm_node,i);
-      if (!osm_physp_is_valid(p_osm_port))
+      if (!osm_physp_is_valid(p_osm_port))
           continue;
-      if (!osm_link_is_healthy(p_osm_port))
+      if (!osm_link_is_healthy(p_osm_port))
           continue;

        p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL);
@@ -2551,7 +2551,7 @@ __osm_ftree_rank_leaf_switches(
                cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
                cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
                cl_ntoh16(p_sw->base_lid));
-      cl_list_insert_tail(p_ranking_bfs_list,
+      cl_list_insert_tail(p_ranking_bfs_list,
                            &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
     }

@@ -2562,9 +2562,9 @@ __osm_ftree_rank_leaf_switches(

  /***************************************************/

-static void
+static void
  __osm_ftree_sw_reverse_rank(
-   IN  cl_map_item_t* const p_map_item,
+   IN  cl_map_item_t* const p_map_item,
     IN  void *context)
  {
     ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
@@ -2577,7 +2577,7 @@ __osm_ftree_sw_reverse_rank(

  static int
  __osm_ftree_fabric_construct_hca_ports(
-   IN  ftree_fabric_t  * p_ftree,
+   IN  ftree_fabric_t  * p_ftree,
     IN  ftree_hca_t     * p_hca)
  {
     ftree_sw_t      * p_remote_sw;
@@ -2594,9 +2594,9 @@ __osm_ftree_fabric_construct_hca_ports(
     {
        osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i);

-      if (!osm_physp_is_valid(p_osm_port))
+      if (!osm_physp_is_valid(p_osm_port))
           continue;
-      if (!osm_link_is_healthy(p_osm_port))
+      if (!osm_link_is_healthy(p_osm_port))
           continue;

        p_remote_osm_port = osm_physp_get_remote(p_osm_port);
@@ -2665,9 +2665,9 @@ __osm_ftree_fabric_construct_hca_ports(
  /***************************************************
   ***************************************************/

-static int
+static int
  __osm_ftree_fabric_construct_sw_ports(
-   IN  ftree_fabric_t  * p_ftree,
+   IN  ftree_fabric_t  * p_ftree,
     IN  ftree_sw_t      * p_sw)
  {
     ftree_hca_t       * p_remote_hca;
@@ -2690,9 +2690,9 @@ __osm_ftree_fabric_construct_sw_ports(
     {
        osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i);

-      if (!osm_physp_is_valid(p_osm_port))
+      if (!osm_physp_is_valid(p_osm_port))
           continue;
-      if (!osm_link_is_healthy(p_osm_port))
+      if (!osm_link_is_healthy(p_osm_port))
           continue;

        p_remote_osm_port = osm_physp_get_remote(p_osm_port);
@@ -2770,16 +2770,16 @@ __osm_ftree_fabric_construct_sw_ports(
              goto Exit;
        }
        __osm_ftree_sw_add_port(
-            p_sw,                                       /* local ftree_sw object */
-            i,                                          /* local port number */
-            remote_port_num,                            /* remote port number */
-            p_sw->base_lid,                             /* local lid */
-            remote_base_lid,                            /* remote lid */
-            osm_physp_get_port_guid(p_osm_port),        /* local port guid */
-            osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */
-            remote_node_guid,                           /* remote node guid */
-            remote_node_type,                           /* remote node type */
-            p_remote_hca_or_sw,                         /* remote ftree_hca/sw object */
+            p_sw,                                       /* local ftree_sw object */
+            i,                                          /* local port number */
+            remote_port_num,                            /* remote port number */
+            p_sw->base_lid,                             /* local lid */
+            remote_base_lid,                            /* remote lid */
+            osm_physp_get_port_guid(p_osm_port),        /* local port guid */
+            osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */
+            remote_node_guid,                           /* remote node guid */
+            remote_node_type,                           /* remote node type */
+            p_remote_hca_or_sw,                         /* remote ftree_hca/sw object */
              direction);                                 /* port direction (up or down) */

        /* Track the max lid (in host order) that exists in the fabric */
@@ -2809,8 +2809,8 @@ __osm_ftree_fabric_perform_ranking(
        initially filled with the leaf switches */
     cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl));

-   /* Mark REVERSED rank of all the switches in the subnet.
-      Start from switches that are connected to hca's, and
+   /* Mark REVERSED rank of all the switches in the subnet.
+      Start from switches that are connected to hca's, and
        scan all the switches in the subnet. */
     p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
     while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
@@ -2831,7 +2831,7 @@ __osm_ftree_fabric_perform_ranking(
        list already contains all the ranked leaf switches */
     __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list);
     cl_list_destroy(&ranking_bfs_list);
-
+
     /* REVERSED ranking of all the switches completed.
        Calculate and set FatTree rank */

@@ -2839,14 +2839,14 @@ __osm_ftree_fabric_perform_ranking(
     osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
             "__osm_ftree_fabric_perform_ranking: "
             "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree));
-
+
     /* fix ranking of the switches by reversing the ranking direction */
     cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree);

     if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
          __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
     {
-      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
                "__osm_ftree_fabric_perform_ranking: ERR AB15: "
                "Tree rank is %u (should be between %u and %u)\n",
                __osm_ftree_fabric_get_rank(p_ftree),
@@ -2907,7 +2907,7 @@ __osm_ftree_fabric_populate_ports(
  /***************************************************
   ***************************************************/

-static int
+static int
  __osm_ftree_construct_fabric(
     IN  void * context)
  {
@@ -2935,7 +2935,7 @@ __osm_ftree_construct_fabric(
        goto Exit;
     }

-   if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) -
+   if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) -
           cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2)
     {
        osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
@@ -3061,7 +3061,7 @@ __osm_ftree_construct_fabric(
  /***************************************************
   ***************************************************/

-static int
+static int
  __osm_ftree_do_routing(
     IN  void * context)
  {
@@ -3104,7 +3104,7 @@ __osm_ftree_do_routing(
  /***************************************************
   ***************************************************/

-static void
+static void
  __osm_ftree_delete(
     IN  void * context)
  {
-- 
1.5.1.4


From eli at mellanox.co.il  Thu Jul  5 06:55:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 05 Jul 2007 16:55:22 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
Message-ID: <1183643723.25031.262.camel@mtls03>

In UDP tests we have been running here, I noticed that when sending a
high rate of UDP packets over ipoib, packets are sometimes dropped.
Investigating further, I found that the packets are dropped because
the socket buffer is exhausted and we fail in the following code:

net/core/sock.c

int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
	int err = 0;
	int skb_len;

	/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
	   number of warnings when compiling with -W --ANK
	 */
	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
	    (unsigned)sk->sk_rcvbuf) {
		err = -ENOMEM;
		goto out;
	}


In the condition above skb->truesize is about the same as the size
allocated for the skb; for small packets, this will charge the socket
far more than it actually consumed.
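
To illustrate the effect, here is a minimal userspace sketch of the
accounting in the check above. It is not kernel code: would_drop() and
packets_until_drop() are hypothetical helpers, and the sizes used in the
example are assumptions chosen only to show the trend.

```c
#include <stddef.h>

/*
 * Hypothetical model of the check quoted above: a packet is rejected once
 * the memory already charged to the socket plus this packet's truesize
 * reaches the receive-buffer limit.
 */
static int would_drop(size_t rmem_alloc, size_t truesize, size_t rcvbuf)
{
	return rmem_alloc + truesize >= rcvbuf;
}

/*
 * Count how many packets of a given truesize can be queued on an empty
 * socket before the check starts rejecting them.
 */
static size_t packets_until_drop(size_t truesize, size_t rcvbuf)
{
	size_t rmem_alloc = 0;
	size_t queued = 0;

	if (truesize == 0)
		return 0;	/* avoid looping forever on a zero charge */

	while (!would_drop(rmem_alloc, truesize, rcvbuf)) {
		rmem_alloc += truesize;	/* socket is charged truesize, not len */
		queued++;
	}
	return queued;
}
```

With an assumed limit of 100000 bytes, a charge of 2500 bytes per packet
allows 39 packets to be queued, while a charge of 400 bytes per packet
allows 249, even if the payload is the same small datagram in both cases.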

I used the following patch to improve things in this regard; it passes
smaller skbs up to the stack. I am not saying this is the best way to
handle this, but I would like to hear opinions on how we should
address this problem.

Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 16:54:56.000000000 +0300
+++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 17:10:32.000000000 +0300
@@ -50,6 +50,8 @@
 		 "Enable data path debug tracing if > 0");
 #endif
 
+#define SKB_LEN_THOLD 150
+
 static DEFINE_MUTEX(pkey_mutex);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
@@ -169,7 +171,7 @@
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
-	struct sk_buff *skb;
+	struct sk_buff *skb, *nskb;
 	u64 addr;
 
 	ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
@@ -223,6 +225,19 @@
 		++priv->stats.rx_packets;
 		priv->stats.rx_bytes += skb->len;
 
+		if (skb->len < SKB_LEN_THOLD) {
+			nskb = dev_alloc_skb(skb->len);
+			if (!nskb) {
+				ipoib_warn(priv, "failed to allocate skb\n");
+				return;
+			}
+			memcpy(nskb->data, skb->data, skb->len);
+			skb_put(nskb, skb->len);
+			nskb->protocol = skb->protocol;
+			dev_kfree_skb_any(skb);
+			skb = nskb;
+		}
+
 		skb->dev = dev;
 		/* XXX get correct PACKET_ type here */
 		skb->pkt_type = PACKET_HOST;
@@ -350,7 +365,6 @@
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int n, i;
 
-	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	do {
 		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
@@ -363,6 +377,7 @@
 				ipoib_ib_handle_tx_wc(dev, wc);
 		}
 	} while (n == IPOIB_NUM_WC);
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 }
 #endif
 

From ogerlitz at voltaire.com  Thu Jul  5 07:10:24 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 05 Jul 2007 17:10:24 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03>
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <468CFBD0.6040407@voltaire.com>

Eli Cohen wrote:
> I used the following patch to make things better in this regard which
> passes up to the stack smaller skbs. I am not saying this is the best
> way to handle this but I would like to hear opinions as for how we
> should address this problem.
> 
> Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 16:54:56.000000000 +0300
> +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 17:10:32.000000000 +0300
> @@ -50,6 +50,8 @@
>  		 "Enable data path debug tracing if > 0");
>  #endif
>  
> +#define SKB_LEN_THOLD 150
> +
>  static DEFINE_MUTEX(pkey_mutex);
>  
>  struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
> @@ -169,7 +171,7 @@

Can you resend the patch with the function name appearing in each hunk
(i.e. after the @@; use the diff -p flag for that)?

Or.


From mshefty at ichips.intel.com  Thu Jul  5 11:09:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 05 Jul 2007 11:09:35 -0700
Subject: [ofa-general] [GIT PULL] please pull rdma-dev.git for 2.6.23
In-Reply-To: <000801c7b9e2$03dfe220$3c98070a@amr.corp.intel.com>
References: <000801c7b9e2$03dfe220$3c98070a@amr.corp.intel.com>
Message-ID: <468D33DF.3030900@ichips.intel.com>

> Please pull:
> 
> 	git://git.openfabrics.org/~shefty/rdma-dev.git for-roland
> 
> for 2.6.23.  This will pick up the following patches:

I'm guessing that you haven't gotten to these yet (no hurry), so I've 
added two more patches that were posted to the list:

ib/cm: fix handling of duplicate SIDR REQs
http://lists.openfabrics.org/pipermail/general/2007-July/037677.html

ib/cm: send no match if SIDR REQ does not match a listen
http://lists.openfabrics.org/pipermail/general/2007-July/037678.html

At this point, I'm only anticipating one more patch for 2.6.23.

- Sean


From xhejtman at ics.muni.cz  Thu Jul  5 12:31:36 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Thu, 5 Jul 2007 21:31:36 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adavecy4ysc.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
Message-ID: <20070705193136.GQ3885@ics.muni.cz>

Hello,

On Thu, Jul 05, 2007 at 09:10:11AM -0700, Roland Dreier wrote:
> I don't personally have much use for it, but of course I would be
> happy to merge changes that make this work better.
> 
> However, I would really prefer if we could have this discussion on
> general at lists.openfabrics.org instead of in private email; it's better
> for you too, because if I am too busy to answer then you may get an
> answer from someone else.  Anyway...

OK, I appended the address to the Cc.

> Are you getting these freezes when using Xen domU, or do you also see
> them with a normal kernel?  You said the card works "mostly OK" with
> dom0 -- what is not OK?

Well, in Dom0 the action:
modprobe ib_mthca
rmmod ib_mthca
modprobe ib_mthca

kills the machine. However, it is quite strange because it produces an oops
in XFS (the file system). To me, it looks like it causes some memory
corruption in the kernel. I have basically the same problem in DomU, where
the same error is triggered by the first modprobe ib_mthca.

> How did you fix the device reset problem?

Xen in DomU does not let the device modify the address BARs, so after the
device reset the BARs are not restored. I have therefore modified the Xen
PCI backend to allow direct modification of the BARs if the device operates
in permissive mode.

Anyway, direct access to the PCI config space did not solve all the problems.
Modprobe ib_mthca runs init_one up to (and including) init_hca. In
setup_hca it kills at least DomU, very often Dom0 as well, and sometimes
the whole machine, so that a physical power cycle is needed.
When it performs setup_hca, I can always see an oops in XFS in DomU.

Dmesg says that the driver could not write MTT.

Any thoughts?

-- 
Lukáš Hejtmánek


From eli at mellanox.co.il  Thu Jul  5 12:39:02 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 5 Jul 2007 22:39:02 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
References: <1183643723.25031.262.camel@mtls03> <468CFBD0.6040407@voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com>

> can you resend the patch with function named appearing in each hunk
> (ie after the @@ , use diff -p flag for that)
> Or.

Sure. It is attached now. Sorry, but I am using Outlook from home :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: udp_drop.patch
Type: application/octet-stream
Size: 1638 bytes
Desc: udp_drop.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070705/32e1c00c/attachment.obj>

From rdreier at cisco.com  Thu Jul  5 14:34:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jul 2007 14:34:35 -0700
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03> (Eli Cohen's message of "Thu,
	05 Jul 2007 16:55:22 +0300")
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <ada644y4jro.fsf@cisco.com>

Copying small packets into a new skb is actually a fairly
well-established optimization to avoid the overhead of allocating a
new skb.  For example look for RX_COPY_THRESHOLD in tg3 or copybreak
in e1000.  So this approach makes sense to me.  However, a few
comments about your patch:

 > +#define SKB_LEN_THOLD 150

150 is probably not the right value; the cost of copying half a
cacheline is probably nearly the same as a full cacheline, so this
should probably be a multiple of 64 (or at least 32, since I don't
know of any arch with smaller than 32-byte cachelines).

With that said I don't know what the right value is here.  256 seems
to be a popular choice; I guess it is system-dependent but I don't
think it makes sense to add yet another knob to adjust this.
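
As a rough sketch of the rounding argument (plain userspace C, not driver
code): round_up_to_cacheline() is a hypothetical helper, and the 64-byte
line size is an assumption taken from the discussion above rather than
something queried from the hardware.

```c
#include <stddef.h>

/* Assumed cacheline size; real systems may differ. */
#define ASSUMED_CACHELINE 64

/* Round a byte count up to the next multiple of the assumed cacheline. */
static size_t round_up_to_cacheline(size_t len)
{
	return (len + ASSUMED_CACHELINE - 1) / ASSUMED_CACHELINE *
		ASSUMED_CACHELINE;
}
```

Under that assumption a threshold of 150 rounds up to 192, while 256 is
already a multiple of the line size and stays unchanged.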

 > +		if (skb->len < SKB_LEN_THOLD) {
 > +			nskb = dev_alloc_skb(skb->len);
 > +			if (!nskb) {
 > +				ipoib_warn(priv, "failed to allocate skb\n");
 > +				return;
 > +			}
 > +			memcpy(nskb->data, skb->data, skb->len);

should be skb_copy_from_linear_data()

 > +			skb_put(nskb, skb->len);
 > +			nskb->protocol = skb->protocol;
 > +			dev_kfree_skb_any(skb);

and there's no point in freeing the old skb... we should repost it to
the receive queue instead.

 > +			skb = nskb;
 > +		}

And I think we would want something similar for ipoib_cm.c too.
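
To make the copy-and-repost idea concrete, here is a minimal userspace
model of it. This is only a sketch under stated assumptions: plain malloc()
buffers stand in for skbs, and rx_copybreak(), struct rx_result,
RX_BUF_SIZE and COPY_THRESHOLD are all hypothetical names and values, not
ipoib code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE    2048  /* assumed size of each posted receive buffer */
#define COPY_THRESHOLD 256   /* assumed copybreak threshold, not a tuned value */

struct rx_result {
	unsigned char *deliver; /* buffer passed up the stack, NULL if dropped */
	unsigned char *repost;  /* buffer to put back on the receive queue */
};

/*
 * Decide what to do with a completed receive of 'len' bytes in 'rx_buf'.
 * Small packets are copied into a right-sized buffer and the original is
 * reposted; large packets are passed up as-is and replaced by a fresh
 * buffer. Returns 0 on success, -1 if the packet had to be dropped.
 */
static int rx_copybreak(unsigned char *rx_buf, size_t len,
			struct rx_result *res)
{
	unsigned char *buf;

	/* Allocate either a right-sized copy or a full replacement buffer. */
	buf = malloc(len < COPY_THRESHOLD ? (len ? len : 1) : RX_BUF_SIZE);
	if (!buf) {
		/* Out of memory: drop the packet and reuse the old buffer. */
		res->deliver = NULL;
		res->repost = rx_buf;
		return -1;
	}

	if (len < COPY_THRESHOLD) {
		memcpy(buf, rx_buf, len);
		res->deliver = buf;     /* small copy goes up the stack */
		res->repost = rx_buf;   /* original buffer is reposted */
	} else {
		res->deliver = rx_buf;  /* large packet goes up unchanged */
		res->repost = buf;      /* fresh buffer takes its place */
	}
	return 0;
}
```

In this model the receive queue always gets a buffer back, whichever
branch is taken, and a small packet only holds on to as much memory as
its own length.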

Your patch also made me look again at how we handle packets the HCA
replicates back to us... there's no reason to free the skb and
allocate a new one; we could just repost the same skb again.  So the
patch below seems like it might help multicast senders.  What do
people think about putting this into 2.6.23?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8404f05..1094488 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -197,6 +197,13 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	}
 
 	/*
+	 * Drop packets that this interface sent, ie multicast packets
+	 * that the HCA has replicated.
+	 */
+	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
+		goto repost;
+
+	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
@@ -213,24 +220,18 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb_put(skb, wc->byte_len);
 	skb_pull(skb, IB_GRH_BYTES);
 
-	if (wc->slid != priv->local_lid ||
-	    wc->src_qp != priv->qp->qp_num) {
-		skb->protocol = ((struct ipoib_header *) skb->data)->proto;
-		skb_reset_mac_header(skb);
-		skb_pull(skb, IPOIB_ENCAP_LEN);
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb_reset_mac_header(skb);
+	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-		dev->last_rx = jiffies;
-		++priv->stats.rx_packets;
-		priv->stats.rx_bytes += skb->len;
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
 
-		skb->dev = dev;
-		/* XXX get correct PACKET_ type here */
-		skb->pkt_type = PACKET_HOST;
-		netif_receive_skb(skb);
-	} else {
-		ipoib_dbg_data(priv, "dropping loopback packet\n");
-		dev_kfree_skb_any(skb);
-	}
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_receive_skb(skb);
 
 repost:
 	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))


From rdreier at cisco.com  Thu Jul  5 14:43:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jul 2007 14:43:46 -0700
Subject: [ofa-general] Re: consumer data buffer ownership for inline sends
In-Reply-To: <Pine.LNX.4.64.0707031144130.15147@zuben> (Or Gerlitz's message
	of "Tue, 3 Jul 2007 11:50:52 +0300 (IDT)")
References: <Pine.LNX.4.64.0707031144130.15147@zuben>
Message-ID: <adawsxe34rx.fsf@cisco.com>

 > Does this means that for inline sends, when ibv_post_send returns,
 > the consumer owns back the data buffer associated with this send?
 > 
 > Can this be stated as the official policy of libibverbs?

I guess that makes sense.  I wonder if there's any conceivable
interpretation of the inline send flag where the adapter might need to
access the original buffer after the request is posted?

 - R.


From changquing.tang at hp.com  Thu Jul  5 14:48:10 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 5 Jul 2007 21:48:10 -0000
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <466E4168.2030206@mellanox.co.il>
References: <466718AB.5050507@ichips.intel.com>
	<466E4168.2030206@mellanox.co.il>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net>


Hi, uDAPL experts:

	We are testing OFED 1.2 uDAPL on systems with two IB cards each. All
the cards are connected to the same fabric, and IB verbs works fine from
any card to any other card.

	If we configure all IPoIB ib0 interfaces on one network (172.200.0.x,
255.255.255.0) and all IPoIB ib1 interfaces on another network
(172.200.1.x, 255.255.255.0), uDAPL works on all ib0 and on all ib1
as well.

	However, if we configure both ib0 and ib1 on the same network
(172.200.0.x, 255.255.255.0), uDAPL works if all ranks use ib0, but
fails if all ranks use ib1, with error code:
	DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after
dat_connect() and dat_evd_wait())

We get the same error if some ranks use ib0 and some ranks use ib1.

	Thanks in advance for any solution to this issue, or any related
experience.

--CQ


> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Tziporet Koren
> Sent: Tuesday, June 12, 2007 1:47 AM
> To: Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: Re: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> 
> Arlin Davis wrote:
> > Vlad,  please pull the latest OFED 1.2 release notes from uDAPL 
> > project (ofed_1_2 branch)
> >
> >    dapl/doc/uDAPL_release_notes.txt
> >
> > Signed-off by: Arlin Davis ardavis at ichips.intel.com 
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >
> done
> Tziporet


From rdreier at cisco.com  Thu Jul  5 15:22:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jul 2007 15:22:12 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070705193136.GQ3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Thu, 5 Jul 2007 21:31:36 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz>
Message-ID: <adasl8232zv.fsf@cisco.com>

 > Well, in Dom0 the action:
 > modprobe ib_mthca
 > rmmod ib_mthca
 > modprobe ib_mthca
 > 
 > kills the machine. However, it is quite strange because it produces oops in
 > XFS (file system), for me, it looks like it does some memory corruption in the
 > kernel and basically I have the same problem in DomU where the same error is
 > induced by the first modprobe ib_mthca.

Loading and unloading ib_mthca many times works fine on a non-Xen
system.  So there is something different about the Xen environment
that is causing a problem.  It could be a bug in mthca exposed by Xen
(eg improper use of the DMA mapping API or something like that).

Can you turn on all the memory debugging options like SLAB_DEBUG
etc. and see if it turns up anything?

Also I'd be curious to see the exact XFS oops you're getting, since it
might have a clue to what's going on.

 - R.


From mshefty at ichips.intel.com  Thu Jul  5 15:30:18 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 05 Jul 2007 15:30:18 -0700
Subject: [ofa-general] Re: running cmpost in libibcm.
In-Reply-To: <003b01c7bf51$901cef20$b056cd60$@com>
References: <003b01c7bf51$901cef20$b056cd60$@com>
Message-ID: <468D70FA.1050409@ichips.intel.com>

Copying general OFA mail list.

> We’ve been trying to run the ‘cmpost’ example under the libibcm project.
> 
> Here’s the scenario:
> 
> Device Under Test: Mellanox HCA with 2 ports. ofed-rc6 stack installed. 
> SM is disabled on the HCA
> 
> Tester Device: A test implemented on the Agilent Infiniband Generator. 
> Test brings up the subnet (has SM), and then sends a basic CM Connect 
> Request with Service ID = DTA Service ID (0x20)
> 
> Observation:
> 
> 1.       Even when cmpost is not running, the OFED stack appears to send 
> a ConnectReject with reason: Invalid Service ID. So clearly it looks 
> like there is some kind of CM service running just on device boot up. 
> How do we disable this?

The REJ is sent automatically by the kernel ib_cm module when a REQ is 
received that does not match a listen.  You can 'disable' this by 
unloading the ib_cm module, but this will disable the IB CM.

> 2.       cmpost in client mode fails to send the REQ packet

Given that observation 1 is occurring, a REQ packet must be being sent 
by someone.  (The generator?)  Note that even though cmpost connects 
over the libibcm directly, it still uses the librdmacm to obtain path 
record information.  What response do you eventually get back from the 
client side version of cmpost?

> 3.       cmpost  in server mode fails to receive a REQ packet sent from 
> Agilent Generator.  Cmpost seems to block on ‘ib_cm_get_event’ and does 
> not receive the incoming REQ.

I would verify that the service IDs sent by the generator and the one used
by cmpost match (i.e. same endianness).
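
As an illustration of how a byte-order mismatch could hide a match, here
is a small portable sketch. It assumes the 64-bit service ID travels in
big-endian (network) byte order; service_id_to_wire() and
service_id_from_wire() are hypothetical helpers, not libibcm calls.

```c
#include <stdint.h>

/* Write a 64-bit service ID as 8 bytes, most significant byte first. */
static void service_id_to_wire(uint64_t id, unsigned char out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(id >> (8 * (7 - i)));
}

/* Read 8 big-endian bytes back into a host-order 64-bit service ID. */
static uint64_t service_id_from_wire(const unsigned char in[8])
{
	uint64_t id = 0;

	for (int i = 0; i < 8; i++)
		id = (id << 8) | in[i];
	return id;
}
```

With this encoding, service ID 0x20 ends up with 0x20 in the last byte.
A sender that puts 0x20 in the first byte instead is read by the other
side as 0x2000000000000000, which would not match a listen on 0x20.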

- Sean


From ardavis at ichips.intel.com  Thu Jul  5 16:05:56 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 05 Jul 2007 16:05:56 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net>
References: <466718AB.5050507@ichips.intel.com>
	<466E4168.2030206@mellanox.co.il>
	<349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net>
Message-ID: <468D7954.3060303@ichips.intel.com>

Tang, Changqing wrote:

>	However, if we config both ib0 and ib1 on the same network
>(172.200.0.x, 255.255.255.0), uDAPL
>works if all ranks use ib0, uDAPL fails if all ranks use ib1 with error
>code:
>	DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after
>dat_connect() and dat_evd_wait()) 
>
>The same error message if some ranks use ib0, some ranks use ib1.
>  
>

What does your /etc/dat.conf look like? What is the listening port on
each interface, and what address/port are you using for each connection?

Also, can you run ucmatose to verify rdma_cma is working correctly 
across each interface?

For example:

start a server on both interfaces (I am assuming 172.200.0.1 and 
172.200.0.2)

ucmatose -b 172.200.0.1
ucmatose -b 172.200.0.2

start a client on each interface on the other system

ucmatose -s 172.200.0.1
ucmatose -s 172.200.0.2

Thanks,

-arlin


From sean.hefty at intel.com  Thu Jul  5 16:39:06 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Jul 2007 16:39:06 -0700
Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm
Message-ID: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com>

I'm looking for input on different options for handling listeners in the rdma_cm
that are slow to respond to connection requests.

Some options (in no particular order):

* Expose a call similar to ib_send_cm_mra (rdma_ack_connect?).

* Adapt an existing call for this purpose (rdma_notify? rdma_listen?).

* Have the rdma_cm always send an MRA.

* Add code to the rdma_cm to queue MRA responses, which would be sent after a
specific timeout has occurred, if the connection has not yet been accepted
or rejected.

* Add a call to the ib_cm to send an MRA, but only if a duplicate REQ is
received before the original REQ has been processed (ib_set_cm_mra?).

* Make the CMA_CM_RESPONSE_TIMEOUT a module parameter.
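
For a sense of scale on the timeout options above: IB CM timeouts are encoded as an exponent n meaning 4.096 us * 2^n. The sketch below converts such an exponent to nanoseconds. The function name is hypothetical, and the exponent 20 in the note after the code is only an illustrative value, not a statement about what CMA_CM_RESPONSE_TIMEOUT is set to.

```c
#include <stdint.h>

/* Convert an IB CM timeout exponent to nanoseconds.
 * The encoded value n stands for 4.096 us * 2^n, and 4.096 us == 4096 ns.
 * The field is 5 bits wide, so n is expected to be in 0..31. */
static uint64_t ib_cm_timeout_ns(unsigned int exponent)
{
    return UINT64_C(4096) << exponent;
}
```

With an exponent of 20 this gives 4294967296 ns, i.e. about 4.3 seconds per response; a listener that takes longer than that to accept would need an MRA or a larger exponent.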

Any thoughts or comments?

- Sean


From swise at opengridcomputing.com  Thu Jul  5 16:58:16 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Jul 2007 18:58:16 -0500
Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm
In-Reply-To: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com>
References: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com>
Message-ID: <468D8598.3070700@opengridcomputing.com>

This is all IB-specific, correct?


Sean Hefty wrote:
> I'm looking for input on different options for handling listeners in the rdma_cm
> that are slow to respond to connection requests.
> 
> Some options (in no particular order):
> 
> * Expose a call similar to ib_send_cm_mra (rdma_ack_connect?).
> 
> * Adapt an existing call for this purpose (rdma_notify? rdma_listen?).
> 
> * Have the rdma_cm always send an MRA.
> 
> * Add code to the rdma_cm to queue MRA responses, which would be sent after a
> specific timeout has occurred, if the connection had not yet already be accepted
> or rejected.
> 
> * Add a call to the ib_cm to send an MRA, but only if a duplicate REQ is
> received before the original REQ has been processed (ib_set_cm_mra?).
> 
> * Make the CMA_CM_RESPONSE_TIMEOUT a module parameter.
> 
> Any thoughts or comments?
> 
> - Sean
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From sean.hefty at intel.com  Thu Jul  5 17:04:06 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 5 Jul 2007 17:04:06 -0700
Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm
In-Reply-To: <468D8598.3070700@opengridcomputing.com>
Message-ID: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com>

>This is all IB-specific, correct?

I don't think iWarp has this issue, so, yes.  (With IB, a slow listener will
cause connections to time out waiting for an accept.)

If this is the case, then I'd like to keep the solution within the IB related
code.  My preference is not to expose / modify the rdma_cm APIs.

- Sean


From swise at opengridcomputing.com  Thu Jul  5 18:38:31 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 05 Jul 2007 20:38:31 -0500
Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm
In-Reply-To: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com>
References: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com>
Message-ID: <468D9D17.6@opengridcomputing.com>


Sean Hefty wrote:
>> This is all IB-specific, correct?
> 
> I don't think iWarp has this issue, so, yes.  (With IB, a slow listener will
> cause connections to timeout waiting for an accept.)
> 

There's no way to specify this at the server side for TCP.  It's up to 
the client in TCP to wait "long enough". :-)

> If this is the case, then I'd like to keep the solution within the IB related
> code.  My preference is not to expose / modify the rdma_cm APIs.
> 

I agree.


From changquing.tang at hp.com  Thu Jul  5 19:10:48 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri, 6 Jul 2007 02:10:48 -0000
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <468D7954.3060303@ichips.intel.com>
References: <466718AB.5050507@ichips.intel.com>
	<466E4168.2030206@mellanox.co.il>
	<349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net>
	<468D7954.3060303@ichips.intel.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net>

 
> >	However, if we config both ib0 and ib1 on the same network 
> >(172.200.0.x, 255.255.255.0), uDAPL works if all ranks use 
> ib0, uDAPL 
> >fails if all ranks use ib1 with error
> >code:
> >	DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after
> >dat_connect() and dat_evd_wait())
> >
> >The same error message if some ranks use ib0, some ranks use ib1.
> >  
> >
> 
> What does your /etc/dat.conf look like? What is the listening 
> port on each interface and what address/port are  you using 
> for each connection?

/etc/dat.conf is the default file after installation:

OpenIB-cma u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""

however, we only configure ib0 and ib1:

mpixbl05:/nis.home/ctang:/sbin/ifconfig

ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2118 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:217135 (212.0 KiB)  TX bytes:10854 (10.5 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2090 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:215361 (210.3 KiB)  TX bytes:9072 (8.8 KiB)

The listening port (conn_qual) is 1024 for the first rank, which uses the first
card (ib0), and 1025 for the second rank, which uses the second card (ib1). The
address is the "ia_attr->ia_address_ptr".

Even though I force all ranks to use only the first card (ib0), it works for a
while and then fails with NON_PEER_REJECTED when one rank tries to connect to
another rank (dat_connect() and dat_evd_wait()). (I run a simple MPI job in an
infinite loop; it fails after hundreds of runs.)


> 
> Also, can you run ucmatose to verify rdma_cma is working 
> correctly across each interface?

It works on the first card (ib0) but fails on the second card (ib1).

On mpixbl05, ib0 is "inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0"
and ib1 is "inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0".

from mpixbl06, I can ping both IPs:

mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.11 
PING 172.200.0.11 (172.200.0.11) 56(84) bytes of data.
64 bytes from 172.200.0.11: icmp_seq=1 ttl=64 time=3.50 ms
64 bytes from 172.200.0.11: icmp_seq=2 ttl=64 time=0.034 ms
64 bytes from 172.200.0.11: icmp_seq=3 ttl=64 time=0.029 ms

--- 172.200.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.029/1.189/3.504/1.636 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:


mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.5 
PING 172.200.0.5 (172.200.0.5) 56(84) bytes of data.
64 bytes from 172.200.0.5: icmp_seq=1 ttl=64 time=0.772 ms
64 bytes from 172.200.0.5: icmp_seq=2 ttl=64 time=0.038 ms
64 bytes from 172.200.0.5: icmp_seq=3 ttl=64 time=0.030 ms

--- 172.200.0.5 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.030/0.280/0.772/0.347 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:

But ucmatose works on ib0:

mpixbl05:/nis.home/ctang:ucmatose -b 172.200.0.5
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
mpixbl05:/nis.home/ctang:

mpixbl06:/lscratch/ctang/mpi2251:ucmatose -s 172.200.0.5
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/lscratch/ctang/mpi2251:

It fails on ib1:

mpixbl05:/net/mpixbl06/lscratch/ctang/test:ucmatose -b 172.200.0.11
cmatose: starting server


mpixbl06:/net/mpixbl06/lscratch/ctang/test:ucmatose -s 172.200.0.11
cmatose: starting client
cmatose: connecting
cmatose: event: 8, error: 0
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/net/mpixbl06/lscratch/ctang/test:


--CQ


> 
> For example:
> 
> start a server on both interfaces (I am assuming 172.200.0.1 and
> 172.200.0.2)
> 
> ucmatose -b 172.200.0.1
> ucmatose -b 172.200.0.2
> 
> start a client on each interface on the other system
> 
> ucmatose -s 172.200.0.1
> ucmatose -s 172.200.0.2
> 
> Thanks,
> 
> -arlin
> 


From vlad at lists.openfabrics.org  Fri Jul  6 02:46:13 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri,  6 Jul 2007 02:46:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070706-0200 daily build status
Message-ID: <20070706094613.ECD5AE60881@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From sashak at voltaire.com  Fri Jul  6 05:12:23 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 6 Jul 2007 15:12:23 +0300
Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs
In-Reply-To: <468CA13B.2040900@dev.mellanox.co.il>
References: <468CA13B.2040900@dev.mellanox.co.il>
Message-ID: <20070706121223.GA7555@sashak.voltaire.com>

Hi Yevgeny,

On 10:43 Thu 05 Jul     , Yevgeny Kliteynik wrote:
>  Hi Hal,
> 
>  opensm.fdbs dump function adaptation to the recent changes in min hop tables
>  broke fat-tree routing (or any other future routing that may not use the same
>  min hop tables creation functions).

Could you please explain how this dump function breaks the routing for
fat-tree? Thanks.

Sasha

> 
>  -- Yevgeny
> 
>  Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>  ---
>   opensm/opensm/osm_ucast_mgr.c |   33 ++++++++++++++++++++++++---------
>   1 files changed, 24 insertions(+), 9 deletions(-)
> 
>  diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
>  index 5bcb655..cab272e 100644
>  --- a/opensm/opensm/osm_ucast_mgr.c
>  +++ b/opensm/opensm/osm_ucast_mgr.c
>  @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(
> 
>   /**********************************************************************
>    **********************************************************************/
>  +
>   static void
>   __osm_ucast_mgr_dump_ucast_routes(
>     IN cl_map_item_t *p_map_item,
>  @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>     uint8_t                  best_port;
>     uint16_t                 max_lid_ho;
>     uint16_t                 lid_ho, base_lid;
>  +  boolean_t                direct_route_exists = FALSE;
>     osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
>     osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
>     FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
>  @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
>       */
>       if( p_port->p_node->sw )
>       {
>  +      /* Target LID is switch.
>  +         Get its base lid and check hop count for this base LID only.*/
>         base_lid = osm_node_get_base_lid(p_port->p_node, 0);
>         base_lid = cl_ntoh16(base_lid);
>         num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
>       }
>       else
>       {
>  -      osm_physp_t *p_physp = p_port->p_physp;
>  -      if( !p_physp || !p_physp->p_remote_physp ||
>  -          !p_physp->p_remote_physp->p_node->sw )
>  -        num_hops = OSM_NO_PATH;
>  +      /* Target LID is not switch (CA or router).
>  +         Check if we have route to this target from current switch.*/
>  +      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
>  +      if (num_hops != OSM_NO_PATH)
>  +      {
>  +          direct_route_exists = TRUE;
>  +          base_lid = lid_ho;
>  +      }
>         else
>         {
>  -        base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
>  -        base_lid = cl_ntoh16(base_lid);
>  -        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>  -                   0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
>  +        osm_physp_t *p_physp = p_port->p_physp;
>  +        if( !p_physp || !p_physp->p_remote_physp ||
>  +            !p_physp->p_remote_physp->p_node->sw )
>  +          num_hops = OSM_NO_PATH;
>  +        else
>  +        {
>  +          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
>  +          base_lid = cl_ntoh16(base_lid);
>  +          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>  +                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
>  +        }
>         }
>       }
> 
>  @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>       }
> 
>       best_hops = osm_switch_get_least_hops( p_sw, base_lid );
>  -    if (!p_port->p_node->sw)
>  +    if (!p_port->p_node->sw && !direct_route_exists)
>       {
>         best_hops++;
>         num_hops++;
>  -- 
>  1.5.1.4
> 
> 


From halr at voltaire.com  Fri Jul  6 06:00:27 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Jul 2007 09:00:27 -0400
Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release
In-Reply-To: <1183566887.16081.126.camel@firewall.xsintricity.com>
References: <1183124231.28870.268894.camel@hal.voltaire.com>
	<1183566887.16081.126.camel@firewall.xsintricity.com>
Message-ID: <1183726824.25217.69120.camel@hal.voltaire.com>

On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: 
> On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote:
> > There is a new release of the management libraries which include the
> > ANSIfied header files available in:
> > 
> > http://www.openfabrics.org/~halr/
> > 
> > md5sum
> > a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz
> > 288b865a0015ac3251cffa011a7633eb  libibumad-1.0.6.tar.gz
> > 04a5b6dcd2ee930f44d5715ee013f78b  libibmad-1.0.6.tar.gz
> 
> Hey Hal, I noticed you have release tarballs there for the libs, and one
> for the older named openib-diags.  What would it take to get a release
> tarball for infiniband-diags and one for opensm?

We're not quite there yet; there are a couple of outstanding items:
OpenSM (master) does not yet pass all the regressions, and I'd like
libibumad to support the upcoming user_mad ABI change for partition
support. After these are resolved, I think that a release of these would
then be in order. Hopefully, this can be in the next few weeks.

-- Hal


From swise at opengridcomputing.com  Fri Jul  6 07:53:28 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 06 Jul 2007 09:53:28 -0500
Subject: [ofa-general] [GIT PULL ofed_1_2] iw_cxgb3 - Don't allow interrupts
 while obtaining the ctrl-qp mutex.
Message-ID: <468E5768.7090200@opengridcomputing.com>

Vlad,

Please pull from

git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2

This patch fixes bug 681.

Below is the patch.

Steve.


-------- Original Message --------
Subject: [PATCH] Don't allow interrupts while obtaining the ctrl-qp mutex.
Date: Fri, 06 Jul 2007 09:47:57 -0500
From: Steve Wise <swise at opengridcomputing.com>
To: swise at opengridcomputing.com


Don't allow interrupts while obtaining the ctrl-qp mutex.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

  drivers/infiniband/hw/cxgb3/core/cxio_hal.c |    2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
index 9746635..dc4a385 100644
--- a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -729,7 +729,7 @@ static int __cxio_tpt_op(struct cxio_rde
  		}
  	}

-	down_interruptible(&rdev_p->ctrl_qp.sem);
+	down(&rdev_p->ctrl_qp.sem);

  	/* write PBL first if any - update pbl only if pbl list exist */
  	if (pbl) {


From sean.hefty at intel.com  Fri Jul  6 09:48:02 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 6 Jul 2007 09:48:02 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net>
Message-ID: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>

>Eventhough I force all ranks only using the first card (ib0), it works
>for a while and
>then fails with NON_PEER_REJECTED when one rank tries to connect to
>another rank (dat_connect()
>and dat_evd_wait()). (I run a simple MPI job in an infinite loop, it
>fails after hundreds runs);

This sounds like it could be a race condition as a result of running the test in
a loop.  If the client starts before the server is listening, it will receive
this sort of reject event.
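
One way an application could tolerate this race is to retry the connect a bounded number of times when it is rejected. The sketch below is generic and makes assumptions: the status codes, the callback types, and the fake server stub are hypothetical stand-ins, not uDAPL or librdmacm calls.

```c
#include <stddef.h>

#define CONN_OK       0
#define CONN_REJECTED 1   /* hypothetical stand-in for a NON_PEER_REJECTED event */

typedef int  (*connect_fn)(void *ctx);   /* one connection attempt */
typedef void (*wait_fn)(int attempt);    /* optional back-off between attempts */

/* Try to connect up to max_tries times; returns CONN_OK or the last failure. */
static int connect_with_retry(connect_fn attempt, void *ctx,
                              int max_tries, wait_fn backoff)
{
    int rc = CONN_REJECTED;
    for (int i = 0; i < max_tries; i++) {
        rc = attempt(ctx);
        if (rc == CONN_OK)
            break;
        if (backoff != NULL && i + 1 < max_tries)
            backoff(i);   /* e.g. sleep before the next attempt */
    }
    return rc;
}

/* Stub for experiments: the "server" starts listening only after some attempts. */
struct fake_server {
    int attempts_until_listening;
    int attempts_seen;
};

static int fake_attempt(void *ctx)
{
    struct fake_server *s = ctx;
    s->attempts_seen++;
    return s->attempts_seen > s->attempts_until_listening ? CONN_OK
                                                          : CONN_REJECTED;
}
```

A retry only hides the race. If the job launcher can guarantee that the server is listening before any client starts, that is the cleaner fix.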

>It works on the first card (ib0), failed on the second card (ib1)

Please take a look at the following thread:

http://lists.openfabrics.org/pipermail/general/2007-May/036559.html

In particular, see Steve's message about this:

http://lists.openfabrics.org/pipermail/general/2007-May/036571.html

and let me know if his suggestion fixes your problem.

I will update the librdmacm documentation with this information as well.

- Sean


From changquing.tang at hp.com  Fri Jul  6 10:38:10 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri, 6 Jul 2007 17:38:10 -0000
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net>
	<000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA034@G3W0634.americas.hpqcorp.net>


Sean:
	Thanks for the information. The interesting thing is that I run
OFED 1.2 uDAPL on another single-card system, and it works reliably
(thousands of runs without error); both systems have the same OS bits
and driver bits.

--CQ
 

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Friday, July 06, 2007 11:48 AM
> To: Tang, Changqing; Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> 
> >Eventhough I force all ranks only using the first card 
> (ib0), it works 
> >for a while and then fails with NON_PEER_REJECTED when one 
> rank tries 
> >to connect to another rank (dat_connect() and 
> dat_evd_wait()). (I run a 
> >simple MPI job in an infinite loop, it fails after hundreds runs);
> 
> This sounds like it could be a race condition as a result of 
> running the test in a loop.  If the client starts before the 
> server is listening, it will receive this sort of reject event.
> 
> >It works on the first card (ib0), failed on the second card (ib1)
> 
> Please take a look at the following thread:
> 
> http://lists.openfabrics.org/pipermail/general/2007-May/036559.html
> 
> In particular, see Steve's message about this:
> 
> http://lists.openfabrics.org/pipermail/general/2007-May/036571.html
> 
> and let me know if his suggestion fixes your problem.
> 
> I will update the librdmacm documentation with this 
> information as well.
> 
> - Sean
> 


From changquing.tang at hp.com  Fri Jul  6 11:08:26 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri, 6 Jul 2007 18:08:26 -0000
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net>
	<000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net>


Sean:
	Thanks, I think this solves our problem. Currently the two cards are
on different subnets. Code on either subnet is working reliably. I have
not tried putting all cards on the same subnet.

	Do you recommend configuring a single subnet or two subnets?


--CQ 

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Friday, July 06, 2007 11:48 AM
> To: Tang, Changqing; Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> 
> >Eventhough I force all ranks only using the first card 
> (ib0), it works 
> >for a while and then fails with NON_PEER_REJECTED when one 
> rank tries 
> >to connect to another rank (dat_connect() and 
> dat_evd_wait()). (I run a 
> >simple MPI job in an infinite loop, it fails after hundreds runs);
> 
> This sounds like it could be a race condition as a result of 
> running the test in a loop.  If the client starts before the 
> server is listening, it will receive this sort of reject event.
> 
> >It works on the first card (ib0), failed on the second card (ib1)
> 
> Please take a look at the following thread:
> 
> http://lists.openfabrics.org/pipermail/general/2007-May/036559.html
> 
> In particular, see Steve's message about this:
> 
> http://lists.openfabrics.org/pipermail/general/2007-May/036571.html
> 
> and let me know if his suggestion fixes your problem.
> 
> I will update the librdmacm documentation with this 
> information as well.
> 
> - Sean
> 


From dledford at redhat.com  Fri Jul  6 11:33:30 2007
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 06 Jul 2007 14:33:30 -0400
Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release
In-Reply-To: <1183726824.25217.69120.camel@hal.voltaire.com>
References: <1183124231.28870.268894.camel@hal.voltaire.com>
	<1183566887.16081.126.camel@firewall.xsintricity.com>
	<1183726824.25217.69120.camel@hal.voltaire.com>
Message-ID: <1183746810.5165.37.camel@firewall.xsintricity.com>

On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote:
> On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: 
> > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote:
> > > There is a new release of the management libraries which include the
> > > ANSIfied header files available in:
> > > 
> > > http://www.openfabrics.org/~halr/
> > > 
> > > md5sum
> > > a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz
> > > 288b865a0015ac3251cffa011a7633eb  libibumad-1.0.6.tar.gz
> > > 04a5b6dcd2ee930f44d5715ee013f78b  libibmad-1.0.6.tar.gz
> > 
> > Hey Hal, I noticed you have release tarballs there for the libs, and one
> > for the older named openib-diags.  What would it take to get a release
> > tarball for infiniband-diags and one for opensm?
> 
> We're not quite there yet; There are a couple of outstanding items:
> OpenSM (master) does not yet pass all the regressions, and I'd like
> libibumad to support the upcoming user_mad ABI change for partition
> support. After these are resolved, I think that a release of these would
> then be in order. Hopefully, this can be in the next few weeks.

It doesn't need to be a new release.  Just a tarball from any previous
stable release will work.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From changquing.tang at hp.com  Fri Jul  6 12:00:30 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri, 6 Jul 2007 19:00:30 -0000
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net>
References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net><000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
	<349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net>


Sean:
	I have 6 nodes with two IB cards on each node. If I configure
the first card on all nodes as one subnet and the second card on all nodes
as another subnet, and set arp_ignore=2, jobs on the first subnet or the
second subnet work fine.

	But when I configure all 12 cards into a single subnet, jobs on
all first cards work fine, while jobs on all second cards hang.

	Here is the IP info for one node:

ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12375 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:1293846 (1.2 MiB)  TX bytes:16008 (15.6 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12299 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:1280105 (1.2 MiB)  TX bytes:25117 (24.5 KiB)

	Do you have any idea what's wrong?  Thanks.

--CQ


> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Tang, Changqing
> Sent: Friday, July 06, 2007 1:08 PM
> To: Sean Hefty; Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> 
> 
> Sean:
> 	Thanks, I think this solve our problem. Currently two 
> cards are on different subnet. Code on either subnet is 
> working reliablely. I have not tried if all cards are on the 
> same subnet.
> 
> 	Do you recommend to config as a single subnet or two subnets ?
> 
> 
> --CQ 
> 
> > -----Original Message-----
> > From: Sean Hefty [mailto:sean.hefty at intel.com]
> > Sent: Friday, July 06, 2007 11:48 AM
> > To: Tang, Changqing; Arlin Davis
> > Cc: Vladimir Sokolovsky; OpenFabrics General
> > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> > 
> > >Eventhough I force all ranks only using the first card
> > (ib0), it works
> > >for a while and then fails with NON_PEER_REJECTED when one
> > rank tries
> > >to connect to another rank (dat_connect() and
> > dat_evd_wait()). (I run a
> > >simple MPI job in an infinite loop, it fails after hundreds runs);
> > 
> > This sounds like it could be a race condition as a result 
> of running 
> > the test in a loop.  If the client starts before the server is 
> > listening, it will receive this sort of reject event.
> > 
> > >It works on the first card (ib0), failed on the second card (ib1)
> > 
> > Please take a look at the following thread:
> > 
> > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html
> > 
> > In particular, see Steve's message about this:
> > 
> > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html
> > 
> > and let me know if his suggestion fixes your problem.
> > 
> > I will update the librdmacm documentation with this information as 
> > well.
> > 
> > - Sean
> > 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 


From arthur.jones at qlogic.com  Fri Jul  6 12:48:17 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:17 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for
	2.6.23
Message-ID: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>

hi roland,  here is the latest set of patches
for 2.6.23.  this set should address all your
comments except the ppc ioremap flags issue
(which is still being worked on).  the barrier
patch now has comments and the bad code that benh
pointed out has been eliminated by removing support
for older non-production HTX cards.

these patches are avail to pull from:

git://git.qlogic.com/ipath-linux-2.6 for-roland

nb: when i tried pulling into a for-2.6.23 branch
in your repo, i got three trivial merge conflicts
(take the new stuff).  plz let me know if you would
rather i re-base these to your tree...

arthur


From arthur.jones at qlogic.com  Fri Jul  6 12:48:23 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:23 -0700
Subject: [ofa-general] [PATCH 1/8] IB/ipath - add barrier before updating WC
	head in shared memory
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194822.9093.32572.stgit@eng-46.internal.keyresearch.com>

From: Ralph Campbell <ralph.campbell at qlogic.com>

Add a barrier to make sure the CPU doesn't reorder writes
to memory since user programs can be polling on the head index
update and the entry should be written before that.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_cq.c    |    5 ++++-
 drivers/infiniband/hw/ipath/ipath_ruc.c   |    2 ++
 drivers/infiniband/hw/ipath/ipath_srq.c   |    2 ++
 drivers/infiniband/hw/ipath/ipath_ud.c    |    2 ++
 drivers/infiniband/hw/ipath/ipath_verbs.c |    2 ++
 5 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 9014ef6..a6f04d2 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -90,6 +90,8 @@ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited)
 	wc->queue[head].sl = entry->sl;
 	wc->queue[head].dlid_path_bits = entry->dlid_path_bits;
 	wc->queue[head].port_num = entry->port_num;
+	/* Make sure queue entry is written before the head index. */
+	smp_wmb();
 	wc->head = next;
 
 	if (cq->notify == IB_CQ_NEXT_COMP ||
@@ -139,7 +141,8 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 
 		if (tail == wc->head)
 			break;
-
+		/* Make sure entry is read after head index is read. */
+		smp_rmb();
 		qp = ipath_lookup_qpn(&to_idev(cq->ibcq.device)->qp_table,
 				      wc->queue[tail].qp_num);
 		entry->qp = &qp->ibqp;
diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index 854deb5..8525674 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -194,6 +194,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only)
 			ret = 0;
 			goto bail;
 		}
+		/* Make sure entry is read after head index is read. */
+		smp_rmb();
 		wqe = get_rwqe_ptr(rq, tail);
 		if (++tail >= rq->size)
 			tail = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_srq.c b/drivers/infiniband/hw/ipath/ipath_srq.c
index 14cbbd6..40c36ec 100644
--- a/drivers/infiniband/hw/ipath/ipath_srq.c
+++ b/drivers/infiniband/hw/ipath/ipath_srq.c
@@ -80,6 +80,8 @@ int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
 		wqe->num_sge = wr->num_sge;
 		for (i = 0; i < wr->num_sge; i++)
 			wqe->sg_list[i] = wr->sg_list[i];
+		/* Make sure queue entry is written before the head index. */
+		smp_wmb();
 		wq->head = next;
 		spin_unlock_irqrestore(&srq->rq.lock, flags);
 	}
diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c
index 38ba771..f9a3338 100644
--- a/drivers/infiniband/hw/ipath/ipath_ud.c
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c
@@ -176,6 +176,8 @@ static void ipath_ud_loopback(struct ipath_qp *sqp,
 			dev->n_pkt_drops++;
 			goto bail_sge;
 		}
+		/* Make sure entry is read after head index is read. */
+		smp_rmb();
 		wqe = get_rwqe_ptr(rq, tail);
 		if (++tail >= rq->size)
 			tail = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 5aa8866..65f7181 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -327,6 +327,8 @@ static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 		wqe->num_sge = wr->num_sge;
 		for (i = 0; i < wr->num_sge; i++)
 			wqe->sg_list[i] = wr->sg_list[i];
+		/* Make sure queue entry is written before the head index. */
+		smp_wmb();
 		wq->head = next;
 		spin_unlock_irqrestore(&qp->r_rq.lock, flags);
 	}
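
For readers following the barrier patch above, here is a minimal userspace
sketch of the same ordering rule, written with C11 atomics rather than the
kernel's smp_wmb()/smp_rmb(). It is an analogy under stated assumptions, not
the driver's code: struct ring, RING_SIZE, ring_push() and ring_pop() are
hypothetical names, and the acquire/release fences used here are somewhat
stronger than the kernel's write-only and read-only barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity; one slot is always kept empty */

struct ring {
	uint32_t entries[RING_SIZE];
	_Atomic uint32_t head; /* advanced by the producer only */
	_Atomic uint32_t tail; /* advanced by the consumer only */
};

/* Producer: fill the entry first, then publish the new head index. */
static bool ring_push(struct ring *r, uint32_t value)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t next = (head + 1) % RING_SIZE;

	if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
		return false; /* ring is full */

	r->entries[head] = value;
	/* Plays the role of smp_wmb(): the entry is written before head moves. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->head, next, memory_order_relaxed);
	return true;
}

/* Consumer: read the head index first, then the entry it guards. */
static bool ring_pop(struct ring *r, uint32_t *out)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	assert(out != NULL);
	if (tail == atomic_load_explicit(&r->head, memory_order_relaxed))
		return false; /* ring is empty */

	/* Plays the role of smp_rmb(): the entry is read after head is read. */
	atomic_thread_fence(memory_order_acquire);
	*out = r->entries[tail];
	atomic_store_explicit(&r->tail, (tail + 1) % RING_SIZE,
			      memory_order_release);
	return true;
}
```

Without the release fence, a poller could observe the new head and then read
a stale entry, which is the failure the commit message describes for user
programs polling on the head index.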


From arthur.jones at qlogic.com  Fri Jul  6 12:48:28 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:28 -0700
Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>

Bryan is no longer with QLogic and we now
have a public git server and a public email
alias for infinipath driver patches.

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 MAINTAINERS |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 23a04f4..32f5701 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1989,9 +1989,10 @@ M:	jjciarla at raiz.uncu.edu.ar
 S:	Maintained
 
 IPATH DRIVER:
-P:	Bryan O'Sullivan
-M:	support at pathscale.com
+P:	Arthur Jones
+M:	infinipath at qlogic.com
 L:	openib-general at openib.org
+T:	git git://git.qlogic.com/ipath-linux-2.6
 S:	Supported
 
 IPMI SUBSYSTEM


From arthur.jones at qlogic.com  Fri Jul  6 12:48:33 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:33 -0700
Subject: [ofa-general] [PATCH 3/8] IB/ipath - Further abstract coming out of
	freeze mode, and be even more cautious
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194833.9093.53640.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

We are more careful to be sure that we don't lose information about
changes that occurred while we were in freeze mode, when the chip will
not notify us, and try to avoid false error interrupts while doing
cleanup.

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_iba6110.c |   11 ----
 drivers/infiniband/hw/ipath/ipath_iba6120.c |   11 ----
 drivers/infiniband/hw/ipath/ipath_intr.c    |   77 +++++++++++++++++++++++++++
 drivers/infiniband/hw/ipath/ipath_kernel.h  |    1 
 4 files changed, 80 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index 87b18e9..fdfa95d 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -509,16 +509,7 @@ static void ipath_ht_handle_hwerrors(struct ipath_devdata *dd, char *msg,
 		if (!hwerrs) {
 			ipath_dbg("Clearing freezemode on ignored or "
 				  "recovered hardware error\n");
-			/*
-			 * clear all sends, becauase they have may been
-			 * completed by usercode while in freeze mode, and
-			 * therefore would not be sent, and eventually
-			 * might cause the process to run out of bufs
-			 */
-			ipath_cancel_sends(dd);
-			ctrl &= ~INFINIPATH_C_FREEZEMODE;
-			ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
-					 ctrl);
+			ipath_clear_freeze(dd);
 		}
 	}
 
diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index e67e4a8..9868ccd 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -435,16 +435,7 @@ static void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg,
 			freeze_cnt++;
 			ipath_dbg("Clearing freezemode on ignored or recovered "
 				  "hardware error (%u)\n", freeze_cnt);
-			/*
-			 * clear all sends, becauase they have may been
-			 * completed by usercode while in freeze mode, and
-			 * therefore would not be sent, and eventually
-			 * might cause the process to run out of bufs
-			 */
-			ipath_cancel_sends(dd);
-			ctrl &= ~INFINIPATH_C_FREEZEMODE;
-			ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
-			   		 dd->ipath_control);
+			ipath_clear_freeze(dd);
 		}
 	}
 
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index e86a23a..ce49023 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -133,6 +133,17 @@ void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite)
 	 INFINIPATH_E_INVALIDADDR)
 
 /*
+ * this is similar to E_SUM_ERRS, but can't ignore armlaunch, don't ignore
+ * errors not related to freeze and cancelling buffers.  Can't ignore
+ * armlaunch because could get more while still cleaning up, and need
+ * to cancel those as they happen.
+ */
+#define E_SPKT_ERRS_IGNORE \
+	 (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \
+	 INFINIPATH_E_SMAXPKTLEN | INFINIPATH_E_SMINPKTLEN | \
+	 INFINIPATH_E_SPKTLEN)
+
+/*
  * these are errors that can occur when the link changes state while
  * a packet is being sent or received.  This doesn't cover things
  * like EBP or VCRC that can be the result of a sending having the
@@ -760,6 +771,72 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs)
 	return chkerrpkts;
 }
 
+
+/*
+ * try to cleanup as much as possible for anything that might have gone
+ * wrong while in freeze mode, such as pio buffers being written by user
+ * processes (causing armlaunch), send errors due to going into freeze mode,
+ * etc., and try to avoid causing extra interrupts while doing so.
+ * Forcibly update the in-memory pioavail register copies after cleanup
+ * because the chip won't do it for anything changing while in freeze mode
+ * (we don't want to wait for the next pio buffer state change).
+ * Make sure that we don't lose any important interrupts by using the chip
+ * feature that says that writing 0 to a bit in *clear that is set in
+ * *status will cause an interrupt to be generated again (if allowed by
+ * the *mask value).
+ */
+void ipath_clear_freeze(struct ipath_devdata *dd)
+{
+	int i, im;
+	__le64 val;
+
+	/* disable error interrupts, to avoid confusion */
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL);
+
+	/*
+	 * clear all sends, because they may have been
+	 * completed by usercode while in freeze mode, and
+	 * therefore would not be sent, and eventually
+	 * might cause the process to run out of bufs
+	 */
+	ipath_cancel_sends(dd);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
+			 dd->ipath_control);
+
+	/* ensure pio avail updates continue */
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
+		 dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD);
+	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
+		 dd->ipath_sendctrl);
+
+	/*
+	 * We just enabled pioavailupdate, so dma copy is almost certainly
+	 * not yet right, so read the registers directly.  Similar to init
+	 */
+	for (i = 0; i < dd->ipath_pioavregs; i++) {
+		/* deal with 6110 chip bug */
+		im = i > 3 ? ((i&1) ? i-1 : i+1) : i;
+		val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64)));
+		dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i]
+			= le64_to_cpu(val);
+	}
+
+	/*
+	 * force new interrupt if any hwerr, error or interrupt bits are
+	 * still set, and clear "safe" send packet errors related to freeze
+	 * and cancelling sends.  Re-enable error interrupts before possible
+	 * force of re-interrupt on pending interrupts.
+	 */
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear,
+		E_SPKT_ERRS_IGNORE);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
+		~dd->ipath_maskederrs);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL);
+}
+
+
 /* this is separate to allow for better optimization of ipath_intr() */
 
 static void ipath_bad_intr(struct ipath_devdata *dd, u32 * unexpectp)
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index f1f8127..8bad3e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -645,6 +645,7 @@ int ipath_enable_wc(struct ipath_devdata *dd);
 void ipath_disable_wc(struct ipath_devdata *dd);
 int ipath_count_units(int *npresentp, int *nupp, u32 *maxportsp);
 void ipath_shutdown_device(struct ipath_devdata *);
+void ipath_clear_freeze(struct ipath_devdata *);
 
 struct file_operations;
 int ipath_cdev_init(int minor, char *name, const struct file_operations *fops,
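
As a reading aid for the "6110 chip bug" line in ipath_clear_freeze() above,
the index expression im = i > 3 ? ((i&1) ? i-1 : i+1) : i can be pulled out
into a standalone function. The sketch below only restates that expression;
pioavail_reg_index() and check_swap_is_involution() are hypothetical names
that do not appear in the driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map a shadow-array index to the register index that is read for it.
 * Indices 0..3 map to themselves; from 4 upward, each even/odd pair is
 * swapped (4<->5, 6<->7, ...), exactly as in the patch's expression.
 */
static size_t pioavail_reg_index(size_t i)
{
	if (i <= 3)
		return i;
	return (i & 1) ? i - 1 : i + 1;
}

/* Swapping pairs twice must give back the original index. */
static void check_swap_is_involution(size_t n)
{
	for (size_t i = 0; i < n; i++)
		assert(pioavail_reg_index(pioavail_reg_index(i)) == i);
}
```

For example, shadow entries 4 and 5 are filled from registers 5 and 4
respectively, while entries 0 through 3 are read straight through.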


From arthur.jones at qlogic.com  Fri Jul  6 12:48:38 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:38 -0700
Subject: [ofa-general] [PATCH 4/8] IB/ipath - Change default number of kernel
	send buffers
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194838.9093.48030.stgit@eng-46.internal.keyresearch.com>

From: Ralph Campbell <ralph.campbell at qlogic.com>

The default calculation for the number of send buffers to allocate
to the kernel was too high for the PCIe version of the chip, leaving
fewer send buffers than desired for user MPI applications.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_init_chip.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 1b1af34..fa98aab 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -737,7 +737,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 	uports = dd->ipath_cfgports ? dd->ipath_cfgports - 1 : 0;
 	if (ipath_kpiobufs == 0) {
 		/* not set by user (this is default) */
-		if (piobufs >= (uports * IPATH_MIN_USER_PORT_BUFCNT) + 32)
+		if (piobufs > 144)
 			kpiobufs = 32;
 		else
 			kpiobufs = 16;
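
To make the effect of the one-line change above concrete, here is a small
sketch that puts the old and new default rules side by side. The function
names are hypothetical, and min_user_bufcnt stands in for
IPATH_MIN_USER_PORT_BUFCNT, whose value is not shown in this patch, so any
numbers fed to it are assumptions for illustration only.

```c
#include <assert.h>

/* Old default: 32 kernel buffers whenever the user ports' minimum still fits. */
static unsigned old_default_kpiobufs(unsigned piobufs, unsigned uports,
				     unsigned min_user_bufcnt)
{
	return piobufs >= uports * min_user_bufcnt + 32 ? 32 : 16;
}

/* New default: 32 kernel buffers only when the chip has more than 144. */
static unsigned new_default_kpiobufs(unsigned piobufs)
{
	return piobufs > 144 ? 32 : 16;
}

/* Buffers left for user processes once the kernel has taken its share. */
static unsigned user_bufs_left(unsigned piobufs, unsigned kpiobufs)
{
	assert(kpiobufs <= piobufs);
	return piobufs - kpiobufs;
}
```

With made-up inputs of 144 total buffers, 4 user ports and a minimum of 8
buffers per port, the old rule reserves 32 buffers for the kernel and leaves
112 for users, while the new rule reserves 16 and leaves 128.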


From arthur.jones at qlogic.com  Fri Jul  6 12:48:43 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:43 -0700
Subject: [ofa-general] [PATCH 5/8] IB/ipath - Change version wording to be
	less confusing with release number
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194843.9093.36493.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_init_chip.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index fa98aab..49951d5 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -656,7 +656,7 @@ static int init_housekeeping(struct ipath_devdata *dd,
 	ret = dd->ipath_f_get_boardname(dd, boardn, sizeof boardn);
 
 	snprintf(dd->ipath_boardversion, sizeof(dd->ipath_boardversion),
-		 "Driver %u.%u, %s, InfiniPath%u %u.%u, PCI %u, "
+		 "ChipABI %u.%u, %s, InfiniPath%u %u.%u, PCI %u, "
 		 "SW Compat %u\n",
 		 IPATH_CHIP_VERS_MAJ, IPATH_CHIP_VERS_MIN, boardn,
 		 (unsigned)(dd->ipath_revision >> INFINIPATH_R_ARCH_SHIFT) &


From arthur.jones at qlogic.com  Fri Jul  6 12:48:48 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:48 -0700
Subject: [ofa-general] [PATCH 6/8] IB/ipath - Remove support for old HTX
	InfiniPath cards
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194848.9093.92568.stgit@eng-46.internal.keyresearch.com>

From: Ralph Campbell <ralph.campbell at qlogic.com>

This patch removes support for some older pre-production HTX
InfiniPath cards.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_driver.c  |   10 +------
 drivers/infiniband/hw/ipath/ipath_iba6110.c |   39 ++++++++-------------------
 drivers/infiniband/hw/ipath/ipath_kernel.h  |    4 ---
 drivers/infiniband/hw/ipath/ipath_verbs.c   |    7 -----
 4 files changed, 12 insertions(+), 48 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index c40a542..da4a2cf 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1021,14 +1021,10 @@ void ipath_kreceive(struct ipath_devdata *dd)
 		goto bail;
 	}
 
-	/* There is already a thread processing this queue. */
-	if (test_and_set_bit(0, &dd->ipath_rcv_pending))
-		goto bail;
-
 	l = dd->ipath_port0head;
 	hdrqtail = (u32) le64_to_cpu(*dd->ipath_hdrqtailptr);
 	if (l == hdrqtail)
-		goto done;
+		goto bail;
 
 reloop:
 	for (i = 0; l != hdrqtail; i++) {
@@ -1163,10 +1159,6 @@ reloop:
 	ipath_stats.sps_avgpkts_call =
 		ipath_stats.sps_port0pkts / ++totcalls;
 
-done:
-	clear_bit(0, &dd->ipath_rcv_pending);
-	smp_mb__after_clear_bit();
-
 bail:;
 }
 
diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index fdfa95d..650745d 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -677,6 +677,12 @@ static int ipath_ht_boardname(struct ipath_devdata *dd, char *name,
 	if (n)
 		snprintf(name, namelen, "%s", n);
 
+	if (dd->ipath_boardrev != 6 && dd->ipath_boardrev != 7 &&
+	    dd->ipath_boardrev != 11) {
+		ipath_dev_err(dd, "Unsupported InfiniPath board %s!\n", name);
+		ret = 1;
+		goto bail;
+	}
 	if (dd->ipath_majrev != 3 || (dd->ipath_minrev < 2 ||
 		dd->ipath_minrev > 4)) {
 		/*
@@ -694,36 +700,11 @@ static int ipath_ht_boardname(struct ipath_devdata *dd, char *name,
 	 * copies
 	 */
 	dd->ipath_flags |= IPATH_32BITCOUNTERS;
+	dd->ipath_flags |= IPATH_GPIO_INTR;
 	if (dd->ipath_htspeed != 800)
 		ipath_dev_err(dd,
 			      "Incorrectly configured for HT @ %uMHz\n",
 			      dd->ipath_htspeed);
-	if (dd->ipath_boardrev == 7 || dd->ipath_boardrev == 11 ||
-	    dd->ipath_boardrev == 6)
-		dd->ipath_flags |= IPATH_GPIO_INTR;
-	else
-		dd->ipath_flags |= IPATH_POLL_RX_INTR;
-	if (dd->ipath_boardrev == 8) {	/* LS/X-1 */
-		u64 val;
-		val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_extstatus);
-		if (val & INFINIPATH_EXTS_SERDESSEL) {
-			/*
-			 * hardware disabled
-			 *
-			 * This means that the chip is hardware disabled,
-			 * and will not be able to bring up the link,
-			 * in any case.  We special case this and abort
-			 * early, to avoid later messages.  We also set
-			 * the DISABLED status bit
-			 */
-			ipath_dbg("Unit %u is hardware-disabled\n",
-				  dd->ipath_unit);
-			*dd->ipath_statusp |= IPATH_STATUS_DISABLED;
-			/* this value is handled differently */
-			ret = 2;
-			goto bail;
-		}
-	}
 	ret = 0;
 
 bail:
@@ -1574,8 +1555,10 @@ static int ipath_ht_early_init(struct ipath_devdata *dd)
 		 * with 128, rather than 112.
 		 */
 		dd->ipath_flags |= IPATH_GPIO_INTR;
-		dd->ipath_flags &= ~IPATH_POLL_RX_INTR;
-	}
+	} else
+		ipath_dev_err(dd, "Unsupported InfiniPath serial "
+			      "number %.16s!\n", dd->ipath_serial);
+
 	return 0;
 }
 
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index 8bad3e3..a27e062 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -391,9 +391,6 @@ struct ipath_devdata {
 	struct class_device *diag_class_dev;
 	/* timer used to prevent stats overflow, error throttling, etc. */
 	struct timer_list ipath_stats_timer;
-	/* check for stale messages in rcv queue */
-	/* only allow one intr at a time. */
-	unsigned long ipath_rcv_pending;
 	void *ipath_dummy_hdrq;	/* used after port close */
 	dma_addr_t ipath_dummy_hdrq_phys;
 
@@ -740,7 +737,6 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv);
 		 * are 64bit */
 #define IPATH_32BITCOUNTERS 0x20000
 		/* can miss port0 rx interrupts */
-#define IPATH_POLL_RX_INTR  0x40000
 #define IPATH_DISABLED      0x80000 /* administratively disabled */
 		/* Use GPIO interrupts for new counters */
 #define IPATH_GPIO_ERRINTRS 0x100000
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 0aecded..5aa8866 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1373,13 +1373,6 @@ static void __verbs_timer(unsigned long arg)
 {
 	struct ipath_devdata *dd = (struct ipath_devdata *) arg;
 
-	/*
-	 * If port 0 receive packet interrupts are not available, or
-	 * can be missed, poll the receive queue
-	 */
-	if (dd->ipath_flags & IPATH_POLL_RX_INTR)
-		ipath_kreceive(dd);
-
 	/* Handle verbs layer timeouts. */
 	ipath_ib_timer(dd->verbs_dev);
 

From arthur.jones at qlogic.com  Fri Jul  6 12:48:53 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:53 -0700
Subject: [ofa-general] [PATCH 7/8] IB/ipath - check for lack of interrupts on
	driver startup.
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194853.9093.97927.stgit@eng-46.internal.keyresearch.com>

From: Arthur Jones <arthur.jones at qlogic.com>

All too often, interrupts do not get enabled for our card due
to BIOS misconfiguration and other issues.  This patch checks
for that condition on startup and warns the user.  This patch
is based on work (check LID availability) by Robert Walsh.

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_driver.c |   23 +++++++++++++++++++++++
 drivers/infiniband/hw/ipath/ipath_intr.c   |    3 +++
 drivers/infiniband/hw/ipath/ipath_kernel.h |    5 +++++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index da4a2cf..e397ec0 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -104,6 +104,9 @@ static int __devinit ipath_init_one(struct pci_dev *,
 #define PCI_DEVICE_ID_INFINIPATH_HT 0xd
 #define PCI_DEVICE_ID_INFINIPATH_PE800 0x10
 
+/* Number of seconds before our card status check...  */
+#define STATUS_TIMEOUT 60
+
 static const struct pci_device_id ipath_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_HT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_PE800) },
@@ -119,6 +122,18 @@ static struct pci_driver ipath_driver = {
 	.id_table = ipath_pci_tbl,
 };
 
+static void ipath_check_status(struct work_struct *work)
+{
+	struct ipath_devdata *dd = container_of(work, struct ipath_devdata,
+						status_work.work);
+
+	/*
+	 * If we don't have any interrupts, let the user know and
+	 * don't bother checking again.
+	 */
+	if (dd->ipath_int_counter == 0)
+		dev_err(&dd->pcidev->dev, "No interrupts detected.\n");
+}
 
 static inline void read_bars(struct ipath_devdata *dd, struct pci_dev *dev,
 			     u32 *bar0, u32 *bar1)
@@ -187,6 +202,8 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	dd->pcidev = pdev;
 	pci_set_drvdata(pdev, dd);
 
+	INIT_DELAYED_WORK(&dd->status_work, ipath_check_status);
+
 	list_add(&dd->ipath_list, &ipath_dev_list);
 
 bail_unlock:
@@ -511,6 +528,9 @@ static int __devinit ipath_init_one(struct pci_dev *pdev,
 	ipath_diag_add(dd);
 	ipath_register_ib_device(dd);
 
+	/* Check the card status in STATUS_TIMEOUT seconds. */
+	schedule_delayed_work(&dd->status_work, HZ * STATUS_TIMEOUT);
+
 	goto bail;
 
 bail_irqsetup:
@@ -638,6 +658,9 @@ static void __devexit ipath_remove_one(struct pci_dev *pdev)
 	 */
 	ipath_shutdown_device(dd);
 
+	cancel_delayed_work(&dd->status_work);
+	flush_scheduled_work();
+
 	if (dd->verbs_dev)
 		ipath_unregister_ib_device(dd->verbs_dev);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index ce49023..47aa434 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1009,6 +1009,9 @@ irqreturn_t ipath_intr(int irq, void *data)
 
 	ipath_stats.sps_ints++;
 
+	if (dd->ipath_int_counter != (u32) -1)
+		dd->ipath_int_counter++;
+
 	if (!(dd->ipath_flags & IPATH_PRESENT)) {
 		/*
 		 * This return value is not great, but we do not want the
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index a27e062..3105005 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -297,6 +297,8 @@ struct ipath_devdata {
 	u32 ipath_lastport_piobuf;
 	/* is a stats timer active */
 	u32 ipath_stats_timer_active;
+	/* number of interrupts for this device -- saturates... */
+	u32 ipath_int_counter;
 	/* dwords sent read from counter */
 	u32 ipath_lastsword;
 	/* dwords received read from counter */
@@ -571,6 +573,9 @@ struct ipath_devdata {
 	u32 ipath_overrun_thresh_errs;
 	u32 ipath_lli_errs;
 
+	/* status check work */
+	struct delayed_work status_work;
+
 	/*
 	 * Not all devices managed by a driver instance are the same
 	 * type, so these fields must be per-device.
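
The interrupt counter added above saturates instead of wrapping, which
matters because the status check treats a value of zero as "no interrupts
seen". A minimal standalone sketch of that idea follows; struct dev_state,
note_interrupt() and interrupts_seen() are hypothetical names, and the
delayed-work scheduling from the patch is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dev_state {
	uint32_t int_counter; /* number of interrupts seen; saturates */
};

/* Called once per interrupt: count up, but stop at the maximum value. */
static void note_interrupt(struct dev_state *d)
{
	assert(d != NULL);
	if (d->int_counter != UINT32_MAX)
		d->int_counter++;
}

/* The later status check only needs to know whether any interrupt arrived. */
static bool interrupts_seen(const struct dev_state *d)
{
	return d->int_counter != 0;
}
```

If the counter were allowed to wrap, a busy device could roll over to zero
just before the check ran and be reported as having no interrupts.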


From arthur.jones at qlogic.com  Fri Jul  6 12:48:58 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 06 Jul 2007 12:48:58 -0700
Subject: [ofa-general] [PATCH 8/8] IB/ipath -- remove bogus RD_ATOMIC checks
	from modify_qp
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706194858.9093.40689.stgit@eng-46.internal.keyresearch.com>

The changeset:

  commit 3859e39d75b72f35f7d38c618fbbacb39a440c22
  Author: Ralph Campbell <ralph.campbell at qlogic.com>
  Date:   Thu Mar 15 14:44:51 2007 -0700

    IB/ipath: Support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC

    This patch adds support for multiple RDMA reads and atomics to be sent
    before an ACK is required to be seen by the requester.

    Signed-off-by: Bryan O'Sullivan <bryan.osullivan at qlogic.com>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

added support for the larger RD_ATOMICs, but it failed
to take out the stricter checks that ran before the new
ones, so the change had no effect.  this patch takes out
the bogus checks...

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_qp.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index d317b81..1324b35 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -516,14 +516,6 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 		if (attr->path_mtu > IB_MTU_2048)
 			goto inval;
 
-	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)
-		if (attr->max_dest_rd_atomic > 1)
-			goto inval;
-
-	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC)
-		if (attr->max_rd_atomic > 1)
-			goto inval;
-
 	if (attr_mask & IB_QP_PATH_MIG_STATE)
 		if (attr->path_mig_state != IB_MIG_MIGRATED &&
 		    attr->path_mig_state != IB_MIG_REARM)
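
The problem described in this commit message, an older strict check running
ahead of a newer, looser one, can be sketched in isolation. Everything below
is illustrative: HYPOTHETICAL_MAX_RD_ATOMIC and both function names are
assumptions, and the real limit used by the driver is not shown in this
patch.

```c
#include <assert.h>
#include <stdbool.h>

#define HYPOTHETICAL_MAX_RD_ATOMIC 4 /* assumed value, for illustration only */

static_assert(HYPOTHETICAL_MAX_RD_ATOMIC > 1,
	      "the newer limit must be looser than the stale one");

/* Before: the stale "> 1" test rejects everything the newer test would allow. */
static bool rd_atomic_valid_before(unsigned max_rd_atomic)
{
	if (max_rd_atomic > 1) /* stale strict check, runs first */
		return false;
	if (max_rd_atomic > HYPOTHETICAL_MAX_RD_ATOMIC) /* never decides anything */
		return false;
	return true;
}

/* After: only the looser limit remains. */
static bool rd_atomic_valid_after(unsigned max_rd_atomic)
{
	return max_rd_atomic <= HYPOTHETICAL_MAX_RD_ATOMIC;
}
```

A request for 2 outstanding reads is rejected by the first function and
accepted by the second, which is the behaviour change the patch is after.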


From arthur.jones at qlogic.com  Fri Jul  6 12:56:38 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 6 Jul 2007 12:56:38 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for
	2.6.23
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070706195638.GA25384@bauxite.pathscale.com>

hi roland, i had the wrong email for you
when i sent this the first time to the list.

i've since bounced these messages to you,
but reply-to-all will need to get fixed up
by others when/if they reply...

sorry for the confusion...

arthur

On Fri, Jul 06, 2007 at 12:48:17PM -0700, Arthur Jones wrote:
> hi roland,  here is the latest set of patches
> for 2.6.23.  this set should address all your
> comments except the ppc ioremap flags issue
> (which is still being worked on).  the barrier
> patch now has comments and the bad code that benh
> pointed out has been eliminated by removing support
> for older non-production HTX cards.
> 
> these patches are avail to pull from:
> 
> git://git.qlogic.com/ipath-linux-2.6 for-roland
> 
> nb: when i tried pulling into a for-2.6.23 branch
> in your repo, i got three trivial merge conflicts
> (take the new stuff).  plz let me know if you would
> rather i re-base these to your tree...
> 
> arthur


From halr at voltaire.com  Fri Jul  6 13:56:08 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Jul 2007 16:56:08 -0400
Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS
In-Reply-To: <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
	<20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>
Message-ID: <1183755367.25217.102865.camel@hal.voltaire.com>

On Fri, 2007-07-06 at 15:48, Arthur Jones wrote:
> Bryan is no longer with QLogic and we now
> have a public git server and a public email
> alias for infinipath driver patches.
> 
> Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
> ---
> 
>  MAINTAINERS |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 23a04f4..32f5701 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1989,9 +1989,10 @@ M:	jjciarla at raiz.uncu.edu.ar
>  S:	Maintained
>  
>  IPATH DRIVER:
> -P:	Bryan O'Sullivan
> -M:	support at pathscale.com
> +P:	Arthur Jones
> +M:	infinipath at qlogic.com
>  L:	openib-general at openib.org

Shouldn't this now be general at lists.openfabrics.org ?

> +T:	git git://git.qlogic.com/ipath-linux-2.6
>  S:	Supported
>  
>  IPMI SUBSYSTEM
> 


From halr at voltaire.com  Fri Jul  6 14:01:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Jul 2007 17:01:15 -0400
Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release
In-Reply-To: <1183746810.5165.37.camel@firewall.xsintricity.com>
References: <1183124231.28870.268894.camel@hal.voltaire.com>
	<1183566887.16081.126.camel@firewall.xsintricity.com>
	<1183726824.25217.69120.camel@hal.voltaire.com>
	<1183746810.5165.37.camel@firewall.xsintricity.com>
Message-ID: <1183755672.25217.103223.camel@hal.voltaire.com>

On Fri, 2007-07-06 at 14:33, Doug Ledford wrote:
> On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote:
> > On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: 
> > > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote:
> > > > There is a new release of the management libraries which include the
> > > > ANSIfied header files available in:
> > > > 
> > > > http://www.openfabrics.org/~halr/
> > > > 
> > > > md5sum
> > > > a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz
> > > > 288b865a0015ac3251cffa011a7633eb  libibumad-1.0.6.tar.gz
> > > > 04a5b6dcd2ee930f44d5715ee013f78b  libibmad-1.0.6.tar.gz
> > > 
> > > Hey Hal, I noticed you have release tarballs there for the libs, and one
> > > for the older named openib-diags.  What would it take to get a release
> > > tarball for infiniband-diags and one for opensm?
> > 
> > We're not quite there yet; There are a couple of outstanding items:
> > OpenSM (master) does not yet pass all the regressions, and I'd like
> > libibumad to support the upcoming user_mad ABI change for partition
> > support. After these are resolved, I think that a release of these would
> > then be in order. Hopefully, this can be in the next few weeks.
> 
> It doesn't need to be a new release.  Just a tarball from any previous
> stable release will work.

There have been no stable releases on master since the name
changes etc. were made. One could release before the pkey index
changes (with another release to cover them later), but I think
the regressions should pass before we call this "stable". I'd like
to understand the urgency of releasing these. I'm hoping we can get
there in the next week or two.

-- Hal


From dledford at redhat.com  Fri Jul  6 14:10:15 2007
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 06 Jul 2007 17:10:15 -0400
Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release
In-Reply-To: <1183755672.25217.103223.camel@hal.voltaire.com>
References: <1183124231.28870.268894.camel@hal.voltaire.com>
	<1183566887.16081.126.camel@firewall.xsintricity.com>
	<1183726824.25217.69120.camel@hal.voltaire.com>
	<1183746810.5165.37.camel@firewall.xsintricity.com>
	<1183755672.25217.103223.camel@hal.voltaire.com>
Message-ID: <1183756215.5165.43.camel@firewall.xsintricity.com>

On Fri, 2007-07-06 at 17:01 -0400, Hal Rosenstock wrote:
> On Fri, 2007-07-06 at 14:33, Doug Ledford wrote:
> > On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote:
> > > On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: 
> > > > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote:
> > > > > There is a new release of the management libraries which include the
> > > > > ANSIfied header files available in:
> > > > > 
> > > > > http://www.openfabrics.org/~halr/
> > > > > 
> > > > > md5sum
> > > > > a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz
> > > > > 288b865a0015ac3251cffa011a7633eb  libibumad-1.0.6.tar.gz
> > > > > 04a5b6dcd2ee930f44d5715ee013f78b  libibmad-1.0.6.tar.gz
> > > > 
> > > > Hey Hal, I noticed you have release tarballs there for the libs, and one
> > > > for the older named openib-diags.  What would it take to get a release
> > > > tarball for infiniband-diags and one for opensm?
> > > 
> > > We're not quite there yet; There are a couple of outstanding items:
> > > OpenSM (master) does not yet pass all the regressions, and I'd like
> > > libibumad to support the upcoming user_mad ABI change for partition
> > > support. After these are resolved, I think that a release of these would
> > > then be in order. Hopefully, this can be in the next few weeks.
> > 
> > It doesn't need to be a new release.  Just a tarball from any previous
> > stable release will work.
> 
> There were no previous stable releases on master since the name
> changes/etc. have been made. One can say to release before the pkey
> index changes (and then another release would cover this later) but I
> think the regressions should pass before we call this "stable". I'd like
> to understand the urgency of releasing these. I'm hoping we can get
> there in the next week or two.

It's not a major urgency, I just figured it wouldn't be a difficult
thing to do.  I'm just working on getting the various packages from the
management tree through the Fedora review process.  For that, they want
the package built from a release tarball, not from a git repo.  You've
got releases up there for the three libs, but opensm and
infiniband-diags aren't there.  Having something allows me to keep that
process going.  But, it's not a big deal either, it can wait.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070706/46353a8f/attachment.sig>

From arthur.jones at qlogic.com  Fri Jul  6 14:10:53 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 6 Jul 2007 14:10:53 -0700
Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS
In-Reply-To: <1183755367.25217.102865.camel@hal.voltaire.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
	<20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>
	<1183755367.25217.102865.camel@hal.voltaire.com>
Message-ID: <20070706211053.GD24755@bauxite.pathscale.com>

hi hal, ...

On Fri, Jul 06, 2007 at 04:56:08PM -0400, Hal Rosenstock wrote:
> On Fri, 2007-07-06 at 15:48, Arthur Jones wrote:
> >  IPATH DRIVER:
> > -P:	Bryan O'Sullivan
> > -M:	support at pathscale.com
> > +P:	Arthur Jones
> > +M:	infinipath at qlogic.com
> >  L:	openib-general at openib.org
> 
> Shouldn't this now be general at lists.openfabrics.org ?

yes -- INFINIBAND entry needs to get fixed up as well...

thanks!

arthur


From arthur.jones at qlogic.com  Fri Jul  6 14:25:15 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Fri, 6 Jul 2007 14:25:15 -0700
Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS
In-Reply-To: <20070706211053.GD24755@bauxite.pathscale.com>
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
	<20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>
	<1183755367.25217.102865.camel@hal.voltaire.com>
	<20070706211053.GD24755@bauxite.pathscale.com>
Message-ID: <20070706212515.GB25384@bauxite.pathscale.com>

hi roland,  updated MAINTAINERS patch is attached
and pushed to public git server...

arthur

On Fri, Jul 06, 2007 at 02:10:53PM -0700, Arthur Jones wrote:
> hi hal, ...
> 
> On Fri, Jul 06, 2007 at 04:56:08PM -0400, Hal Rosenstock wrote:
> > On Fri, 2007-07-06 at 15:48, Arthur Jones wrote:
> > >  IPATH DRIVER:
> > > -P:	Bryan O'Sullivan
> > > -M:	support at pathscale.com
> > > +P:	Arthur Jones
> > > +M:	infinipath at qlogic.com
> > >  L:	openib-general at openib.org
> > 
> > Shouldn't this now be general at lists.openfabrics.org ?
> 
> yes -- INFINIBAND entry needs to get fixed up as well...
> 
> thanks!
> 
> arthur
-------------- next part --------------
IB/ipath -- update MAINTAINERS

From: Arthur Jones <arthur.jones at qlogic.com>

Bryan is no longer with QLogic and we now
have a public git server and a public email
alias for infinipath driver patches.  And,
as pointed out by Hal Rosenstock, the mailing
list has changed as well.

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 MAINTAINERS |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 23a04f4..b98ab7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1989,9 +1989,10 @@ M:	jjciarla at raiz.uncu.edu.ar
 S:	Maintained
 
 IPATH DRIVER:
-P:	Bryan O'Sullivan
-M:	support at pathscale.com
-L:	openib-general at openib.org
+P:	Arthur Jones
+M:	infinipath at qlogic.com
+L:	general at lists.openfabrics.org
+T:	git git://git.qlogic.com/ipath-linux-2.6
 S:	Supported
 
 IPMI SUBSYSTEM

From kliteyn at dev.mellanox.co.il  Fri Jul  6 14:28:38 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sat, 07 Jul 2007 00:28:38 +0300
Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs
In-Reply-To: <20070706121223.GA7555@sashak.voltaire.com>
References: <468CA13B.2040900@dev.mellanox.co.il>
	<20070706121223.GA7555@sashak.voltaire.com>
Message-ID: <468EB406.9010905@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 10:43 Thu 05 Jul     , Yevgeny Kliteynik wrote:
>>  Hi Hal,
>>
>>  opensm.fdbs dump function adaptation to the recent changes in min hop tables
>>  broke fat-tree routing (or any other future routing that may not use the 
>>  same
>>  min hop tables creation functions).
> 
> Could you please explain how this dump function break the routing for
> fat-tree? Thanks.

Example:
   - We're dumping table for switch SW_A, and the target is CA.
   - To get to CA from SW_A, there are at least two options:
       1. SW_A->...->SW_X->...->SW_B->CA
       2. SW_A->...->SW_Y->...->SW_B->CA
   - Fat-tree may choose to go through SW_X when routing from SW_A to CA,
     and through SW_Y when routing from SW_A to SW_B, hence it might choose
     different ports on SW_A

In the recent optimization for MinHop and Up/Dn, min hop tables creation is done
only for switches, and in order to go from SW_A to CA the algorithm checks which
switch is connected to CA (SW_B in this case), and chooses the port on SW_A that
routes to SW_B, hence routes to SW_B and CA have to be the same (except for the
last SW_B->CA hop).
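
The difference between the two lookups can be sketched in a few lines of C.
This is a toy model under stated assumptions, not OpenSM code: the names
toy_switch, toy_hop_count, toy_ca_hops and the NO_PATH value are invented
for illustration, and the special case where the CA hangs directly off the
switch being dumped is left out. It only shows the idea of the patch: use
the CA's own hop-table entry when the routing engine filled one in, and
otherwise fall back to the leaf switch's entry plus the last hop.

```c
#include <stdint.h>
#include <string.h>

#define NO_PATH  0xFF  /* stand-in for OSM_NO_PATH; the value is an assumption */
#define MAX_LID  8     /* toy table sizes, hypothetical */
#define MAX_PORT 4

/* Toy per-switch hop table: hops[lid][port] = hops to `lid` via `port`. */
struct toy_switch {
	uint8_t hops[MAX_LID][MAX_PORT];
};

/* Mark every (lid, port) entry as unreachable. */
static void toy_switch_init(struct toy_switch *sw)
{
	memset(sw->hops, NO_PATH, sizeof(sw->hops));
}

static uint8_t toy_hop_count(const struct toy_switch *sw,
			     uint16_t lid, uint8_t port)
{
	return sw->hops[lid][port];
}

/*
 * Hops from `sw` to the CA with LID `ca_lid` through `port`, where the CA
 * is attached to the leaf switch with LID `leaf_lid`.
 */
static uint8_t toy_ca_hops(const struct toy_switch *sw, uint16_t ca_lid,
			   uint16_t leaf_lid, uint8_t port)
{
	uint8_t direct = toy_hop_count(sw, ca_lid, port);
	uint8_t to_leaf;

	/* The engine recorded a route to the CA itself: use it as is. */
	if (direct != NO_PATH)
		return direct;

	/* Switch-only tables: route to the leaf switch, then one more hop. */
	to_leaf = toy_hop_count(sw, leaf_lid, port);
	if (to_leaf == NO_PATH)
		return NO_PATH;
	return (uint8_t)(to_leaf + 1);
}
```

With switch-only tables the CA is always reported through the same port as
its leaf switch; with a direct entry the two may differ, which is the
fat-tree case described above.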

-- Yevgeny

> Sasha
> 
>>  -- Yevgeny
>>
>>  Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>  ---
>>   opensm/opensm/osm_ucast_mgr.c |   33 ++++++++++++++++++++++++---------
>>   1 files changed, 24 insertions(+), 9 deletions(-)
>>
>>  diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
>>  index 5bcb655..cab272e 100644
>>  --- a/opensm/opensm/osm_ucast_mgr.c
>>  +++ b/opensm/opensm/osm_ucast_mgr.c
>>  @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(
>>
>>   /**********************************************************************
>>    **********************************************************************/
>>  +
>>   static void
>>   __osm_ucast_mgr_dump_ucast_routes(
>>     IN cl_map_item_t *p_map_item,
>>  @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>>     uint8_t                  best_port;
>>     uint16_t                 max_lid_ho;
>>     uint16_t                 lid_ho, base_lid;
>>  +  boolean_t                direct_route_exists = FALSE;
>>     osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
>>     osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
>>     FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
>>  @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
>>       */
>>       if( p_port->p_node->sw )
>>       {
>>  +      /* Target LID is switch.
>>  +         Get its base lid and check hop count for this base LID only.*/
>>         base_lid = osm_node_get_base_lid(p_port->p_node, 0);
>>         base_lid = cl_ntoh16(base_lid);
>>         num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
>>       }
>>       else
>>       {
>>  -      osm_physp_t *p_physp = p_port->p_physp;
>>  -      if( !p_physp || !p_physp->p_remote_physp ||
>>  -          !p_physp->p_remote_physp->p_node->sw )
>>  -        num_hops = OSM_NO_PATH;
>>  +      /* Target LID is not switch (CA or router).
>>  +         Check if we have route to this target from current switch.*/
>>  +      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
>>  +      if (num_hops != OSM_NO_PATH)
>>  +      {
>>  +          direct_route_exists = TRUE;
>>  +          base_lid = lid_ho;
>>  +      }
>>         else
>>         {
>>  -        base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 
>>  0);
>>  -        base_lid = cl_ntoh16(base_lid);
>>  -        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>>  -                   0 : osm_switch_get_hop_count( p_sw, base_lid, port_num 
>>  );
>>  +        osm_physp_t *p_physp = p_port->p_physp;
>>  +        if( !p_physp || !p_physp->p_remote_physp ||
>>  +            !p_physp->p_remote_physp->p_node->sw )
>>  +          num_hops = OSM_NO_PATH;
>>  +        else
>>  +        {
>>  +          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 
>>  0);
>>  +          base_lid = cl_ntoh16(base_lid);
>>  +          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>>  +                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num 
>>  );
>>  +        }
>>         }
>>       }
>>
>>  @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>>       }
>>
>>       best_hops = osm_switch_get_least_hops( p_sw, base_lid );
>>  -    if (!p_port->p_node->sw)
>>  +    if (!p_port->p_node->sw && !direct_route_exists)
>>       {
>>         best_hops++;
>>         num_hops++;
>>  -- 
>>  1.5.1.4
>>
>>
> 


From xhejtman at ics.muni.cz  Sat Jul  7 01:53:03 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Sat, 7 Jul 2007 10:53:03 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adasl8232zv.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
Message-ID: <20070707085303.GS3885@ics.muni.cz>

On Thu, Jul 05, 2007 at 03:22:12PM -0700, Roland Dreier wrote:
> Loading and unloading ib_mthca many times works fine on a non-Xen
> system.  So there is something different about the Xen environment
> that is causing a problem.  It could be a bug in mthca exposed by Xen
> (eg improper use of the DMA mapping API or something like that).
> 
> Can you turn on all the memory debugging options like SLAB_DEBUG
> etc. and see if it turns up anything?

Well, I turned on slab debug, vm debug and mthca debug. The output is below.
Anything interesting in it?

# insmod ib_mthca.ko debug_level=1
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:08:00.0
PCI: Enabling device 0000:08:00.0 (0000 -> 0002)
Slab corruption: start=ffff880098f513b8, len=256
Redzone: 0x1600000016/0x1700000017.
Last user: <0000001800000018>(0x1800000018)
000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00
Prev obj: start=0000000398f5120b, len=256
Unable to handle kernel paging request at 0000000398f5130b RIP: 
 <ffffffff80277313> print_objinfo+0x22/0xde
PGD 9b0a1067 PUD 0 
Oops: 0000 1 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2193, comm: insmod Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff80277313>  <ffffffff80277313> print_objinfo+0x22/0xde
RSP: e02b:ffff88009acfd8c8  EFLAGS: 00010206
RAX: 0000000398f5130b RBX: 00000000008bd8c1 RCX: ffffffffff57c000
RDX: 0000000000000002 RSI: 0000000398f51203 RDI: ffff8800015f20c0
RBP: ffff8800015f20c0 R08: ffff88009ae9e3c8 R09: 00000000000035eb
R10: ffff88009acfd818 R11: ffffffff802fd0b5 R12: 0000000398f51203
R13: 0000000000000002 R14: ffff880098f513b0 R15: ffff880098f51000
FS:  00002aaaaadedb00(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process insmod (pid: 2193, threadinfo ffff88009acfc000, task ffff88009c3a1080)
Stack:  00000000008bd8c1 ffff8800015f20c0 0000000398f51203 0000000000000100
 ffff880098f513b0 ffffffff80277521 ffff8800015f20c0 0000000000000000
 ffff8800015f20c0 ffff880098f513b0 ffffffff88318ece 00000000000000d0
Call Trace:
 <ffffffff80277521> check_poison_obj+0x152/0x1ae
 <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 <ffffffff80278269> cache_alloc_debugcheck_after+0x34/0x1b0
 <ffffffff802784d7> kmem_cache_alloc+0xf2/0x102
 <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 <ffffffff88319263> :ib_mthca:mthca_alloc_icm_table+0x138/0x227
 <ffffffff88307bab> :ib_mthca:mthca_init_hca+0x5ee/0xde7
 <ffffffff802bb44d> sysfs_add_file+0x77/0x86
 <ffffffff803228d9> device_create_file+0x31/0x39
 <ffffffff883088d3> :ib_mthca:__mthca_init_one+0x52f/0xb50
 <ffffffff80277073> poison_obj+0x24/0x2d
 <ffffffff88308f6a> :ib_mthca:mthca_init_one+0x76/0x8b
 <ffffffff802f54df> pci_device_probe+0x4a/0x70
 <ffffffff80324481> driver_probe_device+0x52/0xa8
 <ffffffff803245ac> __driver_attach+0x6b/0xa9
 <ffffffff80324541> __driver_attach+0x0/0xa9
 <ffffffff803239c2> bus_for_each_dev+0x43/0x6e
 <ffffffff80323d04> bus_add_driver+0x73/0x10f
 <ffffffff802f50f7> __pci_register_driver+0x57/0x7e
 <ffffffff88186193> :ib_mthca:mthca_init+0x135/0x148
 <ffffffff802478ce> sys_init_module+0x16e1/0x180a
 <ffffffff802099da> system_call+0x86/0x8b
 <ffffffff80209954> system_call+0x0/0x8b


Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e

-- 
Lukáš Hejtmánek


From vlad at lists.openfabrics.org  Sat Jul  7 02:44:40 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat,  7 Jul 2007 02:44:40 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070707-0200 daily build status
Message-ID: <20070707094440.94F65E6082B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From sashak at voltaire.com  Sat Jul  7 05:56:53 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 7 Jul 2007 15:56:53 +0300
Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs
In-Reply-To: <468EB406.9010905@dev.mellanox.co.il>
References: <468CA13B.2040900@dev.mellanox.co.il>
	<20070706121223.GA7555@sashak.voltaire.com>
	<468EB406.9010905@dev.mellanox.co.il>
Message-ID: <20070707125653.GI8061@sashak.voltaire.com>

On 00:28 Sat 07 Jul     , Yevgeny Kliteynik wrote:
>  Hi Sasha,
> 
>  Sasha Khapyorsky wrote:
> > Hi Yevgeny,
> > On 10:43 Thu 05 Jul     , Yevgeny Kliteynik wrote:
> >>  Hi Hal,
> >>
> >>  opensm.fdbs dump function adaptation to the recent changes in min hop 
> >> tables
> >>  broke fat-tree routing (or any other future routing that may not use the  
> >> same
> >>  min hop tables creation functions).
> > Could you please explain how this dump function break the routing for
> > fat-tree? Thanks.
> 
>  Example:
>    - We're dumping table for switch SW_A, and the target is CA.
>    - To get to CA from SW_A, there are at leas two options:
>        1. SW_A->...->SW_X->...->SW_B->CA
>        2. SW_A->...->SW_Y->...->SW_B->CA
>    - Fat-tree may chose to go through SW_X when routing from SW_A to CA,
>      and through SW_Y when routing from SW_A to SW_B, hence it might chose
>      different ports on SW_A

OK, so you are referring to incorrect dump info, and not to the routing
itself?

>  In the recent optimization for MinHop and Up/Dn, min hop tables creation is 
>  done
>  only for switches,

BTW, is such an optimization suitable for the fat-tree engine?

Sasha

> and in order to go from SW_A to CA the algorithm checks 
>  which
>  switch is connected to CA (SW_B in this case), and choses the port on SW_A 
>  that
>  routes to SW_B, hence routes to SW_B and CA have to be the same (except for 
>  the
>  last SW_B->CA hop).
> 
>  -- Yevgeny
> 
> > Sasha
> >>  -- Yevgeny
> >>
> >>  Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> >>  ---
> >>   opensm/opensm/osm_ucast_mgr.c |   33 ++++++++++++++++++++++++---------
> >>   1 files changed, 24 insertions(+), 9 deletions(-)
> >>
> >>  diff --git a/opensm/opensm/osm_ucast_mgr.c 
> >> b/opensm/opensm/osm_ucast_mgr.c
> >>  index 5bcb655..cab272e 100644
> >>  --- a/opensm/opensm/osm_ucast_mgr.c
> >>  +++ b/opensm/opensm/osm_ucast_mgr.c
> >>  @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(
> >>
> >>   /**********************************************************************
> >>    **********************************************************************/
> >>  +
> >>   static void
> >>   __osm_ucast_mgr_dump_ucast_routes(
> >>     IN cl_map_item_t *p_map_item,
> >>  @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
> >>     uint8_t                  best_port;
> >>     uint16_t                 max_lid_ho;
> >>     uint16_t                 lid_ho, base_lid;
> >>  +  boolean_t                direct_route_exists = FALSE;
> >>     osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
> >>     osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context 
> >> *)cxt)->p_mgr;
> >>     FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
> >>  @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
> >>       */
> >>       if( p_port->p_node->sw )
> >>       {
> >>  +      /* Target LID is switch.
> >>  +         Get its base lid and check hop count for this base LID only.*/
> >>         base_lid = osm_node_get_base_lid(p_port->p_node, 0);
> >>         base_lid = cl_ntoh16(base_lid);
> >>         num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
> >>       }
> >>       else
> >>       {
> >>  -      osm_physp_t *p_physp = p_port->p_physp;
> >>  -      if( !p_physp || !p_physp->p_remote_physp ||
> >>  -          !p_physp->p_remote_physp->p_node->sw )
> >>  -        num_hops = OSM_NO_PATH;
> >>  +      /* Target LID is not switch (CA or router).
> >>  +         Check if we have route to this target from current switch.*/
> >>  +      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
> >>  +      if (num_hops != OSM_NO_PATH)
> >>  +      {
> >>  +          direct_route_exists = TRUE;
> >>  +          base_lid = lid_ho;
> >>  +      }
> >>         else
> >>         {
> >>  -        base_lid = 
> >> osm_node_get_base_lid(p_physp->p_remote_physp->p_node,  0);
> >>  -        base_lid = cl_ntoh16(base_lid);
> >>  -        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
> >>  -                   0 : osm_switch_get_hop_count( p_sw, base_lid, 
> >> port_num  );
> >>  +        osm_physp_t *p_physp = p_port->p_physp;
> >>  +        if( !p_physp || !p_physp->p_remote_physp ||
> >>  +            !p_physp->p_remote_physp->p_node->sw )
> >>  +          num_hops = OSM_NO_PATH;
> >>  +        else
> >>  +        {
> >>  +          base_lid = 
> >> osm_node_get_base_lid(p_physp->p_remote_physp->p_node,  0);
> >>  +          base_lid = cl_ntoh16(base_lid);
> >>  +          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
> >>  +                     0 : osm_switch_get_hop_count( p_sw, base_lid, 
> >> port_num  );
> >>  +        }
> >>         }
> >>       }
> >>
> >>  @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
> >>       }
> >>
> >>       best_hops = osm_switch_get_least_hops( p_sw, base_lid );
> >>  -    if (!p_port->p_node->sw)
> >>  +    if (!p_port->p_node->sw && !direct_route_exists)
> >>       {
> >>         best_hops++;
> >>         num_hops++;
> >>  --  1.5.1.4
> >>
> >>
> 


From kliteyn at dev.mellanox.co.il  Sat Jul  7 12:18:23 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sat, 07 Jul 2007 22:18:23 +0300
Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs
In-Reply-To: <20070707125653.GI8061@sashak.voltaire.com>
References: <468CA13B.2040900@dev.mellanox.co.il>
	<20070706121223.GA7555@sashak.voltaire.com>
	<468EB406.9010905@dev.mellanox.co.il>
	<20070707125653.GI8061@sashak.voltaire.com>
Message-ID: <468FE6FF.3020508@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 00:28 Sat 07 Jul     , Yevgeny Kliteynik wrote:
>>  Hi Sasha,
>>
>>  Sasha Khapyorsky wrote:
>>> Hi Yevgeny,
>>> On 10:43 Thu 05 Jul     , Yevgeny Kliteynik wrote:
>>>>  Hi Hal,
>>>>
>>>>  The adaptation of the opensm.fdbs dump function to the recent changes in
>>>>  the min hop tables broke fat-tree routing (or any other future routing
>>>>  that may not use the same min hop table creation functions).
>>> Could you please explain how this dump function breaks the routing for
>>> fat-tree? Thanks.
>>  Example:
>>    - We're dumping table for switch SW_A, and the target is CA.
>>    - To get to CA from SW_A, there are at least two options:
>>        1. SW_A->...->SW_X->...->SW_B->CA
>>        2. SW_A->...->SW_Y->...->SW_B->CA
>>    - Fat-tree may choose to go through SW_X when routing from SW_A to CA,
>>      and through SW_Y when routing from SW_A to SW_B, hence it might choose
>>      different ports on SW_A
> 
> OK, so you are referring to incorrect dump info, and not the routing
> itself?

Yes, sorry if I wasn't clear on this.

>>  In the recent optimization for MinHop and Up/Dn, min hop table creation is
>>  done only for switches,
> 
> BTW, is such an optimization suitable for the fat-tree engine?

No. In fact, fat-tree doesn't use these tables for routing - it creates them
as a routing by-product w/o any additional complexity.
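
As a rough illustration of the dump-side lookup discussed in this thread, here is a minimal C sketch. It is not the OpenSM code: sw_t, sw_init(), get_hop_count(), ca_hops() and NO_PATH are hypothetical stand-ins (NO_PATH plays the role of OSM_NO_PATH), and the hop table is a plain array. The idea is to ask the switch's table about the target LID itself first, and to fall back to "hops to the attached switch, plus one" only when the table has no entry for that LID.

```c
#include <stdint.h>

#define NO_PATH  0xFF  /* hypothetical stand-in for OSM_NO_PATH */
#define MAX_LID  8
#define MAX_PORT 4

/* Hypothetical per-switch min hop table: hops[lid][port]. */
typedef struct {
	uint8_t hops[MAX_LID][MAX_PORT];
} sw_t;

/* Mark every (lid, port) entry as unreachable. */
static void sw_init(sw_t *sw)
{
	for (int lid = 0; lid < MAX_LID; lid++)
		for (int port = 0; port < MAX_PORT; port++)
			sw->hops[lid][port] = NO_PATH;
}

/* Bounds-checked table lookup. */
static uint8_t get_hop_count(const sw_t *sw, uint16_t lid, uint8_t port)
{
	if (lid >= MAX_LID || port >= MAX_PORT)
		return NO_PATH;
	return sw->hops[lid][port];
}

/*
 * Hops from this switch to a CA LID through the given port.
 * Prefer a direct table entry for the CA (a routing engine such as
 * fat-tree may have filled one in); otherwise use the hop count to the
 * switch the CA hangs off and add the last switch->CA hop.
 */
static uint8_t ca_hops(const sw_t *sw, uint16_t ca_lid,
		       uint16_t leaf_sw_lid, uint8_t port)
{
	uint8_t hops = get_hop_count(sw, ca_lid, port);

	if (hops != NO_PATH)
		return hops;	/* direct route exists */

	hops = get_hop_count(sw, leaf_sw_lid, port);
	return hops == NO_PATH ? NO_PATH : (uint8_t)(hops + 1);
}
```

With only the leaf switch entry filled in, ca_hops() returns that entry plus one; once a direct entry for the CA exists, it is returned unchanged. The special case where the CA is attached to the dumping switch itself is left out of this sketch.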

-- Yevgeny

> Sasha
> 
>> and in order to go from SW_A to CA the algorithm checks which switch is
>>  connected to CA (SW_B in this case), and chooses the port on SW_A that
>>  routes to SW_B, hence routes to SW_B and CA have to be the same (except
>>  for the last SW_B->CA hop).
>>
>>  -- Yevgeny
>>
>>> Sasha
>>>>  -- Yevgeny
>>>>
>>>>  Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>>>  ---
>>>>   opensm/opensm/osm_ucast_mgr.c |   33 ++++++++++++++++++++++++---------
>>>>   1 files changed, 24 insertions(+), 9 deletions(-)
>>>>
>>>>  diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
>>>>  index 5bcb655..cab272e 100644
>>>>  --- a/opensm/opensm/osm_ucast_mgr.c
>>>>  +++ b/opensm/opensm/osm_ucast_mgr.c
>>>>  @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(
>>>>
>>>>   /**********************************************************************
>>>>    **********************************************************************/
>>>>  +
>>>>   static void
>>>>   __osm_ucast_mgr_dump_ucast_routes(
>>>>     IN cl_map_item_t *p_map_item,
>>>>  @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>>>>     uint8_t                  best_port;
>>>>     uint16_t                 max_lid_ho;
>>>>     uint16_t                 lid_ho, base_lid;
>>>>  +  boolean_t                direct_route_exists = FALSE;
>>>>     osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
>>>>     osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
>>>>     FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
>>>>  @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
>>>>       */
>>>>       if( p_port->p_node->sw )
>>>>       {
>>>>  +      /* Target LID is switch.
>>>>  +         Get its base lid and check hop count for this base LID only.*/
>>>>         base_lid = osm_node_get_base_lid(p_port->p_node, 0);
>>>>         base_lid = cl_ntoh16(base_lid);
>>>>         num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
>>>>       }
>>>>       else
>>>>       {
>>>>  -      osm_physp_t *p_physp = p_port->p_physp;
>>>>  -      if( !p_physp || !p_physp->p_remote_physp ||
>>>>  -          !p_physp->p_remote_physp->p_node->sw )
>>>>  -        num_hops = OSM_NO_PATH;
>>>>  +      /* Target LID is not switch (CA or router).
>>>>  +         Check if we have route to this target from current switch.*/
>>>>  +      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
>>>>  +      if (num_hops != OSM_NO_PATH)
>>>>  +      {
>>>>  +          direct_route_exists = TRUE;
>>>>  +          base_lid = lid_ho;
>>>>  +      }
>>>>         else
>>>>         {
>>>>  -        base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
>>>>  -        base_lid = cl_ntoh16(base_lid);
>>>>  -        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>>>>  -                   0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
>>>>  +        osm_physp_t *p_physp = p_port->p_physp;
>>>>  +        if( !p_physp || !p_physp->p_remote_physp ||
>>>>  +            !p_physp->p_remote_physp->p_node->sw )
>>>>  +          num_hops = OSM_NO_PATH;
>>>>  +        else
>>>>  +        {
>>>>  +          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
>>>>  +          base_lid = cl_ntoh16(base_lid);
>>>>  +          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
>>>>  +                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
>>>>  +        }
>>>>         }
>>>>       }
>>>>
>>>>  @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
>>>>       }
>>>>
>>>>       best_hops = osm_switch_get_least_hops( p_sw, base_lid );
>>>>  -    if (!p_port->p_node->sw)
>>>>  +    if (!p_port->p_node->sw && !direct_route_exists)
>>>>       {
>>>>         best_hops++;
>>>>         num_hops++;
>>>>  --  1.5.1.4
>>>>
>>>>
> 


From rdreier at cisco.com  Sat Jul  7 16:24:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 07 Jul 2007 16:24:16 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070707085303.GS3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Sat, 7 Jul 2007 10:53:03 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz>
Message-ID: <ada3azz3ihr.fsf@cisco.com>

 > Slab corruption: start=ffff880098f513b8, len=256
 > Redzone: 0x1600000016/0x1700000017.
 > Last user: <0000001800000018>(0x1800000018)

OK, CONFIG_DEBUG_SLAB is catching a slab getting corrupted with a
really strange pattern of incrementing values up to 1f.  Somehow
running under Xen is triggering this, since I run mthca with
CONFIG_DEBUG_SLAB set all the time and I've never seen anything like
this happen.
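
To make the "incrementing values" pattern concrete: the values quoted above (0x1600000016, 0x1700000017, 0x1800000018) each carry the same 32-bit counter in both halves, and the counter goes up by one from word to word. The following user-space C sketch checks a buffer for that shape. The function name is_counter_pattern() is made up for illustration, and the wrap-around at 0x20 is an assumption based on the values only going up to 1f.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if every 64-bit word has identical 32-bit halves and each
 * word's counter equals the previous word's counter plus one, wrapping
 * modulo `wrap`; return 0 otherwise.
 */
static int is_counter_pattern(const uint64_t *words, size_t n, uint32_t wrap)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t lo = (uint32_t)words[i];
		uint32_t hi = (uint32_t)(words[i] >> 32);

		if (lo != hi)
			return 0;	/* halves differ: not this pattern */

		if (i > 0) {
			uint32_t prev = (uint32_t)words[i - 1];

			if (lo != (prev + 1) % wrap)
				return 0;	/* counter did not advance by one */
		}
	}
	return 1;
}
```

A check like this only characterizes the garbage; it says nothing about who wrote it.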

 > Call Trace:
 >  <ffffffff80277521> check_poison_obj+0x152/0x1ae
 >  <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 >  <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 >  <ffffffff80278269> cache_alloc_debugcheck_after+0x34/0x1b0
 >  <ffffffff802784d7> kmem_cache_alloc+0xf2/0x102
 >  <ffffffff88318ece> :ib_mthca:mthca_alloc_icm+0xff/0x35c
 >  <ffffffff88319263> :ib_mthca:mthca_alloc_icm_table+0x138/0x227
 >  <ffffffff88307bab> :ib_mthca:mthca_init_hca+0x5ee/0xde7

It seems something bad is happening in mthca_alloc_icm, although the
corruption may have happened earlier.

But I don't understand how we could have reached mthca_alloc_icm()
without getting through mthca_QUERY_FW and printing the FW version
first... are you sure you're getting all the trace messages?  How are
you collecting them?  Can you make sure that your console level is set
so that you see messages printed with KERN_DEBUG?

 - R.


From stanleysufficool at roadrunner.com  Sat Jul  7 17:00:53 2007
From: stanleysufficool at roadrunner.com (Stanley Sufficool)
Date: Sat, 07 Jul 2007 17:00:53 -0700
Subject: [ofa-general] Compiling SRPT
Message-ID: <1183852853.6008.11.camel@gentoo-linux.localdomain>

Compiling on kernel 2.6.21-rc6 from Torvalds' branch on kernel.org.

Got the latest srpt from the git repository on OpenFabrics and had the
following issues.

ib_srpt.c, line 1997: missing second argument. Should it be the following?
sdev->scst_tgt = scst_register(tp, NULL);

SCST was built successfully after fixing an issue in scst_vdisk.c
(missing #include <linux/sched.h>) 

Just thought this would be nice to have documented; it took me half a day
to track down as a novice in C programming.

From xhejtman at ics.muni.cz  Sat Jul  7 17:15:32 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Sun, 8 Jul 2007 02:15:32 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <ada3azz3ihr.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
Message-ID: <20070708001531.GT3885@ics.muni.cz>

On Sat, Jul 07, 2007 at 04:24:16PM -0700, Roland Dreier wrote:
> But I don't understand how we could have reached mthca_alloc_icm()
> without getting through mthca_QUERY_FW and printing the FW version
> first... are you sure you're getting all the trace messages?  How are
> you collecting them?  Can you make sure that your console level is set
> so that you see messages printed with KERN_DEBUG?

You are right, the console did not receive debug messages, so I changed
mthca_dbg to print with KERN_ERR priority instead. (This time, it looks like
the corruption gets triggered at another place, and the driver complains about
not receiving an IRQ.)

Here is the result:
# insmod ib_mthca.ko debug_level=1
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:08:00.0
PCI: Enabling device 0000:08:00.0 (0000 -> 0002)
ib_mthca 0000:08:00.0: FW version 000100020000, max commands 16
ib_mthca 0000:08:00.0: Catastrophic error buffer at 0xb9382a50, size 0x10
ib_mthca 0000:08:00.0: FW supports commands through doorbells
ib_mthca 0000:08:00.0: Mapped doorbell page for posting FW commands
ib_mthca 0000:08:00.0: FW size 5136 KB
ib_mthca 0000:08:00.0: Clear int @ b93f00d8, EQ arm @ b9361748, EQ set CI @ b9372000
ib_mthca 0000:08:00.0: No HCA-attached memory (running in MemFree mode)
ib_mthca 0000:08:00.0: Mapped 1284 chunks/5136 KB for FW.
ib_mthca 0000:08:00.0: Base MM extensions: no
ib_mthca 0000:08:00.0: Max ICM size 523264 MB
ib_mthca 0000:08:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256
ib_mthca 0000:08:00.0: Max SRQs: 1024, reserved SRQs: 64, entry size: 32
ib_mthca 0000:08:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:08:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:08:00.0: reserved MPTs: 16, reserved MTTs: 2
ib_mthca 0000:08:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1
ib_mthca 0000:08:00.0: Max QP/MCG: 8388608, reserved MGMs: 0
ib_mthca 0000:08:00.0: Max CQEs: 131072, max WQEs: 16384, max SRQ WQEs: 16384
ib_mthca 0000:08:00.0: Flags: 00370347
ib_mthca 0000:08:00.0: profile 0--13/11 @ 0x               0 (size 0x20000000)
ib_mthca 0000:08:00.0: profile 1--10/20 @ 0x        20000000 (size 0x 4000000)
ib_mthca 0000:08:00.0: profile 2-- 0/16 @ 0x        24000000 (size 0x 1000000)
ib_mthca 0000:08:00.0: profile 3-- 7/18 @ 0x        25000000 (size 0x  800000)
ib_mthca 0000:08:00.0: profile 4-- 9/17 @ 0x        25800000 (size 0x  800000)
ib_mthca 0000:08:00.0: profile 5-- 3/16 @ 0x        26000000 (size 0x  400000)
ib_mthca 0000:08:00.0: profile 6-- 4/16 @ 0x        26400000 (size 0x  400000)
ib_mthca 0000:08:00.0: profile 7-- 8/13 @ 0x        26800000 (size 0x   80000)
ib_mthca 0000:08:00.0: profile 8--11/11 @ 0x        26880000 (size 0x   10000)
ib_mthca 0000:08:00.0: profile 9-- 2/10 @ 0x        26890000 (size 0x    8000)
ib_mthca 0000:08:00.0: profile10-- 1/ 0 @ 0x        26898000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile11-- 5/ 0 @ 0x        26899000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile12-- 6/ 5 @ 0x        2689a000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile13--12/ 0 @ 0x        2689b000 (size 0x    1000)
ib_mthca 0000:08:00.0: HCA context memory: reserving 631408 KB
ib_mthca 0000:08:00.0: 631408 KB of HCA context requires 1244 KB aux memory.
ib_mthca 0000:08:00.0: Mapped 311 chunks/1244 KB for ICM aux.
ib_mthca 0000:08:00.0: Mapped page at 24d8b000 to 2689a000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 20000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 1 chunks/256 KB at 25800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 24000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26400000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 8 chunks/32 KB at 26890000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26840000 for ICM.
Unable to handle kernel paging request at 0000001100000019 RIP: 
 <ffffffff803654eb> datagram_poll+0xcc/0xd6
PGD 0 
Oops: 0002 1 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff803654eb>  <ffffffff803654eb> datagram_poll+0xcc/0xd6
RSP: e02b:ffff880095e87a88  EFLAGS: 00010246
RAX: 0000001100000011 RBX: ffff8800971e2ac8 RCX: 000000000000000b
RDX: 0000000000000000 RSI: 0000000000000049 RDI: 0000000000000002
RBP: ffff88009c8ad390 R08: ffff880095e86000 R09: ffff880095e87760
R10: ffffffff803a492f R11: ffffffff803a492f R12: 0000000000000005
R13: 0000000000000020 R14: ffff880095e87ef8 R15: 0000000000000008
FS:  00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0)
Stack:  ffff8800973547b0 ffffffff803a4942 ffffffff803a492f 0000000000000300
 ffff88009c8ad390 0000000000000005 0000000000000020 ffffffff8028e1d3
 ffff880095e87f40 0000000000000000 ffff880095e87e10 ffff880095e87e18
Call Trace:
 <ffffffff803a4942> udp_poll+0x13/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff8028e1d3> do_select+0x2aa/0x464
 <ffffffff8028de4c> __pollwait+0x0/0xdd
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43
 <ffffffff8035f7da> sock_recvmsg+0x101/0x120
 <ffffffff80277073> poison_obj+0x24/0x2d
 <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff8028e600> sys_select+0x273/0x3e5
 <ffffffff8020fd0c> init_fpu+0x62/0x7f
 <ffffffff8020ab90> math_state_restore+0x21/0x4a
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301
 <ffffffff802099da> system_call+0x86/0x8b
 <ffffffff80209954> system_call+0x0/0x8b


Code: f0 0f ba 68 08 00 5b 89 f0 c3 41 57 41 89 f7 41 56 41 55 41 
RIP  <ffffffff803654eb> datagram_poll+0xcc/0xd6
 RSP <ffff880095e87a88>
CR2: 0000001100000019
 <1>Unable to handle kernel paging request at 0000000d0000001d RIP: 
 <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
PGD 0 
Oops: 0000 2 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff881484e3>  <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
RSP: e02b:ffff880095e87828  EFLAGS: 00010246
RAX: ffff8800971e4078 RBX: ffff88009cbe0bd0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000d0000000d
RBP: ffff880000b1e998 R08: ffff880000c39280 R09: 0000000000000298
R10: ffff880097eb1860 R11: 0000000000000298 R12: ffff880000b1e9a8
R13: 0000000000000009 R14: 0000000000000000 R15: 0000000000000001
FS:  00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0)
Stack:  ffffffff881484c7 ffffffff8027aa6a ffff880000b1e998 0000000000000001
 ffff880000b1e9a8 ffffffff8022c9f1 000009c000000a00 ffff880000b1e998
 ffff88009c0977e0 0000000000000001 0000000000000009 ffff880095e879d8
Call Trace:
 <ffffffff881484c7> :xfs:xfs_file_close+0x0/0x28
 <ffffffff8027aa6a> filp_close+0x36/0x64
 <ffffffff8022c9f1> put_files_struct+0x6c/0xbf
 <ffffffff8022dcab> do_exit+0x2ae/0x929
 <ffffffff8020622a> hypercall_page+0x22a/0x1000
 <ffffffff803c6783> do_page_fault+0x119e/0x1253
 <ffffffff8020de8b> monotonic_clock+0x3e/0x86
 <ffffffff803c1abd> thread_return+0x0/0x13d
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff803654eb> datagram_poll+0xcc/0xd6
 <ffffffff80365440> datagram_poll+0x21/0xd6
 <ffffffff803a4942> udp_poll+0x13/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff8028e1d3> do_select+0x2aa/0x464
 <ffffffff8028de4c> __pollwait+0x0/0xdd
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43
 <ffffffff8035f7da> sock_recvmsg+0x101/0x120
 <ffffffff80277073> poison_obj+0x24/0x2d
 <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff8028e600> sys_select+0x273/0x3e5
 <ffffffff8020fd0c> init_fpu+0x62/0x7f
 <ffffffff8020ab90> math_state_restore+0x21/0x4a
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301
 <ffffffff802099da> system_call+0x86/0x8b
 <ffffffff80209954> system_call+0x0/0x8b


Code: 48 8b 47 10 ff 50 10 41 5b f7 d8 c3 31 c0 48 83 ff 28 51 74 
RIP  <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
 RSP <ffff880095e87828>
CR2: 0000000d0000001d
 <1>Fixing recursive fault but reboot is needed!
syslog-ng1981: segfault at ffffffff80808080 rip 000055555555bc28 rsp 00007fffffffd868 error 6
Slab corruption: start=ffff880096c743b8, len=256
Redzone: 0x1600000016/0x1700000017.
Last user: <0000001800000018>(0x1800000018)
000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00
Prev obj: start=0000000396c7420b, len=256
Unable to handle kernel paging request at 0000000396c7430b RIP: 
 <ffffffff80277313> print_objinfo+0x22/0xde
PGD 0 
Oops: 0000 3 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 1981, comm: syslog-ng Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff80277313>  <ffffffff80277313> print_objinfo+0x22/0xde
RSP: e02b:ffff8800935cdb48  EFLAGS: 00010206
RAX: 0000000396c7430b RBX: 000000000089dac1 RCX: ffffffffff57c000
RDX: 0000000000000002 RSI: 0000000396c74203 RDI: ffff8800015f20c0
RBP: ffff8800015f20c0 R08: ffff880000cbc788 R09: 000000000000d64e
R10: ffff8800935cda98 R11: ffffffff802fd0b5 R12: 0000000396c74203
R13: 0000000000000002 R14: ffff880096c743b0 R15: ffff880096c74000
FS:  00002aaaab0186e0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process syslog-ng (pid: 1981, threadinfo ffff8800935cc000, task ffff88009c0e0860)
Stack:  000000000089dac1 ffff8800015f20c0 0000000396c74203 0000000000000100
 ffff880096c743b0 ffffffff80277521 ffff8800015f20c0 0000000000000000
 ffff8800015f20c0 ffff880096c743b0 ffffffff802ab3b1 00000000000000d0
Call Trace:
 <ffffffff80277521> check_poison_obj+0x152/0x1ae
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff80278269> cache_alloc_debugcheck_after+0x34/0x1b0
 <ffffffff802784d7> kmem_cache_alloc+0xf2/0x102
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff8027a67c> do_truncate+0x60/0x69
 <ffffffff802858f4> do_coredump+0x5a0/0x601
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff80236db4> __dequeue_signal+0x18b/0x19a
 <ffffffff80238129> get_signal_to_deliver+0x4ee/0x549
 <ffffffff80209115> do_signal+0x55/0x6d8
 <ffffffff803c67db> do_page_fault+0x11f6/0x1253
 <ffffffff88126d95> :xfs:xfs_iunlock+0x4f/0x7a
 <ffffffff8813eb19> :xfs:xfs_fsync+0x157/0x1a9
 <ffffffff80256c9f> __filemap_fdatawrite_range+0x51/0x5b
 <ffffffff8020a055> retint_signal+0x5d/0xb8


Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e 
RIP  <ffffffff80277313> print_objinfo+0x22/0xde
 RSP <ffff8800935cdb48>
CR2: 0000000396c7430b
 <3>ib_mthca 0000:08:00.0: Memory key throughput optimization activated.
ib_mthca 0000:08:00.0: Allocated EQ 1 with 131072 entries
ib_mthca 0000:08:00.0: Allocated EQ 2 with 512 entries
ib_mthca 0000:08:00.0: Setting mask 00000000001f47fe for eqn 2
ib_mthca 0000:08:00.0: NOP command failed to generate interrupt (IRQ 16), aborting.
ib_mthca 0000:08:00.0: BIOS or ACPI interrupt routing problem?
ib_mthca 0000:08:00.0: Clearing mask 00000000001f47fe for eqn 2
ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26800000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26840000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26890000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26000000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26400000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 24000000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 25800000 from ICM.


-- 
Lukáš Hejtmánek


From landman at scalableinformatics.com  Sat Jul  7 19:58:26 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat, 07 Jul 2007 22:58:26 -0400
Subject: [ofa-general] found a simple fix for OFED-1.2 builds on OpenSuSE
	10.2
Message-ID: <469052D2.8030203@scalableinformatics.com>

Hi folks:

   I found a "simple" fix for OFED-1.2 builds on OpenSuSE.  I was hoping 
for some advice on how to implement the fix, as I see a few options. 
Basically the problem is that OpenSuSE (and I assume future versions of 
SuSE) masks the HZ macro from non-kernel builds.  The fix is to replace 
every instance of HZ usage with a sysconf() call:

	-DHZ='sysconf(_SC_CLK_TCK)'

I have verified that, if I go into the build directory and rerun the 
command that failed in the original build.sh with

	CC="gcc -DHZ='sysconf(_SC_CLK_TCK)'" \
	CFLAGS="-DHZ='sysconf(_SC_CLK_TCK)'"

(continued on second line due to email client wrapping)

inserted immediately in front of the rpmbuild, this is sufficient for a 
correct and complete build of the ofa_user-1.2 rpms on an unmodified 
OpenSuSE 10.2 distribution.

OK.  So now we know how to fix (hack) this, and why it breaks.  The real 
fix is to seek out the uses of HZ and replace them with the sysconf() call 
as indicated.  I would be happy to work on this if you would point me to 
whomever I should send patches to.
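
For reference, a source-level version of the same idea might look like the C sketch below. It is only an illustration under assumptions: ticks_to_ms() is a made-up helper, not something from the OFED tree, and the #ifndef guard assumes the code should keep using a header-provided HZ where one is visible. Only sysconf() and _SC_CLK_TCK are standard POSIX names here.

```c
#include <unistd.h>

/*
 * Fall back to the runtime clock tick rate only when the system headers
 * did not expose HZ to user space.
 */
#ifndef HZ
#define HZ sysconf(_SC_CLK_TCK)
#endif

/*
 * Hypothetical helper: convert a number of clock ticks to milliseconds.
 * Returns -1 if the tick rate cannot be determined.
 */
static long ticks_to_ms(long ticks)
{
	long hz = HZ;

	if (hz <= 0)
		return -1;	/* sysconf() failed or reported nonsense */

	return ticks * 1000 / hz;
}
```

Reading HZ into a local once per call keeps the sysconf() lookup out of any inner arithmetic, which matters little here but avoids surprises where HZ used to be a compile-time constant.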

But the question I really have is this.  How can I (at least 
temporarily) inject these environment variables (which ostensibly just 
alleviate manual patching) into the build process?  Specifically, I 
looked in build.sh, and all the rpmbuild commands are of the form

	ex rpmbuild ...

where ... are options.  rpmbuild presumes that you will pass in any
needed environment variables, as I did.

So is this the right place to inject this environment variable change in
the absence of a formal patch?  I could work up some additional hacked
methods, but they are only temporary at best (such as using an 
rpmbuild.sh to force the issue).  Thoughts, guidance, pointers, and 
clues are sought.  I am not looking to formalize a hack, but I also need 
to get this build working.  Longer term (next few weeks) I would prefer 
to get fixes back to the maintainer(s).

Thanks.

Joe


-- 
landman at scalableinformatics.com


From ogerlitz at voltaire.com  Sat Jul  7 23:38:30 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 08 Jul 2007 09:38:30 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com>
References: <1183643723.25031.262.camel@mtls03> <468CFBD0.6040407@voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com>
Message-ID: <46908666.8090908@voltaire.com>

Eli Cohen wrote:
>> can you resend the patch with function names appearing in each hunk
>> (i.e. after the @@; use the diff -p flag for that)
>> Or.
> 
> Sure. It is attached now - sorry, but I am using Outlook from home :)

Nope, the attachment also lacked the function names. Anyway,
please see some comments below.

> Index: ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- ofa_kernel-1.2.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-06-28 13:48:51.000000000 +0300
> +++ ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-08 09:52:29.000000000 +0300
> @@ -50,6 +50,8 @@ MODULE_PARM_DESC(data_debug_level,
>  		 "Enable data path debug tracing if > 0");
>  #endif
>  
> +#define SKB_LEN_THOLD 150
> +
>  static DEFINE_MUTEX(pkey_mutex);
>  
>  struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
> @@ -169,7 +171,7 @@ static void ipoib_ib_handle_rx_wc(struct
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
> -	struct sk_buff *skb;
> +	struct sk_buff *skb, *nskb;
>  	u64 addr;
>  
>  	ipoib_dbg_data(priv, "recv completion: id %d, op %d, status: %d\n",
> @@ -223,6 +225,19 @@ static void ipoib_ib_handle_rx_wc(struct
>  		++priv->stats.rx_packets;
>  		priv->stats.rx_bytes += skb->len;
>  
> +		if (skb->len < SKB_LEN_THOLD) {
> +			nskb = dev_alloc_skb(skb->len);
> +			if (!nskb) {
> +				ipoib_warn(priv, "failed to allocate skb\n");
> +				return;
> +			}
> +			memcpy(nskb->data, skb->data, skb->len);
> +			skb_put(nskb, skb->len);
> +			nskb->protocol = skb->protocol;
> +			dev_kfree_skb_any(skb);
> +			skb = nskb;
> +		}
> +
>  		skb->dev = dev;
>  		/* XXX get correct PACKET_ type here */
>  		skb->pkt_type = PACKET_HOST;
> @@ -296,12 +311,12 @@ void ipoib_ib_completion(struct ib_cq *c
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	int n, i;
>  
> -	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>  	do {
>  		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
>  		for (i = 0; i < n; ++i)
>  			ipoib_ib_handle_wc(dev, priv->ibwc + i);
>  	} while (n == IPOIB_NUM_WC);
> +	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
>  }

It seems that the change to ipoib_ib_completion() entered this patch by 
mistake, am I correct?
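
For readers skimming the thread, the small-packet copy in the ipoib_ib_handle_rx_wc() hunk above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not kernel code: struct rx_buf and the rx_buf_*() helpers are hypothetical stand-ins for sk_buff handling, and keeping the original buffer on allocation failure is a choice made for the sketch that differs from the quoted patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LEN_THOLD 150	/* same value as SKB_LEN_THOLD in the quoted patch */

/* Hypothetical stand-in for a receive buffer; not the kernel's sk_buff. */
struct rx_buf {
	size_t capacity;	/* bytes allocated for data */
	size_t len;		/* bytes actually received */
	unsigned char *data;
};

static struct rx_buf *rx_buf_alloc(size_t capacity)
{
	struct rx_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	/* Always allocate at least one byte so data is never NULL. */
	b->data = malloc(capacity ? capacity : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->capacity = capacity;
	b->len = 0;
	return b;
}

static void rx_buf_free(struct rx_buf *b)
{
	if (!b)
		return;
	free(b->data);
	free(b);
}

/*
 * If the payload is short, move it into a buffer sized to fit and release
 * the large one. On allocation failure the original buffer is returned
 * unchanged (a choice made for this sketch only).
 */
static struct rx_buf *rx_buf_shrink_if_small(struct rx_buf *b)
{
	struct rx_buf *n;

	assert(b && b->len <= b->capacity);
	if (b->len >= LEN_THOLD)
		return b;
	n = rx_buf_alloc(b->len);
	if (!n)
		return b;
	memcpy(n->data, b->data, b->len);
	n->len = b->len;
	rx_buf_free(b);
	return n;
}
```

The point of the technique is that a short datagram ends up occupying a buffer of roughly its own size rather than a full receive buffer.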

Or.


From ogerlitz at voltaire.com  Sat Jul  7 23:53:49 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 08 Jul 2007 09:53:49 +0300
Subject: [ofa-general] Re: consumer data buffer ownership for inline sends
In-Reply-To: <adawsxe34rx.fsf@cisco.com>
References: <Pine.LNX.4.64.0707031144130.15147@zuben>
	<adawsxe34rx.fsf@cisco.com>
Message-ID: <469089FD.10908@voltaire.com>

Roland Dreier wrote:
>  > Does this mean that for inline sends, when ibv_post_send returns,
>  > the consumer regains ownership of the data buffer associated with this send?
>  > 
>  > Can this be stated as the official policy of libibverbs?
> 
> I guess that makes sense.  I wonder if there's any conceivable
> interpretation of the inline send flag where the adapter might need to
> access the original buffer after the request is posted?

Thinking about it a little, such an adapter would have too much
logic/state implemented in its HW/DMA engine... Assuming all this is
beyond the scope of the IB spec, can we take the liberty of turning it
into an official libibverbs policy that provider libraries must conform to?
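
To make the proposed policy concrete, here is a toy model of the ownership rule being discussed. It is not libibverbs or provider code: struct toy_wqe, toy_post_send_inline() and TOY_MAX_INLINE are hypothetical names, and the model simply assumes that an inline post copies the payload into the work-queue entry before returning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_INLINE 64	/* hypothetical inline capacity per entry */

/* Toy work-queue entry; not a libibverbs or hardware structure. */
struct toy_wqe {
	unsigned char inline_data[TOY_MAX_INLINE];
	size_t len;
};

/*
 * Toy "post inline send": the payload is copied into the entry before the
 * call returns, so nothing refers to the caller's buffer afterwards.
 * Returns 0 on success, -1 if the payload does not fit inline.
 */
static int toy_post_send_inline(struct toy_wqe *wqe, const void *buf,
				size_t len)
{
	assert(wqe && buf);
	if (len > TOY_MAX_INLINE)
		return -1;
	memcpy(wqe->inline_data, buf, len);
	wqe->len = len;
	return 0;
}
```

Under this model the caller may overwrite or free its buffer as soon as the post call returns, which is the behaviour the policy would guarantee to consumers.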

Or.


From vlad at dev.mellanox.co.il  Sun Jul  8 01:17:03 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 08 Jul 2007 11:17:03 +0300
Subject: [ofa-general] [GIT PULL ofed_1_2] iw_cxgb3 - Don't allow
	interrupts while obtaining the ctrl-qp mutex.
In-Reply-To: <468E5768.7090200@opengridcomputing.com>
References: <468E5768.7090200@opengridcomputing.com>
Message-ID: <46909D7F.1040006@dev.mellanox.co.il>

Steve Wise wrote:
> Vlad,
> 
> Please pull from
> 
> git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2
> 
> This patch fixes bug 681.
> 
> Below is the patch.
> 
> Steve.

Done,

Regards,
Vladimir


From erezz at voltaire.com  Sun Jul  8 01:34:21 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Sun, 08 Jul 2007 11:34:21 +0300
Subject: [ofa-general] Will SLES 10 sp2 contain the RDMA-CM?
Message-ID: <4690A18D.40709@voltaire.com>

All,


I've noticed that SLES 10 sp1 doesn't contain the RDMA-CM. We would like
to add iSER for sp2, but without the RDMA-CM we cannot add it.


Does Novell plan to add it to sp2? I guess that this should be very easy
with the backport patches from OFED 1.2.


Thanks,

-- 

____________________________________________________________

Erez Zilber | 972-9-971-7689

Software Engineer, Storage Team

Voltaire – The Grid Backbone

www.voltaire.com <http://www.voltaire.com/>


From vlad at lists.openfabrics.org  Sun Jul  8 02:44:15 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun,  8 Jul 2007 02:44:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070708-0200 daily build status
Message-ID: <20070708094415.81B28E60824@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18-8.el5
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From kliteyn at dev.mellanox.co.il  Sun Jul  8 06:55:41 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 08 Jul 2007 16:55:41 +0300
Subject: [ofa-general] [PATCH] osm: enhancing fat-tree routing for non-pure
	trees
Message-ID: <4690ECDD.7030106@dev.mellanox.co.il>

Hi Hal.

This patch handles the two new options for fat-tree routing, the
root guid file and the compute node guid file. With these options,
fat-tree routing is able to handle trees that are not pure
fat-trees, or that are not even symmetrical.
The routing "quality" still depends on the tree "correctness":
the more the topology looks like a pure fat-tree, the better
the routing.

All the changes are in one file, osm_ucast_ftree.c. As much as
I tried to divide this patch into separate stages, I found
myself going back and fixing things too many times, so at this
point it doesn't make sense to send the patch in parts: the
earlier patches would contain too much wrong code that was only
fixed later.

Bottom line: sorry, but this thing has to go in a single patch.

Here's what this patch does:

 1. Some modifications to ftree data structures and functions
     - Added guid getters for CAs and switches
     - Added node type and guid for each port group
     - Some naming changes
     - Added get_sw_by_guid and get_hca_by_guid functions

 2. Reading roots and compute nodes from guid files
     - Marking CAs with the number of CNs on the node
     - Marking port groups if they belong to CN

 3. Ranking rewritten to support root guids
     - ftree.tree_rank replaced by two ranks:
       ftree.max_switch_rank and ftree.leaf_switch_rank.
     - Tree rank for routing is considered as (ftree.leaf_switch_rank + 1)

 4. Created leaf switch array that contains all the leaves
    with CNs and possibly leaves between them, according to
    the fabric indexing.

 5. Checking new "lighter" topology constraint
     - all the leaves with real CNs should be at the same tree rank.
 
 6. Implemented the routing itself:
     - routing to all the CNs first
     - routing dummy targets for all the missing nodes
       or non-CNs that are connected to leaf switches
     - routing to all the non-CN CAs in the fabric
       (routing them as real targets on secondary path)
     - routing to all the switch-to-switch paths (left the same)

 7. Updated ordering file dump function
     - Treating non-compute nodes as dummies
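
As an illustration of item 7, the per-leaf part of the ordering dump can be sketched like this. It is a simplified model, not the code in the patch: struct toy_ca and toy_dump_leaf() are hypothetical names, and the fabric lookups are replaced by a plain array.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified view of one CA hanging off a leaf switch. */
struct toy_ca {
	unsigned lid;		/* base LID, host order */
	const char *desc;	/* node description */
	int is_cn;		/* non-zero if listed as a compute node */
};

/*
 * Write the ordering lines for one leaf: one line per compute node, then
 * 0xFFFF/DUMMY lines until the leaf has max_cn_per_leaf lines in total.
 * Non-CN CAs are skipped, so their slots end up as dummies.
 * Returns the number of dummy lines written.
 */
static unsigned toy_dump_leaf(FILE *out, const struct toy_ca *cas,
			      unsigned num_cas, unsigned max_cn_per_leaf)
{
	unsigned printed = 0;
	unsigned dummies = 0;
	unsigned i;

	for (i = 0; i < num_cas; i++) {
		if (!cas[i].is_cn)
			continue;
		fprintf(out, "0x%x\t%s\n", cas[i].lid, cas[i].desc);
		printed++;
	}
	assert(printed <= max_cn_per_leaf);
	for (i = printed; i < max_cn_per_leaf; i++) {
		fprintf(out, "0xFFFF\tDUMMY\n");
		dummies++;
	}
	return dummies;
}
```

Every leaf therefore contributes the same number of lines to the ordering file, regardless of how many compute nodes are actually attached to it.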

-- Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_ftree.c | 1348 ++++++++++++++++++++++++++++++++-------
 1 files changed, 1109 insertions(+), 239 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index e91f3ed..6e62276 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -119,6 +119,17 @@ typedef struct {
 
 /***************************************************
  **
+ **  ftree_guid_tbl_element_t definition
+ **
+ ***************************************************/
+
+typedef struct {
+   cl_map_item_t  map_item;
+   uint64_t  guid_ho;
+} ftree_guid_tbl_element_t;
+
+/***************************************************
+ **
  **  ftree_fwd_tbl_t definition
  **
  ***************************************************/
@@ -147,21 +158,27 @@ typedef struct ftree_port_t_
  **
  ***************************************************/
 
+typedef union ftree_hca_or_sw_
+{
+   struct ftree_hca_t_ * p_hca;
+   struct ftree_sw_t_  * p_sw;
+} ftree_hca_or_sw;
+
 typedef struct ftree_port_group_t_
 {
    cl_map_item_t  map_item;
    ib_net16_t     base_lid;           /* base lid of the current node */
    ib_net16_t     remote_base_lid;    /* base lid of the remote node */
    ib_net64_t     port_guid;          /* port guid of this port */
+   ib_net64_t     node_guid;          /* this node's guid */
+   uint8_t        node_type;          /* this node's type */
    ib_net64_t     remote_port_guid;   /* port guid of the remote port */
    ib_net64_t     remote_node_guid;   /* node guid of the remote node */
    uint8_t        remote_node_type;   /* IB_NODE_TYPE_{CA,SWITCH,ROUTER,...} */
-   union remote_hca_or_sw_
-   {
-      struct ftree_hca_t_ * remote_hca;
-      struct ftree_sw_t_  * remote_sw;
-   } remote_hca_or_sw;                /* pointer to remote hca/switch */
+   ftree_hca_or_sw hca_or_sw;         /* pointer to this hca/switch */
+   ftree_hca_or_sw remote_hca_or_sw;  /* pointer to remote hca/switch */
    cl_ptr_vector_t ports;             /* vector of ports to the same lid */
+   boolean_t       is_cn;             /* whether this port is a compute node */
 } ftree_port_group_t;
 
 /***************************************************
@@ -182,6 +199,7 @@ typedef struct ftree_sw_t_
    ftree_port_group_t  ** up_port_groups;
    uint8_t                up_port_groups_num;
    ftree_fwd_tbl_t        lft_buf;
+   boolean_t              is_leaf;
 } ftree_sw_t;
 
 /***************************************************
@@ -195,6 +213,7 @@ typedef struct ftree_hca_t_ {
    osm_node_t           * p_osm_node;
    ftree_port_group_t  ** up_port_groups;
    uint16_t               up_port_groups_num;
+   unsigned               cn_num;
 } ftree_hca_t;
 
 /***************************************************
@@ -209,10 +228,14 @@ typedef struct ftree_fabric_t_
    cl_qmap_t       hca_tbl;
    cl_qmap_t       sw_tbl;
    cl_qmap_t       sw_by_tuple_tbl;
-   uint8_t         tree_rank;
+   cl_list_t       root_guid_list;
+   cl_qmap_t       cn_guid_tbl;
+   unsigned        cn_num;
+   uint8_t         leaf_switch_rank;
+   uint8_t         max_switch_rank;
    ftree_sw_t   ** leaf_switches;
    uint32_t        leaf_switches_num;
-   uint16_t        max_hcas_per_leaf;
+   uint16_t        max_cn_per_leaf;
    cl_pool_t       sw_fwd_tbl_pool;
    uint16_t        lft_max_lid_ho;
    boolean_t       fabric_built;
@@ -254,8 +277,8 @@ __osm_ftree_compare_port_groups_by_remote_switch_index(
    ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2;
 
    return __osm_ftree_compare_switches_by_index(
-                  &((*pp_g1)->remote_hca_or_sw.remote_sw),
-                  &((*pp_g2)->remote_hca_or_sw.remote_sw) );
+                  &((*pp_g1)->remote_hca_or_sw.p_sw),
+                  &((*pp_g2)->remote_hca_or_sw.p_sw) );
 }
 
 /***************************************************/
@@ -393,6 +416,37 @@ __osm_ftree_sw_tbl_element_destroy(
 
 /***************************************************
  **
+ ** ftree_guid_tbl_element_t functions
+ **
+ ***************************************************/
+
+static ftree_guid_tbl_element_t *
+__osm_ftree_guid_tbl_element_create(
+   IN  uint64_t guid)
+{
+   ftree_guid_tbl_element_t * p_element =
+      (ftree_guid_tbl_element_t *) malloc(sizeof(ftree_guid_tbl_element_t));
+   if (!p_element)
+       return NULL;
+   memset(p_element, 0,sizeof(ftree_guid_tbl_element_t));
+
+   memcpy(&p_element->guid_ho, &guid, sizeof(uint64_t));
+   return p_element;
+}
+
+/***************************************************/
+
+static void
+__osm_ftree_guid_tbl_element_destroy(
+   IN  ftree_guid_tbl_element_t * p_element)
+{
+   if (!p_element)
+      return;
+   free(p_element);
+}
+
+/***************************************************
+ **
  ** ftree_port_t functions
  **
  ***************************************************/
@@ -433,11 +487,15 @@ static ftree_port_group_t *
 __osm_ftree_port_group_create(
    IN  ib_net16_t    base_lid,
    IN  ib_net16_t    remote_base_lid,
-   IN  ib_net64_t  * p_port_guid,
-   IN  ib_net64_t  * p_remote_port_guid,
-   IN  ib_net64_t  * p_remote_node_guid,
+   IN  ib_net64_t    port_guid,
+   IN  ib_net64_t    node_guid,
+   IN  uint8_t       node_type,
+   IN  void        * p_hca_or_sw,
+   IN  ib_net64_t    remote_port_guid,
+   IN  ib_net64_t    remote_node_guid,
    IN  uint8_t       remote_node_type,
-   IN  void        * p_remote_hca_or_sw)
+   IN  void        * p_remote_hca_or_sw,
+   IN  boolean_t     is_cn)
 {
    ftree_port_group_t * p_group =
             (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t));
@@ -447,18 +505,33 @@ __osm_ftree_port_group_create(
 
    p_group->base_lid = base_lid;
    p_group->remote_base_lid = remote_base_lid;
-   memcpy(&p_group->port_guid, p_port_guid, sizeof(ib_net64_t));
-   memcpy(&p_group->remote_port_guid, p_remote_port_guid, sizeof(ib_net64_t));
-   memcpy(&p_group->remote_node_guid, p_remote_node_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->port_guid, &port_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->node_guid, &node_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->remote_port_guid, &remote_port_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->remote_node_guid, &remote_node_guid, sizeof(ib_net64_t));
+
+   p_group->node_type = node_type;
+   switch (node_type)
+   {
+      case IB_NODE_TYPE_CA:
+         p_group->hca_or_sw.p_hca = (ftree_hca_t *)p_hca_or_sw;
+         break;
+      case IB_NODE_TYPE_SWITCH:
+         p_group->hca_or_sw.p_sw = (ftree_sw_t *)p_hca_or_sw;
+         break;
+      default:
+         /* we shouldn't get here - port is created only in hca or switch */
+         CL_ASSERT(0);
+   }
 
    p_group->remote_node_type = remote_node_type;
    switch (remote_node_type)
    {
       case IB_NODE_TYPE_CA:
-         p_group->remote_hca_or_sw.remote_hca = (ftree_hca_t *)p_remote_hca_or_sw;
+         p_group->remote_hca_or_sw.p_hca = (ftree_hca_t *)p_remote_hca_or_sw;
          break;
       case IB_NODE_TYPE_SWITCH:
-         p_group->remote_hca_or_sw.remote_sw = (ftree_sw_t *)p_remote_hca_or_sw;
+         p_group->remote_hca_or_sw.p_sw = (ftree_sw_t *)p_remote_hca_or_sw;
          break;
       default:
          /* we shouldn't get here - port is created only in hca or switch */
@@ -468,6 +541,7 @@ __osm_ftree_port_group_create(
    cl_ptr_vector_init(&p_group->ports,
                       0,  /* min size */
                       8); /* grow size */
+   p_group->is_cn = is_cn;
    return p_group;
 } /* __osm_ftree_port_group_create() */
 
@@ -640,6 +714,26 @@ __osm_ftree_sw_destroy(
 
 /***************************************************/
 
+static uint64_t
+__osm_ftree_sw_get_guid_no(
+   IN  ftree_sw_t * p_sw)
+{
+   if (!p_sw)
+      return 0;
+   return osm_node_get_node_guid(p_sw->p_osm_sw->p_node);
+}
+
+/***************************************************/
+
+static uint64_t
+__osm_ftree_sw_get_guid_ho(
+   IN  ftree_sw_t * p_sw)
+{
+   return cl_ntoh64(__osm_ftree_sw_get_guid_no(p_sw));
+}
+
+/***************************************************/
+
 static void
 __osm_ftree_sw_dump(
    IN  ftree_fabric_t * p_ftree,
@@ -657,7 +751,7 @@ __osm_ftree_sw_dump(
            "__osm_ftree_sw_dump: "
            "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
           __osm_ftree_tuple_to_str(p_sw->tuple),
-          cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+          __osm_ftree_sw_get_guid_ho(p_sw),
           p_sw->down_port_groups_num,
           p_sw->up_port_groups_num);
 
@@ -735,11 +829,15 @@ __osm_ftree_sw_add_port(
       p_group = __osm_ftree_port_group_create(
                      base_lid,
                      remote_base_lid,
-                     &port_guid,
-                     &remote_port_guid,
-                     &remote_node_guid,
+                     port_guid,
+                     __osm_ftree_sw_get_guid_no(p_sw),
+                     IB_NODE_TYPE_SWITCH,
+                     p_sw,
+                     remote_port_guid,
+                     remote_node_guid,
                      remote_node_type,
-                     p_remote_hca_or_sw);
+                     p_remote_hca_or_sw,
+                     FALSE);
       CL_ASSERT(p_group);
 
       if (direction == FTREE_DIRECTION_UP)
@@ -835,6 +933,26 @@ __osm_ftree_hca_destroy(
 
 /***************************************************/
 
+static uint64_t
+__osm_ftree_hca_get_guid_no(
+   IN  ftree_hca_t * p_hca)
+{
+   if (!p_hca)
+      return 0;
+   return osm_node_get_node_guid(p_hca->p_osm_node);
+}
+
+/***************************************************/
+
+static uint64_t
+__osm_ftree_hca_get_guid_ho(
+   IN  ftree_hca_t * p_hca)
+{
+   return cl_ntoh64(__osm_ftree_hca_get_guid_no(p_hca));
+}
+
+/***************************************************/
+
 static void
 __osm_ftree_hca_dump(
    IN  ftree_fabric_t * p_ftree,
@@ -851,7 +969,7 @@ __osm_ftree_hca_dump(
    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_hca_dump: "
            "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
-          cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
+          __osm_ftree_hca_get_guid_ho(p_hca),
           p_hca->up_port_groups_num);
 
    for( i = 0; i < p_hca->up_port_groups_num; i++ )
@@ -888,7 +1006,8 @@ __osm_ftree_hca_add_port(
    IN  ib_net64_t    remote_port_guid,
    IN  ib_net64_t    remote_node_guid,
    IN  uint8_t       remote_node_type,
-   IN  void        * p_remote_hca_or_sw)
+   IN  void        * p_remote_hca_or_sw,
+   IN  boolean_t     is_cn)
 {
    ftree_port_group_t * p_group;
 
@@ -903,11 +1022,15 @@ __osm_ftree_hca_add_port(
       p_group = __osm_ftree_port_group_create(
                      base_lid,
                      remote_base_lid,
-                     &port_guid,
-                     &remote_port_guid,
-                     &remote_node_guid,
+                     port_guid,
+                     __osm_ftree_hca_get_guid_no(p_hca),
+                     IB_NODE_TYPE_CA,
+                     p_hca,
+                     remote_port_guid,
+                     remote_node_guid,
                      remote_node_type,
-                     p_remote_hca_or_sw);
+                     p_remote_hca_or_sw,
+                     is_cn);
       p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group;
    }
    __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num);
@@ -933,6 +1056,10 @@ __osm_ftree_fabric_create()
    cl_qmap_init(&p_ftree->hca_tbl);
    cl_qmap_init(&p_ftree->sw_tbl);
    cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
+   cl_qmap_init(&p_ftree->cn_guid_tbl);
+
+   cl_list_construct( &p_ftree->root_guid_list );
+   cl_list_init( &p_ftree->root_guid_list, 10 );
 
    status = cl_pool_init( &p_ftree->sw_fwd_tbl_pool,
                           8,                 /* min pool size */
@@ -945,7 +1072,6 @@ __osm_ftree_fabric_create()
    if (status != CL_SUCCESS)
       return NULL;
 
-   p_ftree->tree_rank = 1;
    return p_ftree;
 }
 
@@ -960,6 +1086,9 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
    ftree_sw_t * p_next_sw;
    ftree_sw_tbl_element_t * p_element;
    ftree_sw_tbl_element_t * p_next_element;
+   ftree_guid_tbl_element_t * p_guid_element;
+   ftree_guid_tbl_element_t * p_next_guid_element;
+   uint64_t * p_guid;
 
    if (!p_ftree)
       return;
@@ -1000,6 +1129,26 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
    }
    cl_qmap_remove_all(&p_ftree->sw_by_tuple_tbl);
 
+   /* remove all the elements of cn_guid_tbl */
+
+   p_next_guid_element =
+      (ftree_guid_tbl_element_t *)cl_qmap_head(&p_ftree->cn_guid_tbl);
+   while( p_next_guid_element !=
+          (ftree_guid_tbl_element_t *)cl_qmap_end(&p_ftree->cn_guid_tbl) )
+   {
+      p_guid_element = p_next_guid_element;
+      p_next_guid_element =
+         (ftree_guid_tbl_element_t *)cl_qmap_next(&p_guid_element->map_item);
+      __osm_ftree_guid_tbl_element_destroy(p_guid_element);
+   }
+   cl_qmap_remove_all(&p_ftree->cn_guid_tbl);
+
+   /* remove all the elements of root_guid_list*/
+
+   while ( (p_guid = (uint64_t*)cl_list_remove_head(&p_ftree->root_guid_list)) )
+      free(p_guid);
+   cl_list_destroy(&p_ftree->root_guid_list);
+
    /* free the leaf switches array */
    if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches))
       free(p_ftree->leaf_switches);
@@ -1024,19 +1173,10 @@ __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)
 
 /***************************************************/
 
-static void
-__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank)
-{
-   if (rank > p_ftree->tree_rank)
-      p_ftree->tree_rank = rank;
-}
-
-/***************************************************/
-
 static uint8_t
 __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree)
 {
-   return p_ftree->tree_rank;
+   return p_ftree->leaf_switch_rank + 1;
 }
 
 /***************************************************/
@@ -1108,6 +1248,34 @@ __osm_ftree_fabric_get_sw_by_tuple(
 
 /***************************************************/
 
+static ftree_sw_t *
+__osm_ftree_fabric_get_sw_by_guid(
+   IN  ftree_fabric_t * p_ftree,
+   IN  uint64_t guid)
+{
+   ftree_sw_t * p_sw;
+   p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,guid);
+   if (p_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl))
+      return NULL;
+   return p_sw;
+}
+
+/***************************************************/
+
+static ftree_hca_t *
+__osm_ftree_fabric_get_hca_by_guid(
+   IN  ftree_fabric_t * p_ftree,
+   IN  uint64_t guid)
+{
+   ftree_hca_t * p_hca;
+   p_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,guid);
+   if (p_hca == (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl))
+      return NULL;
+   return p_hca;
+}
+
+/***************************************************/
+
 static void
 __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree)
 {
@@ -1133,7 +1301,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree)
       __osm_ftree_hca_dump(p_ftree, p_hca);
    }
 
-   for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
+   for (i = 0; i < p_ftree->max_switch_rank; i++)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
               "__osm_ftree_fabric_dump: -- Rank %u switches\n", i);
@@ -1160,7 +1328,6 @@ __osm_ftree_fabric_dump_general_info(
 {
    uint32_t i,j;
    ftree_sw_t * p_sw;
-   char * addition_str;
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
            "__osm_ftree_fabric_dump_general_info: "
@@ -1170,15 +1337,20 @@ __osm_ftree_fabric_dump_general_info(
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
            "__osm_ftree_fabric_dump_general_info: "
-           "  - FatTree rank (switches only): %u\n",
-           p_ftree->tree_rank);
+           "  - FatTree rank (roots to leaf switches): %u\n",
+           p_ftree->leaf_switch_rank + 1);
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
+           "  - FatTree max switch rank: %u\n",
+           p_ftree->max_switch_rank);
    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
            "__osm_ftree_fabric_dump_general_info: "
-           "  - Fabric has %u CAs, %u switches\n",
+           "  - Fabric has %u CAs (%u of them CNs), %u switches\n",
            cl_qmap_count(&p_ftree->hca_tbl),
+           p_ftree->cn_num,
            cl_qmap_count(&p_ftree->sw_tbl));
 
-   for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
+   for (i = 0; i <= p_ftree->max_switch_rank; i++)
    {
       j = 0;
       for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
@@ -1189,16 +1361,20 @@ __osm_ftree_fabric_dump_general_info(
             j++;
       }
       if (i == 0)
-         addition_str = " (root) ";
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+                 "__osm_ftree_fabric_dump_general_info: "
+                 "  - Fabric has %u switches at rank %u (roots)\n",
+                 j, i);
+      else if (i == p_ftree->leaf_switch_rank)
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+                 "__osm_ftree_fabric_dump_general_info: "
+                 "  - Fabric has %u switches at rank %u (%u of them leafs)\n",
+                 j, i, p_ftree->leaf_switches_num);
       else
-         if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
-            addition_str = " (leaf) ";
-         else
-            addition_str = " ";
          osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
                  "__osm_ftree_fabric_dump_general_info: "
-                 "  - Fabric has %u rank %u%s switches\n",
-                 j, i, addition_str);
+                 "  - Fabric has %u switches at rank %u\n",
+                 j, i);
    }
 
    if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE))
@@ -1214,7 +1390,7 @@ __osm_ftree_fabric_dump_general_info(
                osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                        "__osm_ftree_fabric_dump_general_info: "
                        "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
-                       cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                       __osm_ftree_sw_get_guid_ho(p_sw),
                        cl_ntoh16(p_sw->base_lid),
                        __osm_ftree_tuple_to_str(p_sw->tuple));
       }
@@ -1227,8 +1403,7 @@ __osm_ftree_fabric_dump_general_info(
             osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                     "__osm_ftree_fabric_dump_general_info: "
                     "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
-                    cl_ntoh64(osm_node_get_node_guid(
-                              p_ftree->leaf_switches[i]->p_osm_sw->p_node)),
+                    __osm_ftree_sw_get_guid_ho(p_ftree->leaf_switches[i]),
                     cl_ntoh16(p_ftree->leaf_switches[i]->base_lid),
                     __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple));
       }
@@ -1243,9 +1418,11 @@ __osm_ftree_fabric_dump_hca_ordering(
 {
    ftree_hca_t        * p_hca;
    ftree_sw_t         * p_sw;
-   ftree_port_group_t * p_group;
+   ftree_port_group_t * p_group_on_sw;
+   ftree_port_group_t * p_group_on_hca;
    uint32_t             i;
    uint32_t             j;
+   unsigned             printed_hcas_on_leaf;
 
    char path[1024];
    FILE * p_hca_ordering_file;
@@ -1268,22 +1445,34 @@ __osm_ftree_fabric_dump_hca_ordering(
    for(i = 0; i < p_ftree->leaf_switches_num; i++)
    {
       p_sw = p_ftree->leaf_switches[i];
-      /* for each real HCA connected to this switch */
+      printed_hcas_on_leaf = 0;
+
+      /* for each real CA (CNs and not) connected to this switch */
       for (j = 0; j < p_sw->down_port_groups_num; j++)
       {
-         p_group = p_sw->down_port_groups[j];
-         p_hca = p_group->remote_hca_or_sw.remote_hca;
+         p_group_on_sw = p_sw->down_port_groups[j];
+
+         if (p_group_on_sw->remote_node_type != IB_NODE_TYPE_CA)
+            continue;
+
+         p_hca = p_group_on_sw->remote_hca_or_sw.p_hca;
+         p_group_on_hca = __osm_ftree_hca_get_port_group_by_remote_lid(
+                              p_hca, p_group_on_sw->base_lid);
+
+         /* treat non-compute nodes as dummies */
+         if (!p_group_on_hca->is_cn)
+            continue;
 
          fprintf(p_hca_ordering_file,"0x%x\t%s\n",
-                 cl_ntoh16(p_group->remote_base_lid),
+                 cl_ntoh16(p_group_on_hca->base_lid),
                  p_hca->p_osm_node->print_desc);
+
+         printed_hcas_on_leaf++;
       }
 
-      /* now print dummy HCAs */
-      for (j = p_sw->down_port_groups_num; j < p_ftree->max_hcas_per_leaf; j++)
-      {
+      /* now print missing HCAs */
+      for (j = 0; j < (p_ftree->max_cn_per_leaf - printed_hcas_on_leaf); j++)
          fprintf(p_hca_ordering_file,"0xFFFF\tDUMMY\n");
-      }
 
    }
    /* done going through all the leaf switches */
@@ -1368,28 +1557,88 @@ __osm_ftree_fabric_get_new_tuple(
 
 /***************************************************/
 
-static void
-__osm_ftree_fabric_calculate_rank(
+static inline boolean_t
+__osm_ftree_fabric_roots_provided(
    IN  ftree_fabric_t * p_ftree)
 {
-   ftree_sw_t   * p_sw;
-   ftree_sw_t   * p_next_sw;
-   uint32_t       max_rank = 0;
+   return (p_ftree->p_osm->subn.opt.root_guid_file != NULL);
+}
 
-   /* go over all the switches and find maximal switch rank */
+/***************************************************/
 
-   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
-   while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
+static inline boolean_t
+__osm_ftree_fabric_cns_provided(
+   IN  ftree_fabric_t * p_ftree)
+{
+   return (p_ftree->p_osm->subn.opt.cn_guid_file != NULL);
+}
+
+/***************************************************/
+
+static int
+__osm_ftree_fabric_mark_leaf_switches(
+   IN   ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t  * p_sw;
+   ftree_hca_t * p_hca;
+   ftree_hca_t * p_next_hca;
+   unsigned i;
+   int res = 0;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_mark_leaf_switches);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_fabric_mark_leaf_switches: "
+           "Marking leaf switches in fabric\n");
+
+   /* Scan all the CAs; for each CA that has CNs, find every CN port and
+      mark the switch connected to that port as a leaf switch.
+      Also ensure that each marked leaf has rank p_ftree->leaf_switch_rank. */
+   p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+   while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) )
    {
-      p_sw = p_next_sw;
-      if(p_sw->rank > max_rank)
-         max_rank = p_sw->rank;
-      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+      p_hca = p_next_hca;
+      p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item);
+      if (!p_hca->cn_num)
+         continue;
+
+      for( i = 0; i < p_hca->up_port_groups_num; i++ )
+      {
+         if (!p_hca->up_port_groups[i]->is_cn)
+            continue;
+
+         /* In CAs, a port group always has one port, and since this
+            port group is a CN, we know that this port is a compute node */
+         CL_ASSERT(p_hca->up_port_groups[i]->remote_node_type == IB_NODE_TYPE_SWITCH);
+         p_sw = p_hca->up_port_groups[i]->remote_hca_or_sw.p_sw;
+
+         /* check if this switch was already processed */
+         if (p_sw->is_leaf)
+            continue;
+         p_sw->is_leaf = TRUE;
+
+         /* ensure that this leaf switch is at the correct tree level */
+         if (p_sw->rank != p_ftree->leaf_switch_rank)
+         {
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_mark_leaf_switches: ERR AB26: "
+                    "CN port 0x%" PRIx64 " is connected to switch 0x%" PRIx64 " with rank %u, "
+                    "while FatTree leaf rank is %u\n",
+                    cl_ntoh64(p_hca->up_port_groups[i]->port_guid),
+                    __osm_ftree_sw_get_guid_ho(p_sw),
+                    p_sw->rank,
+                    p_ftree->leaf_switch_rank);
+            res = -1;
+            goto Exit;
+
+         }
+      }
    }
 
-   /* set FatTree rank */
-   __osm_ftree_fabric_set_rank(p_ftree, max_rank + 1);
-}
+  Exit:
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return res;
+} /* __osm_ftree_fabric_mark_leaf_switches() */
 
 /***************************************************/
 
@@ -1410,20 +1659,14 @@ __osm_ftree_fabric_make_indexing(
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
            "Starting FatTree indexing\n");
 
-   /* create array of leaf switches */
-   p_ftree->leaf_switches = (ftree_sw_t **)
-         malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *));
-
-   /* Looking for a leaf switch - the one that has rank equal to (tree_rank - 1).
-      This switch will be used as a starting point for indexing algorithm. */
-
+   /* using the first leaf switch as a starting point for indexing algorithm. */
    p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
-   while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
    {
       p_sw = p_next_sw;
-      if(p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      if (p_sw->is_leaf)
          break;
-      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item);
    }
 
    CL_ASSERT(p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
@@ -1442,7 +1685,7 @@ __osm_ftree_fabric_make_indexing(
            p_sw->rank,
            __osm_ftree_tuple_to_str(p_sw->tuple),
            cl_ntoh16(p_sw->base_lid),
-           cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)));
+           __osm_ftree_sw_get_guid_ho(p_sw));
 
    /*
     * Now run BFS and assign indexes to all switches
@@ -1469,22 +1712,23 @@ __osm_ftree_fabric_make_indexing(
 
       /* Discover all the nodes from ports that are pointing down */
 
-      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      if (p_sw->rank >= p_ftree->leaf_switch_rank)
       {
-         /* add switch to leaf switches array */
-         p_ftree->leaf_switches[p_ftree->leaf_switches_num++] = p_sw;
-         /* update the max_hcas_per_leaf value */
-         if (p_sw->down_port_groups_num > p_ftree->max_hcas_per_leaf)
-            p_ftree->max_hcas_per_leaf = p_sw->down_port_groups_num;
+         /* whether downward ports are pointing to CAs or switches,
+            we don't assign indexes to switches that are located
+            lower than leaf switches */
       }
       else
       {
-         /* This is not the leaf switch, which means that all the
-            ports that point down are taking us to another switches.
-            No need to assign indexing to HCAs */
+         /* This is not the leaf switch */
          for( i = 0; i < p_sw->down_port_groups_num; i++ )
          {
-            p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw;
+            /* Work with port groups that are pointing to switches only.
+               No need to assign indexing to HCAs */
+            if (p_sw->down_port_groups[i]->remote_node_type != IB_NODE_TYPE_SWITCH)
+               continue;
+
+            p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.p_sw;
             if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
             {
                /* this switch has been already indexed */
@@ -1523,7 +1767,7 @@ __osm_ftree_fabric_make_indexing(
             that are pointing up are taking us to another switches. */
          for( i = 0; i < p_sw->up_port_groups_num; i++ )
          {
-            p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw;
+            p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.p_sw;
             if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
                continue;
             /* allocate new tuple */
@@ -1554,14 +1798,138 @@ __osm_ftree_fabric_make_indexing(
    }
    cl_list_destroy(&bfs_list);
 
-   /* sort array of leaf switches by index */
-   qsort(p_ftree->leaf_switches,     /* array */
-         p_ftree->leaf_switches_num, /* number of elements */
-         sizeof(ftree_sw_t *),       /* size of each element */
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+} /* __osm_ftree_fabric_make_indexing() */
+
+/***************************************************/
+
+static int
+__osm_ftree_fabric_create_leaf_switch_array(
+   IN   ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t  * p_sw;
+   ftree_sw_t  * p_next_sw;
+   ftree_sw_t ** all_switches_at_leaf_level;
+   unsigned i;
+   unsigned all_leaf_idx = 0;
+   unsigned first_leaf_idx;
+   unsigned last_leaf_idx;
+   int res = 0;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_create_leaf_switch_array);
+
+   /* create array of ALL the switches that have leaf rank */
+   all_switches_at_leaf_level = (ftree_sw_t **)
+            malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *));
+   if (!all_switches_at_leaf_level)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fat-tree routing: Memory allocation failed\n");
+      res = -1;
+      goto Exit;
+   }
+   memset(all_switches_at_leaf_level,0,cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *));
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
+   {
+      p_sw = p_next_sw;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item);
+      if (p_sw->rank == p_ftree->leaf_switch_rank)
+      {
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_create_leaf_switch_array: "
+                 "Adding switch 0x%" PRIx64 " to full leaf switch array\n",
+                 __osm_ftree_sw_get_guid_ho(p_sw));
+         all_switches_at_leaf_level[all_leaf_idx++] = p_sw;
+
+      }
+   }
+
+   /* quick-sort array of leaf switches by index */
+   qsort(all_switches_at_leaf_level,             /* array */
+         all_leaf_idx,                           /* number of elements */
+         sizeof(ftree_sw_t *),                   /* size of each element */
          __osm_ftree_compare_switches_by_index); /* comparator */
 
+   /* find the first and the last REAL leaf (a switch
+      that has CNs) in the array of all the leaves */
+
+   first_leaf_idx = all_leaf_idx;
+   last_leaf_idx = 0;
+   for ( i = 0; i < all_leaf_idx; i++ )
+   {
+      if (all_switches_at_leaf_level[i]->is_leaf)
+      {
+         if (i < first_leaf_idx)
+            first_leaf_idx = i;
+         last_leaf_idx = i;
+      }
+   }
+   CL_ASSERT(first_leaf_idx <= last_leaf_idx);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+           "__osm_ftree_fabric_create_leaf_switch_array: "
+           "Full leaf array info: first_leaf_idx = %u, last_leaf_idx = %u\n",
+           first_leaf_idx, last_leaf_idx);
+
+   /* Create array of REAL leaf switches, sorted by index.
+      This array may also contain switches at the same rank w/o CNs,
+      if they fall between real leaves in the indexing order. */
+   p_ftree->leaf_switches_num = last_leaf_idx - first_leaf_idx + 1;
+   p_ftree->leaf_switches = (ftree_sw_t **)
+            malloc(p_ftree->leaf_switches_num * sizeof(ftree_sw_t *));
+   if (!p_ftree->leaf_switches)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fat-tree routing: Memory allocation failed\n");
+      res = -1;
+      goto Exit;
+   }
+
+   memcpy(p_ftree->leaf_switches,
+          &(all_switches_at_leaf_level[first_leaf_idx]),
+          p_ftree->leaf_switches_num * sizeof(ftree_sw_t *));
+
+   free(all_switches_at_leaf_level);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+           "__osm_ftree_fabric_create_leaf_switch_array: "
+           "Created array of %u leaf switches\n",
+           p_ftree->leaf_switches_num);
+
+  Exit:
    OSM_LOG_EXIT(&p_ftree->p_osm->log);
-} /* __osm_ftree_fabric_make_indexing() */
+   return res;
+} /* __osm_ftree_fabric_create_leaf_switch_array() */
+
+/***************************************************/
+
+static void
+__osm_ftree_fabric_set_max_cn_per_leaf(
+   IN   ftree_fabric_t * p_ftree)
+{
+   unsigned i;
+   unsigned j;
+   unsigned cns_on_this_leaf;
+   ftree_sw_t * p_sw;
+   ftree_port_group_t * p_group;
+
+   for (i = 0; i < p_ftree->leaf_switches_num; i++)
+   {
+      p_sw = p_ftree->leaf_switches[i];
+      cns_on_this_leaf = 0;
+      for (j = 0; j < p_sw->down_port_groups_num; j++)
+      {
+         p_group = p_sw->down_port_groups[j];
+         if (p_group->remote_node_type != IB_NODE_TYPE_CA)
+            continue;
+         cns_on_this_leaf += p_group->remote_hca_or_sw.p_hca->cn_num;
+      }
+      if (cns_on_this_leaf > p_ftree->max_cn_per_leaf)
+         p_ftree->max_cn_per_leaf = cns_on_this_leaf;
+   }
+} /* __osm_ftree_fabric_set_max_cn_per_leaf() */
 
 /***************************************************/
 
@@ -1617,11 +1985,11 @@ __osm_ftree_fabric_validate_topology(
                     "ERR AB09: Different number of upward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n",
-                    cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)),
+                    __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]),
                     cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
                     __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
                     reference_sw_arr[p_sw->rank]->up_port_groups_num,
-                    cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                    __osm_ftree_sw_get_guid_ho(p_sw),
                     cl_ntoh16(p_sw->base_lid),
                     __osm_ftree_tuple_to_str(p_sw->tuple),
                     p_sw->up_port_groups_num);
@@ -1629,7 +1997,7 @@ __osm_ftree_fabric_validate_topology(
             break;
          }
 
-         if ( p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1) &&
+         if ( p_sw->rank != (tree_rank - 1) &&
               reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num )
          {
             /* we're allowing some hca's to be missing */
@@ -1638,11 +2006,11 @@ __osm_ftree_fabric_validate_topology(
                     "ERR AB0A: Different number of downward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n",
-                    cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)),
+                    __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]),
                     cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
                     __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
                     reference_sw_arr[p_sw->rank]->down_port_groups_num,
-                    cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                    __osm_ftree_sw_get_guid_ho(p_sw),
                     cl_ntoh16(p_sw->base_lid),
                     __osm_ftree_tuple_to_str(p_sw->tuple),
                     p_sw->down_port_groups_num);
@@ -1663,11 +2031,11 @@ __osm_ftree_fabric_validate_topology(
                            "ERR AB0B: Different number of ports in an upward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
-                           cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)),
+                           __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]),
                            cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
                            __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
                            cl_ptr_vector_get_size(&p_ref_group->ports),
-                           cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                           __osm_ftree_sw_get_guid_ho(p_sw),
                            cl_ntoh16(p_sw->base_lid),
                            __osm_ftree_tuple_to_str(p_sw->tuple),
                            cl_ptr_vector_get_size(&p_group->ports));
@@ -1691,11 +2059,11 @@ __osm_ftree_fabric_validate_topology(
                            "ERR AB0C: Different number of ports in an downward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
-                           cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)),
+                           __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]),
                            cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
                            __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
                            cl_ptr_vector_get_size(&p_ref_group->ports),
-                           cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                           __osm_ftree_sw_get_guid_ho(p_sw),
                            cl_ntoh16(p_sw->base_lid),
                            __osm_ftree_tuple_to_str(p_sw->tuple),
                            cl_ptr_vector_get_size(&p_group->ports));
@@ -1781,9 +2149,6 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
    /* we shouldn't enter here if both real_lid and main_path are false */
    CL_ASSERT(is_real_lid || is_main_path);
 
-   /* can't be here for leaf switch, */
-   CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1));
-
    /* if there is no down-going ports */
    if (p_sw->down_port_groups_num == 0)
        return;
@@ -1793,6 +2158,10 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
    {
       p_group = p_sw->down_port_groups[i];
 
+      /* Skip this port group unless it points to a switch */
+      if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+         continue;
+
       if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) )
       {
          /* This port group has a port that was used when we entered this switch,
@@ -1825,7 +2194,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
          lowest load of upgoing routes.
          Set on the remote switch how to get to the target_lid -
          set LFT(target_lid) on the remote switch to the remote port */
-      p_remote_sw = p_group->remote_hca_or_sw.remote_sw;
+      p_remote_sw = p_group->remote_hca_or_sw.p_sw;
 
       if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw,
                                      cl_ntoh16(target_lid)) != OSM_NO_PATH )
@@ -1918,11 +2287,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
          p_min_port->counter_up++;
 
       /* Recursion step:
-         Assign upgoing ports by stepping down, starting on REMOTE switch.
-         Recursion stop condition - if the REMOTE switch is a leaf switch. */
-      if (p_remote_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
-      {
-         __osm_ftree_fabric_route_upgoing_by_going_down(
+         Assign upgoing ports by stepping down, starting on REMOTE switch */
+      __osm_ftree_fabric_route_upgoing_by_going_down(
                p_ftree,
                p_remote_sw,   /* remote switch - used as a route-upgoing alg. start point */
                NULL,          /* prev. position - NULL to mark that we went down and not up */
@@ -1931,7 +2297,6 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
                is_real_lid,   /* whether the target LID is real or dummy */
                is_main_path,  /* whether this is path to HCA that should by tracked by counters */
                highest_rank_in_route); /* highest visited point in the tree before going down */
-      }
    }
    /* done scanning all the down-going port groups */
 
@@ -1972,11 +2337,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
    /* we shouldn't enter here if both real_lid and main_path are false */
    CL_ASSERT(is_real_lid || is_main_path);
 
-   /* If this switch isn't a leaf switch:
-      Assign upgoing ports by stepping down, starting on THIS switch. */
-   if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
-   {
-      __osm_ftree_fabric_route_upgoing_by_going_down(
+   /* Assign upgoing ports by stepping down, starting on THIS switch */
+   __osm_ftree_fabric_route_upgoing_by_going_down(
          p_ftree,
          p_sw,          /* local switch - used as a route-upgoing alg. start point */
          p_prev_sw,     /* switch that we went up from (NULL means that we went down) */
@@ -1985,7 +2347,6 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
          is_real_lid,   /* whether this target LID is real or dummy */
          is_main_path,  /* whether this path to HCA should by tracked by counters */
          p_sw->rank);   /* the highest visited point in the tree before going down */
-   }
 
    /* recursion stop condition - if it's a root switch, */
    if (p_sw->rank == 0)
@@ -2026,7 +2387,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
       lowest load of downgoing routes.
       Set on the remote switch how to get to the target_lid -
       set LFT(target_lid) on the remote switch to the remote port */
-   p_remote_sw = p_min_group->remote_hca_or_sw.remote_sw;
+   p_remote_sw = p_min_group->remote_hca_or_sw.p_sw;
 
    /* Four possible cases:
     *
@@ -2063,7 +2424,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
    /* covering first half of case 1, and case 3 */
    if (is_main_path)
    {
-      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      if (p_sw->is_leaf)
       {
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
@@ -2154,14 +2515,14 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
    for (i = 0; i < p_sw->up_port_groups_num; i++)
    {
       p_group = p_sw->up_port_groups[i];
-      p_remote_sw = p_group->remote_hca_or_sw.remote_sw;
+      p_remote_sw = p_group->remote_hca_or_sw.p_sw;
 
       /* skip if target lid has been already set on remote switch fwd tbl */
       if (__osm_ftree_sw_get_fwd_table_block(
                   p_remote_sw,cl_ntoh16(target_lid)) != OSM_NO_PATH)
          continue;
 
-      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      if (p_sw->is_leaf)
       {
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
@@ -2219,70 +2580,99 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
  */
 
 static void
-__osm_ftree_fabric_route_to_hcas(
+__osm_ftree_fabric_route_to_cns(
    IN  ftree_fabric_t * p_ftree)
 {
    ftree_sw_t         * p_sw;
-   ftree_port_group_t * p_group;
+   ftree_hca_t        * p_hca;
+   ftree_port_group_t * p_leaf_port_group;
+   ftree_port_group_t * p_hca_port_group;
    ftree_port_t       * p_port;
    uint32_t             i;
    uint32_t             j;
-   ib_net16_t           remote_lid;
+   ib_net16_t           hca_lid;
+   unsigned             routed_targets_on_leaf;
 
-   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_cns);
 
    /* for each leaf switch (in indexing order) */
    for(i = 0; i < p_ftree->leaf_switches_num; i++)
    {
       p_sw = p_ftree->leaf_switches[i];
+      routed_targets_on_leaf = 0;
 
       /* for each HCA connected to this switch */
       for (j = 0; j < p_sw->down_port_groups_num; j++)
       {
+         p_leaf_port_group = p_sw->down_port_groups[j];
+
+         /* work with this port group only if the remote node is CA */
+         if (p_leaf_port_group->remote_node_type != IB_NODE_TYPE_CA)
+            continue;
+
+         p_hca = p_leaf_port_group->remote_hca_or_sw.p_hca;
+
+         /* work with this port group only if remote HCA has CNs */
+         if (!p_hca->cn_num)
+            continue;
+
+         p_hca_port_group = __osm_ftree_hca_get_port_group_by_remote_lid(
+                              p_hca, p_leaf_port_group->base_lid);
+         CL_ASSERT(p_hca_port_group);
+
+         /* work with this port group only if remote port is CN */
+         if (!p_hca_port_group->is_cn)
+            continue;
+
          /* obtain the LID of HCA port */
-         p_group = p_sw->down_port_groups[j];
-         remote_lid = p_group->remote_base_lid;
+         hca_lid = p_leaf_port_group->remote_base_lid;
 
          /* set local LFT(LID) to the port that is connected to HCA */
-         cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port);
+         cl_ptr_vector_at(&p_leaf_port_group->ports, 0, (void **)&p_port);
          __osm_ftree_sw_set_fwd_table_block(p_sw,
-                                            cl_ntoh16(remote_lid),
+                                            cl_ntoh16(hca_lid),
                                             p_port->port_num);
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-                 "__osm_ftree_fabric_route_to_hcas: "
-                 "Switch %s: set path to CA LID 0x%x through port %u\n",
+                 "__osm_ftree_fabric_route_to_cns: "
+                 "Switch %s: set path to CN LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_sw->tuple),
-                 cl_ntoh16(remote_lid),
+                 cl_ntoh16(hca_lid),
                  p_port->port_num);
 
          /* set local min hop table(LID) to route to the CA */
          __osm_ftree_sw_set_hops(p_sw,
                                  p_ftree->lft_max_lid_ho,
-                                 cl_ntoh16(remote_lid),
+                                 cl_ntoh16(hca_lid),
                                  p_port->port_num,
                                  1);
 
-         /* assign downgoing ports by stepping up */
+         /* Assign downgoing ports by stepping up.
+            Since only CNs are routed here, each one is routed as a REAL
+            LID and the fat-tree balancing counters are updated. */
          __osm_ftree_fabric_route_downgoing_by_going_up(
                p_ftree,
                p_sw,       /* local switch - used as a route-downgoing alg. start point */
                NULL,       /* prev. position switch */
-               remote_lid, /* LID that we're routing to */
-               __osm_ftree_fabric_get_rank(p_ftree), /* rank of the LID that we're routing to */
+               hca_lid,    /* LID that we're routing to */
+               p_sw->rank+1,/* rank of the LID that we're routing to */
                TRUE,       /* whether this HCA LID is real or dummy */
                TRUE);      /* whether this path to HCA should by tracked by counters */
+
+         /* count how many real targets have been routed from this leaf switch */
+         routed_targets_on_leaf++;
       }
 
-      /* We're done with the real HCAs. Now route the dummy HCAs that are missing.
+      /* We're done with the real targets (all CNs) of this leaf switch.
+         Now route the dummy HCAs that are missing or that are non-CNs.
          When routing to dummy HCAs we don't fill lid matrices. */
 
-      if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
+      if (p_ftree->max_cn_per_leaf > routed_targets_on_leaf)
       {
-         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_cns: "
                  "Routing %u dummy CAs\n",
-                 p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
+                 p_ftree->max_cn_per_leaf - routed_targets_on_leaf);
          for ( j = 0;
-               ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
+               ((int)j) < (p_ftree->max_cn_per_leaf - routed_targets_on_leaf);
                j++)
          {
             /* assign downgoing ports by stepping up */
@@ -2299,7 +2689,99 @@ __osm_ftree_fabric_route_to_hcas(
    }
    /* done going through all the leaf switches */
    OSM_LOG_EXIT(&p_ftree->p_osm->log);
-} /* __osm_ftree_fabric_route_to_hcas() */
+} /* __osm_ftree_fabric_route_to_cns() */
+
+/***************************************************/
+
+/*
+ * Pseudo code:
+ *    foreach HCA non-CN port in fabric
+ *       obtain the LID of the HCA port
+ *       get switch that is connected to this HCA port
+ *       set switch LFT(LID) to the port connecting to this HCA port
+ *       call assign-down-going-port-by-descending-up(TRUE,FALSE) on CURRENT switch
+ *
+ * Routing to these HCAs is routing a REAL hca lid on SECONDARY path:
+ *   - we should set fwd tables
+ *   - we should NOT update port counters
+ */
+
+static void
+__osm_ftree_fabric_route_to_non_cns(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t         * p_sw;
+   ftree_hca_t        * p_hca;
+   ftree_hca_t        * p_next_hca;
+   ftree_port_t       * p_hca_port;
+   ftree_port_group_t * p_hca_port_group;
+   ib_net16_t           hca_lid;
+   unsigned             port_num_on_switch;
+   unsigned             i;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_non_cns);
+
+
+   p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+   while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) )
+   {
+      p_hca = p_next_hca;
+      p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item );
+
+      for (i = 0; i < p_hca->up_port_groups_num; i++)
+      {
+         p_hca_port_group = p_hca->up_port_groups[i];
+
+         /* skip this port if it's CN, in which case it has been already routed */
+         if (p_hca_port_group->is_cn)
+            continue;
+
+         /* skip this port if it is not connected to switch */
+         if (p_hca_port_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+            continue;
+
+         p_sw = p_hca_port_group->remote_hca_or_sw.p_sw;
+         hca_lid = p_hca_port_group->base_lid;
+
+         /* set the switch's LFT(LID) to the port that is connected to HCA */
+         cl_ptr_vector_at(&p_hca_port_group->ports, 0, (void **)&p_hca_port);
+         port_num_on_switch = p_hca_port->remote_port_num;
+         __osm_ftree_sw_set_fwd_table_block(p_sw,
+                                            cl_ntoh16(hca_lid),
+                                            port_num_on_switch);
+
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_to_non_cns: "
+                 "Switch %s: set path to non-CN HCA LID 0x%x through port %u\n",
+                 __osm_ftree_tuple_to_str(p_sw->tuple),
+                 cl_ntoh16(hca_lid),
+                 port_num_on_switch);
+
+         /* set local min hop table(LID) to route to the CA */
+         __osm_ftree_sw_set_hops(p_sw,
+                                 p_ftree->lft_max_lid_ho,
+                                 cl_ntoh16(hca_lid),
+                                 port_num_on_switch, /* port num */
+                                 1);                 /* hops */
+
+         /* Assign downgoing ports by stepping up.
+            We're routing REAL targets, but since they are not CNs and not
+            included in the leafs array, treat them as SECONDARY path, which
+            means that the counters won't be updated.*/
+         __osm_ftree_fabric_route_downgoing_by_going_up(
+               p_ftree,
+               p_sw,        /* local switch - used as a route-downgoing alg. start point */
+               NULL,        /* prev. position switch */
+               hca_lid,     /* LID that we're routing to */
+               p_sw->rank+1,/* rank of the LID that we're routing to */
+               TRUE,        /* whether this HCA LID is real or dummy */
+               FALSE);      /* whether this path to HCA should be tracked by counters */
+      }
+      /* done with all the port groups of this HCA - go to next HCA */
+   }
+
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+} /* __osm_ftree_fabric_route_to_non_cns() */
 
 /***************************************************/
 
@@ -2431,14 +2913,11 @@ __osm_ftree_rank_switches_from_leafs(
    osm_node_t   * p_remote_node;
    osm_physp_t  * p_osm_port;
    uint8_t        i;
-   ftree_sw_tbl_element_t * p_sw_tbl_element = NULL;
+   unsigned       max_rank = 0;
 
    while (!cl_is_list_empty(p_ranking_bfs_list))
    {
-      p_sw_tbl_element = (ftree_sw_tbl_element_t *) cl_list_remove_head(p_ranking_bfs_list);
-      p_sw = p_sw_tbl_element->p_sw;
-      __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
-
+      p_sw = (ftree_sw_t *) cl_list_remove_head(p_ranking_bfs_list);
       p_node = p_sw->p_osm_sw->p_node;
 
       /* note: skipping port 0 on switches */
@@ -2456,9 +2935,9 @@ __osm_ftree_rank_switches_from_leafs(
          if (osm_node_get_type(p_remote_node) != IB_NODE_TYPE_SWITCH)
             continue;
 
-         p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,
-                                                 osm_node_get_node_guid(p_remote_node));
-         if (p_remote_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl))
+         p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,
+                              osm_node_get_node_guid(p_remote_node));
+         if (!p_remote_sw)
          {
             /* remote node is not a switch */
             continue;
@@ -2466,11 +2945,16 @@ __osm_ftree_rank_switches_from_leafs(
 
          /* if needed, rank the remote switch and add it to the BFS list */
          if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1))
-            cl_list_insert_tail(p_ranking_bfs_list,
-                                &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
+         {
+            max_rank = p_remote_sw->rank;
+            cl_list_insert_tail(p_ranking_bfs_list, p_remote_sw);
+         }
       }
    }
 
+   /* set FatTree maximal switch rank */
+   p_ftree->max_switch_rank = max_rank;
+
 } /* __osm_ftree_rank_switches_from_leafs() */
 
 /***************************************************/
@@ -2508,7 +2992,7 @@ __osm_ftree_rank_leaf_switches(
                     "__osm_ftree_rank_leaf_switches: ERR AB0F: "
                     "CA conected directly to another CA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
-                    cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
+                    __osm_ftree_hca_get_guid_ho(p_hca),
                     cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)));
             res = -1;
             goto Exit;
@@ -2533,26 +3017,24 @@ __osm_ftree_rank_leaf_switches(
 
       /* remote node is switch */
 
-      p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,
-                                       p_osm_port->p_remote_physp->p_node->node_info.node_guid);
+      p_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,
+                  osm_node_get_node_guid(p_osm_port->p_remote_physp->p_node));
+      CL_ASSERT(p_sw);
 
-      CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
+      /* if needed, rank the remote switch and add it to the BFS list */
 
       if ( !__osm_ftree_sw_update_rank(p_sw, 0) )
          continue;
-
-      /* if needed, rank the remote switch and add it to the BFS list */
       osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
               "__osm_ftree_rank_leaf_switches: "
               "Marking rank of switch that is directly connected to CA:\n"
               "                                            - CA guid    : 0x%016" PRIx64 "\n"
               "                                            - Switch guid: 0x%016" PRIx64 "\n"
               "                                            - Switch LID : 0x%x\n",
-              cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
-              cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+              __osm_ftree_hca_get_guid_ho(p_hca),
+              __osm_ftree_sw_get_guid_ho(p_sw),
               cl_ntoh16(p_sw->base_lid));
-      cl_list_insert_tail(p_ranking_bfs_list,
-                          &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
+      cl_list_insert_tail(p_ranking_bfs_list, p_sw);
    }
 
  Exit:
@@ -2569,7 +3051,7 @@ __osm_ftree_sw_reverse_rank(
 {
    ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
    ftree_sw_t     * p_sw = (ftree_sw_t * const) p_map_item;
-   p_sw->rank = __osm_ftree_fabric_get_rank(p_ftree) - p_sw->rank - 1;
+   p_sw->rank = p_ftree->max_switch_rank - p_sw->rank;
 }
 
 /***************************************************
@@ -2588,6 +3070,7 @@ __osm_ftree_fabric_construct_hca_ports(
    osm_physp_t     * p_remote_osm_port;
    uint8_t           i;
    uint8_t           remote_port_num;
+   boolean_t         is_cn = FALSE;
    int res = 0;
 
    for (i = 0; i < osm_node_get_num_physp(p_node); i++)
@@ -2641,9 +3124,41 @@ __osm_ftree_fabric_construct_hca_ports(
 
       /* remote node is switch */
 
-      p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid);
-      CL_ASSERT( p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) );
-      CL_ASSERT( (p_remote_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree) );
+      p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,remote_node_guid);
+      CL_ASSERT( p_remote_sw );
+
+      /* If CN file is not supplied, then all the CAs are considered Compute Nodes.
+         Otherwise no CA is a CN by default, and only guids that are present in the
+         CN file will be marked as compute nodes. */
+      if ( !__osm_ftree_fabric_cns_provided(p_ftree) )
+      {
+         is_cn = TRUE;
+      }
+      else
+      {
+         ftree_guid_tbl_element_t * p_elem =
+            (ftree_guid_tbl_element_t *)cl_qmap_get(&p_ftree->cn_guid_tbl,
+                                                    osm_physp_get_port_guid(p_osm_port));
+         if (p_elem != (ftree_guid_tbl_element_t *)cl_qmap_end(&p_ftree->cn_guid_tbl))
+            is_cn = TRUE;
+      }
+
+      if (is_cn)
+      {
+         p_ftree->cn_num++;
+         p_hca->cn_num++;
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_construct_hca_ports: "
+                 "Marking CN port GUID 0x%016" PRIx64 "\n",
+                 cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
+      }
+      else
+      {
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_construct_hca_ports: "
+                 "Marking non-CN port GUID 0x%016" PRIx64 "\n",
+                 cl_ntoh64(osm_physp_get_port_guid(p_osm_port)));
+      }
 
       __osm_ftree_hca_add_port(
             p_hca,                                     /* local ftree_hca object */
@@ -2655,7 +3170,8 @@ __osm_ftree_fabric_construct_hca_ports(
             osm_physp_get_port_guid(p_remote_osm_port),/* remote port guid */
             remote_node_guid,                          /* remote node guid */
             remote_node_type,                          /* remote node type */
-            (void *) p_remote_sw);                     /* remote ftree_hca/sw object */
+            (void *) p_remote_sw,                      /* remote ftree_hca/sw object */
+            is_cn );                                   /* whether this port is compute node */
    }
 
  Exit:
@@ -2713,10 +3229,8 @@ __osm_ftree_fabric_construct_sw_ports(
          case IB_NODE_TYPE_CA:
             /* switch connected to hca */
 
-            CL_ASSERT((p_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree));
-
-            p_remote_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,remote_node_guid);
-            CL_ASSERT(p_remote_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl));
+            p_remote_hca = __osm_ftree_fabric_get_hca_by_guid(p_ftree,remote_node_guid);
+            CL_ASSERT(p_remote_hca);
 
             p_remote_hca_or_sw = (void *)p_remote_hca;
             direction = FTREE_DIRECTION_DOWN;
@@ -2727,8 +3241,8 @@ __osm_ftree_fabric_construct_sw_ports(
          case IB_NODE_TYPE_SWITCH:
             /* switch connected to another switch */
 
-            p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid);
-            CL_ASSERT(p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
+            p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,remote_node_guid);
+            CL_ASSERT(p_remote_sw);
             p_remote_hca_or_sw = (void *)p_remote_sw;
 
             if (abs(p_sw->rank - p_remote_sw->rank) != 1)
@@ -2740,10 +3254,10 @@ __osm_ftree_fabric_construct_sw_ports(
                        "       GUID 0x%016" PRIx64 ", LID 0x%x, rank %u\n",
                        p_sw->rank,
                        p_remote_sw->rank,
-                       cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)),
+                       __osm_ftree_sw_get_guid_ho(p_sw),
                        cl_ntoh16(p_sw->base_lid),
                        p_sw->rank,
-                       cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_osm_sw->p_node)),
+                       __osm_ftree_sw_get_guid_ho(p_remote_sw),
                        cl_ntoh16(p_remote_sw->base_lid),
                        p_remote_sw->rank);
                res = -1;
@@ -2795,7 +3309,126 @@ __osm_ftree_fabric_construct_sw_ports(
  ***************************************************/
 
 static int
-__osm_ftree_fabric_perform_ranking(
+__osm_ftree_fabric_rank_from_roots(
+   IN  ftree_fabric_t * p_ftree)
+{
+   osm_node_t  * p_osm_node;
+   osm_node_t  * p_remote_osm_node;
+   osm_physp_t * p_osm_physp;
+   ftree_sw_t  * p_sw;
+   ftree_sw_t  * p_remote_sw;
+   cl_list_t     ranking_bfs_list;
+   uint64_t *    p_guid;
+   int           res = 0;
+   unsigned      num_roots;
+   unsigned      max_rank = 0;
+   unsigned      i;
+   cl_list_iterator_t guid_iterator;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank_from_roots);
+   cl_list_init(&ranking_bfs_list,10);
+
+   /* Rank all the roots and add them to list */
+
+   guid_iterator = cl_list_head(&p_ftree->root_guid_list);
+   while( guid_iterator != cl_list_end(&p_ftree->root_guid_list) )
+   {
+      p_guid = (uint64_t*)cl_list_obj(guid_iterator);
+      guid_iterator = cl_list_next(guid_iterator);
+
+      p_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree, cl_hton64(*p_guid));
+      if (!p_sw)
+      {
+         /* the specified root guid wasn't found in the fabric */
+         osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR,
+                  "__osm_ftree_fabric_rank_from_roots: ERR AB24: "
+                  "Root switch GUID 0x%" PRIx64 " not found\n", *p_guid );
+         continue;
+      }
+
+      osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG,
+               "__osm_ftree_fabric_rank_from_roots: "
+               "Ranking root switch with GUID 0x%" PRIx64 "\n", *p_guid );
+      p_sw->rank = 0;
+      cl_list_insert_tail(&ranking_bfs_list, p_sw);
+   }
+
+   num_roots = cl_list_count(&ranking_bfs_list);
+   if (!num_roots)
+   {
+      osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR,
+               "__osm_ftree_fabric_rank_from_roots: ERR AB25: "
+               "No valid roots supplied\n");
+      res = -1;
+      goto Exit;
+   }
+
+   osm_log( &p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+            "__osm_ftree_fabric_rank_from_roots: "
+            "Ranked %u valid root switches\n", num_roots);
+
+   /* Now the list has all the roots.
+      BFS the subnet and update rank on all the switches. */
+
+   while (!cl_is_list_empty(&ranking_bfs_list))
+   {
+      p_sw = (ftree_sw_t *)cl_list_remove_head(&ranking_bfs_list);
+      p_osm_node = p_sw->p_osm_sw->p_node;
+
+      /* note: skipping port 0 on switches */
+      for (i = 1; i < osm_node_get_num_physp(p_osm_node); i++)
+      {
+         p_osm_physp = osm_node_get_physp_ptr(p_osm_node,i);
+         if (!osm_physp_is_valid(p_osm_physp))
+            continue;
+         if (!osm_link_is_healthy(p_osm_physp))
+            continue;
+
+         p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL);
+         if (!p_remote_osm_node)
+            continue;
+
+         if (osm_node_get_type(p_remote_osm_node) != IB_NODE_TYPE_SWITCH)
+            continue;
+
+         p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,
+                                osm_node_get_node_guid(p_remote_osm_node));
+         CL_ASSERT(p_remote_sw);
+
+         /* if needed, rank the remote switch and add it to the BFS list */
+         if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1))
+         {
+            osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG,
+                     "__osm_ftree_fabric_rank_from_roots: "
+                     "Ranking switch 0x%" PRIx64 " with rank %u\n",
+                      __osm_ftree_sw_get_guid_ho(p_remote_sw),
+                     p_remote_sw->rank);
+            max_rank = p_remote_sw->rank;
+            cl_list_insert_tail(&ranking_bfs_list,p_remote_sw);
+         }
+      }
+      /* done with ports of this switch - go to the next switch in the list */
+   }
+
+   osm_log( &p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+            "__osm_ftree_fabric_rank_from_roots: "
+            "Subnet ranking completed. Max Node Rank = %u\n",
+            max_rank );
+
+   /* set FatTree maximal switch rank */
+   p_ftree->max_switch_rank = max_rank;
+
+  Exit:
+   cl_list_destroy(&ranking_bfs_list);
+   OSM_LOG_EXIT( &p_ftree->p_osm->log );
+   return res;
+} /* __osm_ftree_fabric_rank_from_roots() */
+
+/***************************************************
+ ***************************************************/
+
+static int
+__osm_ftree_fabric_rank_from_hcas(
    IN  ftree_fabric_t * p_ftree)
 {
    ftree_hca_t * p_hca;
@@ -2803,11 +3436,9 @@ __osm_ftree_fabric_perform_ranking(
    cl_list_t     ranking_bfs_list;
    int res = 0;
 
-   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank_from_hcas);
 
-   /* Init the bfs list - the list of the switches that will be
-      initially filled with the leaf switches */
-   cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
+   cl_list_init(&ranking_bfs_list,10);
 
    /* Mark REVERSED rank of all the switches in the subnet.
       Start from switches that are connected to hca's, and
@@ -2821,7 +3452,7 @@ __osm_ftree_fabric_perform_ranking(
       {
          res = -1;
          osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
-                 "__osm_ftree_fabric_perform_ranking: ERR AB14: "
+                 "__osm_ftree_fabric_rank_from_hcas: ERR AB14: "
                  "Subnet ranking failed - subnet is not FatTree");
          goto Exit;
       }
@@ -2830,36 +3461,106 @@ __osm_ftree_fabric_perform_ranking(
    /* Now rank rest of the switches in the fabric, while the
       list already contains all the ranked leaf switches */
    __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list);
+
+   /* fix ranking of the switches by reversing the ranking direction */
+   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree);
+
+  Exit:
    cl_list_destroy(&ranking_bfs_list);
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return res;
+} /* __osm_ftree_fabric_rank_from_hcas() */
 
-   /* REVERSED ranking of all the switches completed.
-      Calculate and set FatTree rank */
+/***************************************************
+ ***************************************************/
 
-   __osm_ftree_fabric_calculate_rank(p_ftree);
-   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
-           "__osm_ftree_fabric_perform_ranking: "
-           "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree));
+static int
+__osm_ftree_fabric_rank(
+   IN  ftree_fabric_t * p_ftree)
+{
+   int res = 0;
 
-   /* fix ranking of the switches by reversing the ranking direction */
-   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank);
 
-   if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
-        __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
-   {
-      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
-              "__osm_ftree_fabric_perform_ranking: ERR AB15: "
-              "Tree rank is %u (should be between %u and %u)\n",
-              __osm_ftree_fabric_get_rank(p_ftree),
-              FAT_TREE_MIN_RANK,
-              FAT_TREE_MAX_RANK);
-      res = -1;
+   if ( __osm_ftree_fabric_roots_provided(p_ftree) )
+      res = __osm_ftree_fabric_rank_from_roots(p_ftree);
+   else
+      res = __osm_ftree_fabric_rank_from_hcas(p_ftree);
+
+   if (res)
       goto Exit;
-   }
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_rank: "
+           "FatTree max switch rank is %u\n", p_ftree->max_switch_rank);
 
   Exit:
    OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
-} /* __osm_ftree_fabric_perform_ranking() */
+} /* __osm_ftree_fabric_rank() */
+
+/***************************************************
+ ***************************************************/
+
+static void
+__osm_ftree_fabric_set_leaf_rank(
+   IN  ftree_fabric_t * p_ftree)
+{
+   unsigned i;
+   ftree_sw_t  * p_sw;
+   ftree_hca_t * p_hca;
+   ftree_hca_t * p_next_hca;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_set_leaf_rank);
+
+   if ( !__osm_ftree_fabric_roots_provided(p_ftree) )
+   {
+      /* If root file is not provided, the fabric has to be pure fat-tree
+         in terms of ranking. Thus, leaf switches rank is the max rank.*/
+      p_ftree->leaf_switch_rank = p_ftree->max_switch_rank;
+   }
+   else
+   {
+      /* Find the first CN and set the leaf_switch_rank to the rank
+         of the switch that is connected to this CN. Later we will
+         ensure that all the leaf switches have the same rank. */
+      p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+      while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) )
+      {
+         p_hca = p_next_hca;
+         if (p_hca->cn_num)
+            break;
+         p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item);
+      }
+      /* we know that there are CNs in the fabric, so just to be sure...*/
+      CL_ASSERT( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) );
+
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_fabric_set_leaf_rank: "
+              "Selected CN port GUID 0x%" PRIx64 "\n",
+              __osm_ftree_hca_get_guid_ho(p_hca));
+
+      for( i = 0;
+           (i < p_hca->up_port_groups_num) && (!p_hca->up_port_groups[i]->is_cn);
+           i++ )
+         ;
+      CL_ASSERT( i < p_hca->up_port_groups_num );
+      CL_ASSERT( p_hca->up_port_groups[i]->remote_node_type == IB_NODE_TYPE_SWITCH );
+
+      p_sw = p_hca->up_port_groups[i]->remote_hca_or_sw.p_sw;
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_fabric_set_leaf_rank: "
+              "Selected leaf switch GUID 0x%" PRIx64 ", rank %u\n",
+              __osm_ftree_sw_get_guid_ho(p_sw),
+              p_sw->rank);
+      p_ftree->leaf_switch_rank = p_sw->rank;
+   }
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_set_leaf_rank: "
+           "FatTree leaf switch rank is %u\n", p_ftree->leaf_switch_rank);
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+} /* __osm_ftree_fabric_set_leaf_rank() */
 
 /***************************************************
  ***************************************************/
@@ -2907,6 +3608,104 @@ __osm_ftree_fabric_populate_ports(
 /***************************************************
  ***************************************************/
 
+static void
+__osm_ftree_convert_list2qmap(
+   cl_list_t * p_guid_list,
+   cl_qmap_t * p_map )
+{
+   uint64_t * p_guid;
+   CL_ASSERT(p_map);
+   if ( !p_guid_list || !cl_list_count(p_guid_list) )
+      return;
+
+   while ( (p_guid = (uint64_t*)cl_list_remove_head(p_guid_list)) )
+   {
+      /* object key is guid in network order */
+      cl_qmap_insert( p_map, cl_hton64(*p_guid),
+                      &((__osm_ftree_guid_tbl_element_create(*p_guid))->map_item) );
+      free(p_guid);
+   }
+   CL_ASSERT(cl_is_list_empty(p_guid_list));
+
+} /* __osm_ftree_convert_list2qmap() */
+
+/***************************************************
+ ***************************************************/
+
+static int
+__osm_ftree_fabric_read_guid_files(
+   IN  ftree_fabric_t * p_ftree)
+{
+   int status = 0;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_read_guid_files);
+
+   if ( __osm_ftree_fabric_roots_provided(p_ftree) )
+   {
+      osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG,
+               "__osm_ftree_fabric_read_guid_files: "
+               "Fetching root nodes from file %s\n",
+               p_ftree->p_osm->subn.opt.root_guid_file );
+
+      if ( osm_ucast_mgr_read_guid_file(&p_ftree->p_osm->sm.ucast_mgr,
+                                        p_ftree->p_osm->subn.opt.root_guid_file,
+                                        &p_ftree->root_guid_list ) )
+      {
+         status = -1;
+         goto Exit;
+      }
+
+      if ( !cl_list_count(&p_ftree->root_guid_list) )
+      {
+         osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR,
+                  "__osm_ftree_fabric_read_guid_files: ERR AB22: "
+                  "Root guids file has no valid guids\n");
+         status = -1;
+         goto Exit;
+      }
+   }
+
+   if ( __osm_ftree_fabric_cns_provided(p_ftree) )
+   {
+      cl_list_t cn_guid_list;
+      cl_list_construct(&cn_guid_list);
+      cl_list_init(&cn_guid_list, 10);
+
+      osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG,
+               "__osm_ftree_fabric_read_guid_files: "
+               "Fetching compute nodes from file %s\n",
+               p_ftree->p_osm->subn.opt.cn_guid_file );
+
+      if ( osm_ucast_mgr_read_guid_file(&p_ftree->p_osm->sm.ucast_mgr,
+                                        p_ftree->p_osm->subn.opt.cn_guid_file,
+                                        &cn_guid_list) )
+      {
+         status = -1;
+         goto Exit;
+      }
+
+      if ( !cl_list_count(&cn_guid_list) )
+      {
+         osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR,
+                  "__osm_ftree_fabric_read_guid_files: ERR AB23: "
+                  "Compute node guids file has no valid guids\n");
+         status = -1;
+         goto Exit;
+      }
+
+      __osm_ftree_convert_list2qmap(&cn_guid_list, &p_ftree->cn_guid_tbl);
+      cl_list_destroy(&cn_guid_list);
+      CL_ASSERT(cl_qmap_count(&p_ftree->cn_guid_tbl));
+   }
+
+  Exit:
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return status;
+} /*__osm_ftree_fabric_read_guid_files() */
+
+/***************************************************
+ ***************************************************/
+
 static int
 __osm_ftree_construct_fabric(
    IN  void * context)
@@ -2964,6 +3763,18 @@ __osm_ftree_construct_fabric(
       goto Exit;
    }
 
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
+           "Reading guid files provided by user\n");
+   if (__osm_ftree_fabric_read_guid_files(p_ftree) != 0)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Failed reading guid files - "
+              "falling back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
    if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
@@ -2974,28 +3785,26 @@ __osm_ftree_construct_fabric(
       goto Exit;
    }
 
+   /* Rank all the switches in the fabric.
+      After that we will know only fabric max switch rank.
+      We will be able to check leaf switches rank and the
+      whole tree rank after filling ports and marking CNs.*/
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
            "__osm_ftree_construct_fabric: Ranking FatTree\n");
-
-   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
-   {
-      if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
-         osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
-                 "Fabric rank is %u (>%u) - "
-                 "fat-tree routing falls back to default routing\n",
-                 __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK);
-      else if (__osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK)
-         osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
-                 "Fabric rank is %u (<%u) - "
-                 "fat-tree routing falls back to default routing\n",
-                 __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK);
+   if (__osm_ftree_fabric_rank(p_ftree) != 0)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Failed ranking the tree  - "
+              "fat-tree routing falls back to default routing\n");
       status = -1;
       goto Exit;
    }
 
    /* For each hca and switch, construct array of ports.
-      This is done after the whole FatTree data structure is ready, because
-      we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
+      This is done after the whole FatTree data structure is ready,
+      because we want the ports to have pointers to ftree_{sw,hca}_t
+      objects, and we need the switches to be already ranked because
+      that's how the port direction is determined.*/
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
            "__osm_ftree_construct_fabric: "
            "Populating CA & switch ports\n");
@@ -3007,14 +3816,68 @@ __osm_ftree_construct_fabric(
       status = -1;
       goto Exit;
    }
+   else if (p_ftree->cn_num == 0)
+   {
+      osm_log( &p_ftree->p_osm->log, OSM_LOG_SYS,
+               "Fabric has no valid compute nodes - "
+               "routing falls back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
+   /* Now that the CA ports have been created and CNs were marked,
+      we can complete the fabric ranking - set leaf switches rank.*/
+   __osm_ftree_fabric_set_leaf_rank(p_ftree);
+
+   if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
+        __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fabric rank is %u (should be between %u and %u) - "
+              "fat-tree routing falls back to default routing\n",
+              __osm_ftree_fabric_get_rank(p_ftree),
+              FAT_TREE_MIN_RANK,
+              FAT_TREE_MAX_RANK);
+      status = -1;
+      goto Exit;
+   }
+
+   /* Mark all the switches in the fabric with rank equal to
+      p_ftree->leaf_switch_rank and that are also connected to CNs.
+      As a by-product, this function also runs basic topology
+      validation - it checks that all the CNs are at the same rank.*/
+   if (__osm_ftree_fabric_mark_leaf_switches(p_ftree))
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fabric topology is not a fat-tree - "
+              "routing falls back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
 
-   /* Assign index to all the switches and hca's in the fabric.
-      This function also sorts all the port arrays of the switches
-      by the remote switch index, creates a leaf switch array
-      sorted by the switch index, and tracks the maximal number of
-      hcas per leaf switch. */
+   /* Assign index to all the switches in the fabric.
+      This function also sorts leaf switch array by the switch index,
+      sorts all the port arrays of the indexed switches by remote
+      switch index, and creates switch-by-tuple table (sw_by_tuple_tbl) */
    __osm_ftree_fabric_make_indexing(p_ftree);
 
+   /* Create leaf switch array sorted by index.
+      This array contains switches with rank equal to p_ftree->leaf_switch_rank
+      and that are also connected to CNs (REAL leafs), and it may contain
+      switches at the same leaf rank w/o CNs, if this is the order of indexing.
+      In any case, the first and the last switches in the array are REAL leafs.*/
+   if (__osm_ftree_fabric_create_leaf_switch_array(p_ftree))
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fabric topology is not a fat-tree - "
+              "routing falls back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
+   /* calculate and set ftree.max_cn_per_leaf field */
+   __osm_ftree_fabric_set_max_cn_per_leaf(p_ftree);
+
    /* print general info about fabric topology */
    __osm_ftree_fabric_dump_general_info(p_ftree);
 
@@ -3022,7 +3885,10 @@ __osm_ftree_construct_fabric(
    if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
        __osm_ftree_fabric_dump(p_ftree);
 
-   if (! __osm_ftree_fabric_validate_topology(p_ftree))
+   /* the fabric is required to be PURE fat-tree only if the root
+      guid file hasn't been provided by user */
+   if ( ! __osm_ftree_fabric_roots_provided(p_ftree) &&
+        ! __osm_ftree_fabric_validate_topology(p_ftree) )
    {
       osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not a fat-tree - "
@@ -3080,8 +3946,12 @@ __osm_ftree_do_routing(
            "Starting FatTree routing\n");
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Filling switch forwarding tables for routes to CAs\n");
-   __osm_ftree_fabric_route_to_hcas(p_ftree);
+           "Filling switch forwarding tables for Compute Nodes\n");
+   __osm_ftree_fabric_route_to_cns(p_ftree);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Filling switch forwarding tables for non-CN targets\n");
+   __osm_ftree_fabric_route_to_non_cns(p_ftree);
 
    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
            "Filling switch forwarding tables for switch-to-switch pathes\n");
-- 
1.5.1.4
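
For readers who want the ranking idea without the OpenSM plumbing, here is a
minimal standalone sketch of the "rank from roots" BFS and the rank reversal
used when ranking from the leaves. It is not the patch's code: the
sketch_* names, the adjacency-matrix fabric, and the fixed SKETCH_MAX_SW
limit are all assumptions made for illustration, and the exact semantics of
__osm_ftree_sw_update_rank() are not visible in these hunks, so the sketch
simply ranks each switch once, on its first visit.

```c
#include <assert.h>

#define SKETCH_MAX_SW   64     /* hypothetical limit, not an OpenSM constant */
#define SKETCH_UNRANKED (-1)

/* Hypothetical stand-in for the fabric: switches are indices 0..n-1. */
typedef struct {
    int n;                                  /* number of switches */
    int adj[SKETCH_MAX_SW][SKETCH_MAX_SW];  /* adj[a][b] != 0 if a and b are linked */
    int rank[SKETCH_MAX_SW];                /* SKETCH_UNRANKED until visited */
} sketch_fabric_t;

static void sketch_init(sketch_fabric_t *f, int n)
{
    int a, b;

    assert(n >= 0 && n <= SKETCH_MAX_SW);
    f->n = n;
    for (a = 0; a < n; a++) {
        f->rank[a] = SKETCH_UNRANKED;
        for (b = 0; b < n; b++)
            f->adj[a][b] = 0;
    }
}

static void sketch_link(sketch_fabric_t *f, int a, int b)
{
    f->adj[a][b] = 1;
    f->adj[b][a] = 1;
}

/* Rank all switches by BFS distance from the given roots.
   Returns the maximal rank, or -1 if none of the roots is valid. */
static int sketch_rank_from_roots(sketch_fabric_t *f, const int *roots, int num_roots)
{
    int queue[SKETCH_MAX_SW];   /* each switch is enqueued at most once */
    int head = 0, tail = 0, max_rank = 0, i;

    /* Rank all the valid roots and add them to the BFS queue. */
    for (i = 0; i < num_roots; i++) {
        int r = roots[i];
        if (r < 0 || r >= f->n || f->rank[r] == 0)
            continue;           /* unknown or duplicate root: skip it */
        f->rank[r] = 0;
        queue[tail++] = r;
    }
    if (tail == 0)
        return -1;

    /* BFS: a switch gets rank (rank of the switch that reached it) + 1. */
    while (head < tail) {
        int sw = queue[head++];
        int peer;

        for (peer = 0; peer < f->n; peer++) {
            if (!f->adj[sw][peer] || f->rank[peer] != SKETCH_UNRANKED)
                continue;
            f->rank[peer] = f->rank[sw] + 1;
            if (f->rank[peer] > max_rank)
                max_rank = f->rank[peer];
            queue[tail++] = peer;
        }
    }
    return max_rank;
}

/* Flip the ranking direction, as in rank = max_switch_rank - rank. */
static void sketch_reverse_ranks(sketch_fabric_t *f, int max_rank)
{
    int a;

    for (a = 0; a < f->n; a++)
        if (f->rank[a] != SKETCH_UNRANKED)
            f->rank[a] = max_rank - f->rank[a];
}
```

Seeding the queue with every root before the loop starts is what makes this
a multi-source BFS: each switch ends up with its distance to the nearest
root, and the largest such distance is what the patch stores in
max_switch_rank.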


From landman at scalableinformatics.com  Sun Jul  8 08:37:30 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 08 Jul 2007 11:37:30 -0400
Subject: [ofa-general] problem with rdma_ucm in OpenSuSE 10.2 default kernel
Message-ID: <469104BA.4080408@scalableinformatics.com>

After getting it to build correctly, installing it, and configuring it,
I am getting a crash in rdma_ucm.  In addition, for some reason there is a
dependency upon ipv6.ko which depmod doesn't pick up.  The latter is
easily solved, but the former is troubling.  Here is the snippet from
the messages file:

> Jul  8 11:08:30 jackrabbit kernel: ----------- [cut here ] --------- [please bite here ] ---------
> Jul  8 11:08:30 jackrabbit kernel: Kernel BUG at fs/sysfs/file.c:473
> Jul  8 11:08:30 jackrabbit kernel: invalid opcode: 0000 [1] SMP 
> Jul  8 11:08:30 jackrabbit kernel: last sysfs file: /class/net/ib0/mode
> Jul  8 11:08:30 jackrabbit kernel: CPU 3 
> Jul  8 11:08:30 jackrabbit kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_local_sa ib_ipoib ipv6 snd_pcm_oss snd_mixer_oss ib_uverbs snd_seq ib_umad snd_seq_device ib_cm ib_sa cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table button battery ac ipmi_si ipmi_devintf ipmi_msghandler apparmor aamatch_pcre ext3 jbd mbcache loop dm_mod usbhid usb_storage snd_hda_intel snd_hda_codec snd_pcm snd_timer ib_mthca snd shpchp ehci_hcd ib_mad ohci_hcd ohci1394 ib_core soundcore pci_hotplug ide_cd i2c_nforce2 ieee1394 forcedeth cdrom snd_page_alloc usbcore i2c_core xfs edd fan sg arcmsr sata_nv libata amd74xx thermal processor sd_mod scsi_mod ide_disk ide_core
> Jul  8 11:08:30 jackrabbit kernel: Pid: 5464, comm: modprobe Tainted: G     U 2.6.18.2-34-default #1
> Jul  8 11:08:30 jackrabbit kernel: RIP: 0010:[<ffffffff802eaeb1>]  [<ffffffff802eaeb1>] sysfs_create_file+0x19/0x31
> Jul  8 11:08:30 jackrabbit kernel: RSP: 0000:ffff81042171de50  EFLAGS: 00010202
> Jul  8 11:08:30 jackrabbit kernel: RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffff803eddf8
> Jul  8 11:08:30 jackrabbit kernel: RDX: 0000000000000000 RSI: ffffffff8856d720 RDI: ffff8104274f3810
> Jul  8 11:08:30 jackrabbit kernel: RBP: ffff810423e8c000 R08: ffffffff804d83b8 R09: ffff810424bb7b80
> Jul  8 11:08:30 jackrabbit kernel: R10: 0000000000000022 R11: ffff810424bb7b80 R12: ffff810423e8c5c0
> Jul  8 11:08:30 jackrabbit kernel: R13: ffffffff8856d900 R14: ffff810423e8c558 R15: ffffc20000a87e48
> Jul  8 11:08:30 jackrabbit kernel: FS:  00002b5c9772f6f0(0000) GS:ffff810428f7a9c0(0000) knlGS:0000000000000000
> Jul  8 11:08:30 jackrabbit kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jul  8 11:08:30 jackrabbit kernel: CR2: 000000000062f007 CR3: 0000000226d4a000 CR4: 00000000000006e0
> Jul  8 11:08:30 jackrabbit kernel: Process modprobe (pid: 5464, threadinfo ffff81042171c000, task ffff8104288e3830)
> Jul  8 11:08:30 jackrabbit kernel: Stack:  ffffffff881a1026 ffffffff8856d900 ffffffff80299bcc 0000000000000019
> Jul  8 11:08:30 jackrabbit kernel:  0000000000000000 000000002171de78 0000000000000000 0000000000000000
> Jul  8 11:08:30 jackrabbit kernel:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Jul  8 11:08:30 jackrabbit kernel: Call Trace:
> Jul  8 11:08:30 jackrabbit kernel:  [<ffffffff881a1026>] :rdma_ucm:ucma_init+0x26/0x4a
> Jul  8 11:08:30 jackrabbit kernel:  [<ffffffff80299bcc>] sys_init_module+0x172f/0x18e5
> Jul  8 11:08:30 jackrabbit kernel:  [<ffffffff8025800e>] system_call+0x7e/0x83
> Jul  8 11:08:30 jackrabbit kernel: 
> Jul  8 11:08:30 jackrabbit kernel: 
> Jul  8 11:08:30 jackrabbit kernel: Code: 0f 0b 68 b8 75 40 80 c2 d9 01 48 8b 7f 48 ba 04 00 00 00 e9 
> Jul  8 11:08:30 jackrabbit kernel: RIP  [<ffffffff802eaeb1>] sysfs_create_file+0x19/0x31
> Jul  8 11:08:30 jackrabbit kernel:  RSP <ffff81042171de50>
> Jul  8 11:08:30 jackrabbit kernel:  <6>ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> Jul  8 11:08:36 jackrabbit kernel: eth0: no IPv6 routers present
> Jul  8 11:08:40 jackrabbit kernel: ib0: no IPv6 routers present

I bring up ipoib for testing (pinging) hosts, as well as for carrying some
of the ssh traffic.  It is sometimes quite useful.

Is the above a known problem?  Should I file a bug report?  The tainted 
kernel is likely due to the arcmsr driver, though it is open source, so 
I am not sure what is "tainted" about it.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From rdreier at cisco.com  Sun Jul  8 08:54:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 08 Jul 2007 08:54:53 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070708001531.GT3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Sun, 8 Jul 2007 02:15:32 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz>
Message-ID: <adatzse28mq.fsf@cisco.com>

 > 000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
 > 010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
 > 020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
 > 030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
 > 040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
 > 050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00

OK, my guess right now would be that when the driver is trying to give
memory to the HCA to use for its internal hardware data structures,
the bus addresses given to the HCA end up being wrong for some reason.
There could be a bug in mthca, but since this code is working fine on
lots of non-Xen systems (and not just i386/x86-64 but also ppc and
ia64 at least) right now I would be more suspicious of a bug in the
Xen domU's pci_map_sg() or something like that.

You can look in mthca_memfree.c, specifically mthca_alloc_icm() to see
how the memory to give to the HCA is allocated and mapped.  I gave it
a quick look over and the way the DMA mapping API is used looks OK to
me, but perhaps there is a subtle problem that is exposed by Xen.
Although as I said before, right now I think it's more likely that we
are hitting a bug in the Xen domU implementation of DMA mapping.


Michael, does my guess about the source of corruption make sense?  Is
that pattern of every fourth byte counting up 00 ... 1f something the
HCA would write during initialization of ICM?

 - R.


From mst at dev.mellanox.co.il  Sun Jul  8 11:17:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 8 Jul 2007 21:17:15 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adatzse28mq.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adatzse28mq.fsf@cisco.com>
Message-ID: <20070708181715.GB32518@mellanox.co.il>

> Michael, does my guess about the source of corruption make sense?  Is
> that pattern of every fourth byte counting up 00 ... 1f something the
> HCA would write during initialization of ICM?

Yes.

-- 
MST


From halr at voltaire.com  Sun Jul  8 15:15:17 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Jul 2007 18:15:17 -0400
Subject: [ofa-general] Re: [PATCH] osm: enhancing fat-tree routing for
	non-pure trees
In-Reply-To: <4690ECDD.7030106@dev.mellanox.co.il>
References: <4690ECDD.7030106@dev.mellanox.co.il>
Message-ID: <1183932907.25217.312602.camel@hal.voltaire.com>

Hi Yevgeny,

On Sun, 2007-07-08 at 09:55, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> This patch handles the two new options for fat-tree routing:
> root guid file and compute node guid file. With these options,
> fat-tree routing is able to handle trees that are not pure
> fat-trees, or that are not even symmetrical.
> But the routing "quality" depends on the tree "correctness" - 
> the more the topology looks like pure fat-tree, the better
> the routing.
> 
> All the changes are in one file - osm_ucast_ftree.c, so as
> much as I've tried to divide this patch into separate stages,
> I found myself going back and fixing things too many times, so
> at this point it won't make sense to send this patch in parts,
> as earlier patches would have too much wrong code that was fixed 
> later.
> 
> Bottom line: sorry, but this thing has to go in a single patch.
> 
> Here's what this patch does:
> 
>  1. Some modifications to ftree data structures and functions
>      - Added guid getters for CAs and switches
>      - Added node type and guid for each port group
>      - Some naming changes
>      - Added get_sw_by_guid and get_hca_by_guid functions
> 
>  2. Reading roots and compute nodes from guid files
>      - Marking CAs with the number of CNs on the node
>      - Marking port groups if they belong to CN
> 
>  3. Ranking rewritten to support root guids
>      - ftree.tree_rank replaced by two ranks:
>        ftree.max_switch_rank and ftree.leaf_switch_rank.
>      - Tree rank for routing is considered as (ftree.leaf_switch_rank + 1)
> 
>  4. Created leaf switch array that contains all the leafs
>     with CNs and possibly leafs between them, according to
>     the fabric indexing.
> 
>  5. Checking new "lighter" topology constraint
>      - all the leafs with real CNs should be at the same tree rank.
>  
>  6. Implemented the routing itself:
>      - routing to all the CNs first
>      - routing dummy targets for all the missing nodes
>        or non-CNs that are connected to leaf switches
>      - routing to all the non-CN CAs in the fabric
>        (routing them as real targets on secondary path)
>      - routing to all the switch-to-switch paths (left the same)
> 
>  7. Updated ordering file dump function
>      - Treating non-compute nodes as dummies
> 
> -- Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From rdreier at cisco.com  Sun Jul  8 20:21:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 08 Jul 2007 20:21:01 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for
	2.6.23
In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
	(Arthur Jones's message of "Fri, 06 Jul 2007 12:48:17 -0700")
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
Message-ID: <adalkdq1cv6.fsf@cisco.com>

thanks, applied 1-8


From rdreier at cisco.com  Sun Jul  8 20:21:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 08 Jul 2007 20:21:28 -0700
Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS
In-Reply-To: <1183755367.25217.102865.camel@hal.voltaire.com> (Hal
	Rosenstock's message of "06 Jul 2007 16:56:08 -0400")
References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com>
	<20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com>
	<1183755367.25217.102865.camel@hal.voltaire.com>
Message-ID: <adahcoe1cuf.fsf@cisco.com>

I also added this patch into my queue:

commit c4c9e9a665495480ba88f0f7a7649b8dcbbdeaa6
Author: Roland Dreier <rolandd at cisco.com>
Date:   Sun Jul 8 20:20:48 2007 -0700

    IB: Update mailing list address
    
    The InfiniBand / RDMA discussion list has moved.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/MAINTAINERS b/MAINTAINERS
index 57ebf1e..0e0aac4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -371,7 +371,7 @@ P:	Tom Tucker
 M:	tom at opengridcomputing.com
 P:	Steve Wise
 M:	swise at opengridcomputing.com
-L:	openib-general at openib.org
+L:	general at lists.openfabrics.org
 S:	Maintained
 
 AOA (Apple Onboard Audio) ALSA DRIVER
@@ -1396,7 +1396,7 @@ P:	Hoang-Nam Nguyen
 M:	hnguyen at de.ibm.com
 P:	Christoph Raisch
 M:	raisch at de.ibm.com
-L:	openib-general at openib.org
+L:	general at lists.openfabrics.org
 S:	Supported
 
 EMU10K1 SOUND DRIVER
@@ -1851,7 +1851,7 @@ P:	Sean Hefty
 M:	mshefty at ichips.intel.com
 P:	Hal Rosenstock
 M:	halr at voltaire.com
-L:	openib-general at openib.org
+L:	general at lists.openfabrics.org
 W:	http://www.openib.org/
 T:	git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git
 S:	Supported


From liu_jf at neusoft.com  Sun Jul  8 20:35:34 2007
From: liu_jf at neusoft.com (liu_jf at neusoft.com)
Date: Mon, 09 Jul 2007 11:35:34 +0800
Subject: [ofa-general] Generate ib_srpt.ko Failed!
Message-ID: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com>

Dear all,
    I used OFED-1.2 to build the SCSI target modules, but when I
run the command "./configure --with-srp-target-mod", many errors
occur. Most are kernel patch failures. My OS is CentOS 5.0, with kernel
version 2.6.18-8.el5. Can anyone give me some suggestions? Any help
would be greatly appreciated!
    Thank you!

Yours,
ljf




From rdreier at cisco.com  Sun Jul  8 22:03:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 08 Jul 2007 22:03:16 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070708001531.GT3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Sun, 8 Jul 2007 02:15:32 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz>
Message-ID: <adad4z2184r.fsf@cisco.com>

I don't know much about how Xen works, especially the PCI stuff in Xen
3.1.  So this may be a stupid idea, but anyway....

Is the memory given to a domU always physically contiguous?  If not,
what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
and allocate 256 KB or something like that?  Let's assume that the
domU kernel has enough guest contiguous pages to satisfy the
allocation -- is there any guarantee that the pages are really
physically contiguous?

If not, what happens if the domU kernel does pci_map_sg() on an sglist
with >0 order pages in it that are not physically contiguous?  The DMA
mapping API only allows one bus address to be returned for each page,
even if they are order >0 and hence more than 4 KB.  So if the pages
are guest contiguous but not physical host contiguous it seems we
could end up with the problem you see, where the domU mthca driver
tries to pass memory to the HCA but the HCA ends up writing to
different memory.

 - R.


From jackm at dev.mellanox.co.il  Mon Jul  9 00:12:52 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 9 Jul 2007 10:12:52 +0300
Subject: [ofa-general] [PATCH] mlx4: add device reset to Internal Error
	handling mechanism
Message-ID: <200707091012.52418.jackm@dev.mellanox.co.il>

Add device reset to mlx4 Internal Error handling. Also, detect errors
via polling the device error buffer (rather than via interrupt), because
this provides better coverage.

This patch also disables the detection of Internal Errors via a device
interrupt, because we wish to avoid the complexity of supporting
two independent detection mechanisms.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c
index 1bb088a..94bc784 100644
--- a/drivers/net/mlx4/catas.c
+++ b/drivers/net/mlx4/catas.c
@@ -30,15 +30,32 @@
  * SOFTWARE.
  */
 
+#include <linux/jiffies.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
 #include "mlx4.h"
 
+enum {
+	MLX4_CATAS_POLL_INTERVAL	= 5 * HZ,
+};
+
+static DEFINE_SPINLOCK(catas_lock);
+
+static LIST_HEAD(catas_list);
+static struct workqueue_struct *catas_wq;
+static struct work_struct catas_work;
+
+static int ierr_reset_disable;
+module_param_named(ierr_reset_disable, ierr_reset_disable, int, 0644);
+MODULE_PARM_DESC(ierr_reset_disable, "disable reset on Internal Error event if nonzero");
+
 void mlx4_handle_catas_err(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
 	int i;
 
-	mlx4_err(dev, "Catastrophic error detected:\n");
+	mlx4_err(dev, "Internal error detected:\n");
 	for (i = 0; i < priv->fw.catas_size; ++i)
 		mlx4_err(dev, "  buf[%02x]: %08x\n",
 			 i, swab32(readl(priv->catas_err.map + i)));
@@ -46,25 +63,118 @@ void mlx4_handle_catas_err(struct mlx4_dev *dev)
 	mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0);
 }
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev)
+static void catas_reset(struct work_struct *work)
+{
+	struct mlx4_priv *priv, *tmppriv;
+	struct mlx4_dev *dev;
+
+	LIST_HEAD(tlist);
+	int ret;
+
+	spin_lock_irq(&catas_lock);
+	list_splice_init(&catas_list, &tlist);
+	spin_unlock_irq(&catas_lock);
+
+	list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) {
+		ret = mlx4_restart_one(priv->dev.pdev);
+		dev = &priv->dev;
+		if (ret)
+			mlx4_err(dev, "Reset failed (%d)\n", ret);
+		else
+			mlx4_dbg(dev, "Reset succeeded\n");
+	}
+}
+
+static void handle_catas(struct mlx4_dev *dev)
+{
+	unsigned long flags;
+	struct mlx4_priv *priv = mlx4_priv(dev);
+
+	mlx4_handle_catas_err(dev);
+
+	if (ierr_reset_disable)
+		return;
+
+	spin_lock_irqsave(&catas_lock, flags);
+	list_add(&priv->catas_err.list, &catas_list);
+	queue_work(catas_wq, &catas_work);
+	spin_unlock_irqrestore(&catas_lock, flags);
+}
+
+static void poll_catas(unsigned long dev_ptr)
+{
+	struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr;
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	unsigned long flags;
+
+	if (readl(priv->catas_err.map)) {
+		handle_catas(&priv->dev);
+		return;
+	}
+
+	spin_lock_irqsave(&catas_lock, flags);
+	if (!priv->catas_err.stop)
+		mod_timer(&priv->catas_err.timer,
+			  jiffies + MLX4_CATAS_POLL_INTERVAL);
+	spin_unlock_irqrestore(&catas_lock, flags);
+
+	return;
+}
+
+void mlx4_start_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	unsigned long addr;
 
+	init_timer(&priv->catas_err.timer);
+	priv->catas_err.stop = 0;
+	priv->catas_err.map  = NULL;
+
 	addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) +
 		priv->fw.catas_offset;
 
 	priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4);
 	if (!priv->catas_err.map)
-		mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n",
+		mlx4_warn(dev, "Failed to map Internal Error buffer at 0x%lx\n",
 			  addr);
 
+	priv->catas_err.timer.data     = (unsigned long) dev;
+	priv->catas_err.timer.function = poll_catas;
+	priv->catas_err.timer.expires  = jiffies + MLX4_CATAS_POLL_INTERVAL;
+	INIT_LIST_HEAD(&priv->catas_err.list);
+	add_timer(&priv->catas_err.timer);
 }
 
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev)
+void mlx4_stop_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
+	spin_lock_irq(&catas_lock);
+	priv->catas_err.stop = 1;
+	spin_unlock_irq(&catas_lock);
+
+	del_timer_sync(&priv->catas_err.timer);
+
 	if (priv->catas_err.map)
 		iounmap(priv->catas_err.map);
+
+	spin_lock_irq(&catas_lock);
+	list_del(&priv->catas_err.list);
+	spin_unlock_irq(&catas_lock);
+}
+
+int __init mlx4_catas_init(void)
+{
+	INIT_WORK(&catas_work, catas_reset);
+
+	catas_wq = create_singlethread_workqueue("mlx4_err");
+	if (!catas_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void mlx4_catas_cleanup(void)
+{
+	destroy_workqueue(catas_wq);
 }
diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index 27a82ce..a9841c6 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -283,7 +283,9 @@ static irqreturn_t mlx4_msi_x_interrupt(int irq, void *eq_ptr)
 
 static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr)
 {
-	mlx4_handle_catas_err(dev_ptr);
+	/* disable handling catas errors via interrupt. */
+	/* We now handle them via polling.              */
+	/* mlx4_handle_catas_err(dev_ptr);              */
 
 	/* MSI-X vectors always belong to us */
 	return IRQ_HANDLED;
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 9ae951b..be5d9e9 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev *dev)
 		mlx4_add_device(intf, priv);
 
 	mutex_unlock(&intf_mutex);
+	mlx4_start_catas_poll(dev);
 
 	return 0;
 }
@@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_interface *intf;
 
+	mlx4_stop_catas_poll(dev);
 	mutex_lock(&intf_mutex);
 
 	list_for_each_entry(intf, &intf_list, list)
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 41eafeb..297fe41 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -582,8 +582,6 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev)
 		goto err_pd_table_free;
 	}
 
-	mlx4_map_catas_buf(dev);
-
 	err = mlx4_init_eq_table(dev);
 	if (err) {
 		mlx4_err(dev, "Failed to initialize "
@@ -659,7 +657,6 @@ err_eq_table_free:
 	mlx4_cleanup_eq_table(dev);
 
 err_catas_buf:
-	mlx4_unmap_catas_buf(dev);
 	mlx4_cleanup_mr_table(dev);
 
 err_pd_table_free:
@@ -835,9 +832,6 @@ err_cleanup:
 	mlx4_cleanup_cq_table(dev);
 	mlx4_cmd_use_polling(dev);
 	mlx4_cleanup_eq_table(dev);
-
-	mlx4_unmap_catas_buf(dev);
-
 	mlx4_cleanup_mr_table(dev);
 	mlx4_cleanup_pd_table(dev);
 	mlx4_cleanup_uar_table(dev);
@@ -884,9 +878,6 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev)
 		mlx4_cleanup_cq_table(dev);
 		mlx4_cmd_use_polling(dev);
 		mlx4_cleanup_eq_table(dev);
-
-		mlx4_unmap_catas_buf(dev);
-
 		mlx4_cleanup_mr_table(dev);
 		mlx4_cleanup_pd_table(dev);
 
@@ -907,6 +898,12 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev)
 	}
 }
 
+int mlx4_restart_one(struct pci_dev *pdev)
+{
+	mlx4_remove_one(pdev);
+	return mlx4_init_one(pdev, NULL);
+}
+
 static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
@@ -927,6 +924,10 @@ static int __init mlx4_init(void)
 {
 	int ret;
 
+	ret = mlx4_catas_init();
+	if (ret)
+		return ret;
+
 	ret = pci_register_driver(&mlx4_driver);
 	return ret < 0 ? ret : 0;
 }
@@ -934,6 +935,7 @@ static int __init mlx4_init(void)
 static void __exit mlx4_cleanup(void)
 {
 	pci_unregister_driver(&mlx4_driver);
+	mlx4_catas_cleanup();
 }
 
 module_init(mlx4_init);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 3d3b6d2..d4e9111 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -247,7 +247,9 @@ struct mlx4_mcg_table {
 
 struct mlx4_catas_err {
 	u32 __iomem	       *map;
-	int			size;
+	u32			stop;
+	struct timer_list	timer;
+	struct list_head	list;
 };
 
 struct mlx4_priv {
@@ -310,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
 void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev);
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev);
-
+void mlx4_start_catas_poll(struct mlx4_dev *dev);
+void mlx4_stop_catas_poll(struct mlx4_dev *dev);
+int mlx4_catas_init(void);
+void mlx4_catas_cleanup(void);
+int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type,


From kliteyn at dev.mellanox.co.il  Mon Jul  9 01:31:34 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Jul 2007 11:31:34 +0300
Subject: [ofa-general] [PATCH 1/2] osm: updating doc with root and compute
 nodes options for fat-tree
Message-ID: <4691F266.9000505@dev.mellanox.co.il>

Hi Hal

This patch is purely cosmetic: it removes trailing blanks in doc files.

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/doc/current-routing.txt |  104 ++++++++++++++++++++--------------------
 opensm/man/opensm.8            |   36 +++++++-------
 2 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
index 737949e..9852ef0 100644
--- a/opensm/doc/current-routing.txt
+++ b/opensm/doc/current-routing.txt
@@ -3,17 +3,17 @@ Current OpenSM Routing
 
 OpenSM offers four routing engines:
 
-1.  Min Hop Algorithm - based on the minimum hops to each node where the 
+1.  Min Hop Algorithm - based on the minimum hops to each node where the
 path length is optimized.
 
-2.  UPDN Unicast routing algorithm - also based on the minimum hops to each 
-node, but it is constrained to ranking rules. This algorithm should be chosen 
-if the subnet is not a pure Fat Tree, and deadlock may occur due to a 
+2.  UPDN Unicast routing algorithm - also based on the minimum hops to each
+node, but it is constrained to ranking rules. This algorithm should be chosen
+if the subnet is not a pure Fat Tree, and deadlock may occur due to a
 loop in the subnet.
 
 3.  Fat-tree Unicast routing algorithm - this algorithm optimizes routing
-of fat-trees for congestion-free "shift" communication pattern. 
-It should be chosen if a subnet is a symmetrical fat-tree. 
+of fat-trees for congestion-free "shift" communication pattern.
+It should be chosen if a subnet is a symmetrical fat-tree.
 Similar to UPDN routing, Fat-tree routing is credit-loop-free.
 
 4. LASH unicast routing algorithm - uses Infiniband virtual layers
@@ -22,7 +22,7 @@ distributing the paths between layers. LASH is an alternative
 deadlock-free topology-agnostic routing algorithm to the non-minimal
 UPDN algorithm avoiding the use of a potentially congested root node.
 
-OpenSM also supports a file method which can load routes from a table. See 
+OpenSM also supports a file method which can load routes from a table. See
 modular-routing.txt for more information on this.
 
 The basic routing algorithm is comprised of two stages:
@@ -41,10 +41,10 @@ a decision is made as to what port should be used to get to that LID.
    This step is common to standard and Up/Down routing. Each port has a
 counter counting the number of target LIDs going through it.
    When there are multiple alternative ports with same MinHop to a LID,
-the one with less previously assigned ports is selected. 
-   If LMC > 0, more checks are added: Within each group of LIDs assigned to 
-same target port, 
-   a. use only ports which have same MinHop 
+the one with less previously assigned ports is selected.
+   If LMC > 0, more checks are added: Within each group of LIDs assigned to
+same target port,
+   a. use only ports which have same MinHop
    b. first prefer the ones that go to different systemImageGuid (then
 the previous LID of the same LMC group)
    c. if none - prefer those which go through another NodeGuid
@@ -65,15 +65,15 @@ the fabric switches unless the -r (--reassign_lids) option is specified.
           LID assignments resolving multiple use of same LID.
 
 If a link is added or removed, OpenSM does not recalculate
-the routes that do not have to change. A route has to change 
-if the port is no longer UP or no longer the MinHop. When routing changes 
+the routes that do not have to change. A route has to change
+if the port is no longer UP or no longer the MinHop. When routing changes
 are performed, the same algorithm for balancing the routes is invoked.
 
 In the case of using the file based routing, any topology changes are
-currently ignored The 'file' routing engine just loads the LFTs from the file 
-specified, with no reaction to real topology. Obviously, this will not be able 
-to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent 
-switches will be skipped. Multicast is not affected by 'file' routing engine 
+currently ignored The 'file' routing engine just loads the LFTs from the file
+specified, with no reaction to real topology. Obviously, this will not be able
+to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
+switches will be skipped. Multicast is not affected by 'file' routing engine
 (this uses min hop tables).
 
 
@@ -82,7 +82,7 @@ Min Hop Algorithm
 
 The Min Hop algorithm is invoked when neither UPDN or the file method are
 specified.
- 
+
 The Min Hop algorithm is divided into two stages: computation of
 min-hop tables on every switch and LFT output port assignment. Link
 subscription is also equalized with the ability to override based on
@@ -102,39 +102,39 @@ UPDN Routing Algorithm
 
 Purpose of UPDN Algorithm
 
-The UPDN algorithm is designed to prevent deadlocks from occurring in loops 
-of the subnet. A loop-deadlock is a situation in which it is no longer 
-possible to send data between any two hosts connected through the loop. As 
-such, the UPDN routing algorithm should be used if the subnet is not a pure 
-Fat Tree, and one of its loops may experience a deadlock (due, for example, 
+The UPDN algorithm is designed to prevent deadlocks from occurring in loops
+of the subnet. A loop-deadlock is a situation in which it is no longer
+possible to send data between any two hosts connected through the loop. As
+such, the UPDN routing algorithm should be used if the subnet is not a pure
+Fat Tree, and one of its loops may experience a deadlock (due, for example,
 to high pressure).
 
 The UPDN algorithm is based on the following main stages:
 
-1.  Auto-detect root nodes - based on the CA hop length from any switch in 
-the subnet, a statistical histogram is built for each switch (hop num vs 
+1.  Auto-detect root nodes - based on the CA hop length from any switch in
+the subnet, a statistical histogram is built for each switch (hop num vs
 number of occurrences). If the histogram reflects a specific column (higher
-than others) for a certain node, then it is marked as a root node. Since 
-the algorithm is statistical, it may not find any root nodes. The list of 
-the root nodes found by this auto-detect stage is used by the ranking 
+than others) for a certain node, then it is marked as a root node. Since
+the algorithm is statistical, it may not find any root nodes. The list of
+the root nodes found by this auto-detect stage is used by the ranking
 process stage.
 
     Note 1: The user can override the node list manually.
-    Note 2: If this stage cannot find any root nodes, and the user did not 
-            specify a guid list file, OpenSM defaults back to the Min Hop 
+    Note 2: If this stage cannot find any root nodes, and the user did not
+            specify a guid list file, OpenSM defaults back to the Min Hop
             routing algorithm.
 
-2.  Ranking process - All root switch nodes (found in stage 1) are assigned 
-a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the 
-subnet are ranked incrementally. This ranking aids in the process of enforcing 
+2.  Ranking process - All root switch nodes (found in stage 1) are assigned
+a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
+subnet are ranked incrementally. This ranking aids in the process of enforcing
 rules that ensure loop-free paths.
 
-3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from 
-each (CA or switch) node in the subnet. During the BFS process, the FDB table 
-of each switch node traversed by BFS is updated, in reference to the starting 
+3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from
+each (CA or switch) node in the subnet. During the BFS process, the FDB table
+of each switch node traversed by BFS is updated, in reference to the starting
 node, based on the ranking rules and guid values.
 
-At the end of the process, the updated FDB tables ensure loop-free paths 
+At the end of the process, the updated FDB tables ensure loop-free paths
 through the subnet.
 
 Note: Up/Down routing does not allow LID routing communication between
@@ -150,21 +150,21 @@ UPDN Algorithm Usage
 Activation through OpenSM
 
 Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
-Use `-a <guid_list_file>' for adding an UPDN guid file that contains the 
+Use `-a <guid_list_file>' for adding an UPDN guid file that contains the
 root nodes for ranking.
-If the `-a' option is not used, OpenSM uses its auto-detect root nodes 
+If the `-a' option is not used, OpenSM uses its auto-detect root nodes
 algorithm.
 
 Notes on the guid list file:
-1.   A valid guid file specifies one guid in each line. Lines with an invalid 
+1.   A valid guid file specifies one guid in each line. Lines with an invalid
 format will be discarded.
-2.   The user should specify the root switch guids. However, it is also 
-possible to specify CA guids; OpenSM will use the guid of the switch (if 
+2.   The user should specify the root switch guids. However, it is also
+possible to specify CA guids; OpenSM will use the guid of the switch (if
 it exists) that connects the CA to the subnet as a root node.
 
 
-To learn more about deadlock-free routing, see the article 
-"Deadlock Free Message Routing in Multiprocessor Interconnection Networks" 
+To learn more about deadlock-free routing, see the article
+"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
 by William J Dally and Charles L Seitz (1985).
 
 
@@ -173,9 +173,9 @@ Fat-tree Routing Algorithm
 
 Purpose:
 
-The fat-tree algorithm optimizes routing for "shift" communication pattern. 
+The fat-tree algorithm optimizes routing for "shift" communication pattern.
 It should be chosen if a subnet is a symmetrical fat-tree of various types.
-It supports not just K-ary-N-Trees, by handling for non-constant K, 
+It supports not just K-ary-N-Trees, by handling for non-constant K,
 cases where not all leafs (CAs) are present, any CBB ratio.
 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
 Fat-tree algorithm supports topologies that comply with the following rules:
@@ -190,16 +190,16 @@ Fat-tree algorithm supports topologies that comply with the following rules:
   - Switches of the same rank should have the same number
     of ports in each DOWN-going port group.
 *ports that are connected to the same remote switch are referenced as
-'port group'. 
+'port group'.
 
-Note that although fat-tree algorithm supports trees with non-integer CBB 
+Note that although fat-tree algorithm supports trees with non-integer CBB
 ratio, the routing will not be as balanced as in case of integer CBB ratio.
-In addition to this, although the algorithm allows leaf switches to have any 
+In addition to this, although the algorithm allows leaf switches to have any
 number of CAs, the closer the tree is to be fully populated, the more effective
 the "shift" communication pattern will be.
 
 The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
-same directory where the OpenSM log resides. This ordering file provides the 
+same directory where the OpenSM log resides. This ordering file provides the
 CA order that may be used to create efficient communication pattern, that
 will match the routing tables.
 
@@ -223,7 +223,7 @@ agnostic deadlock-free routing within communication networks.
 When computing the routing function, LASH analyzes the network
 topology for the shortest-path routes between all pairs of sources /
 destinations and groups these paths into virtual layers in such a way
-as to avoid deadlock. 
+as to avoid deadlock.
 
 Note LASH analyzes routes and ensures deadlock freedom between switch
 pairs. The link from HCA between and switch does not need virtual
@@ -254,7 +254,7 @@ available.
 
 In general LASH is a very flexible algorithm. It can, for example,
 reduce to Dimension Order Routing in certain topologies, it is topology
-agnostic and fares well in the face of faults. 
+agnostic and fares well in the face of faults.
 
 It has been shown that for both regular and irregular topologies, LASH
 outperforms Up/Down. The reason for this is that LASH distributes the
diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
index 00c7bbb..5f34cd1 100644
--- a/opensm/man/opensm.8
+++ b/opensm/man/opensm.8
@@ -1,7 +1,7 @@
 .TH OPENSM 8 "June 22, 2007" "OpenIB" "OpenIB Management"
 
 .SH NAME
-opensm \- InfiniBand subnet manager and administration (SM/SA) 
+opensm \- InfiniBand subnet manager and administration (SM/SA)
 
 .SH SYNOPSIS
 .B opensm
@@ -20,10 +20,10 @@ InfiniBand subnet).
 opensm also now contains an experimental version of a performance
 manager as well.
 
-opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB 
+opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
 fabric, initialize it, and sweep occasionally for changes.
 
-opensm attaches to a specific IB port on the local machine and configures only 
+opensm attaches to a specific IB port on the local machine and configures only
 the fabric connected to it. (If the local machine has other IB ports,
 opensm will ignore the fabrics connected to those other ports). If no port is
 specified, it will select the first "best" available port.
@@ -33,7 +33,7 @@ attach to.
 
 By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log.
 The first file will register only general major events, whereas the second
-will include details of reported errors. All errors reported in this second 
+will include details of reported errors. All errors reported in this second
 file should be treated as indicators of IB fabric health issues.
 (Note that when a fatal and non-recoverable error occurs, opensm will exit.)
 Both log files should include the message "SUBNET UP" if opensm was able to
@@ -75,7 +75,7 @@ one path between any two ports.
 \fB\-p\fR, \fB\-\-priority\fR
 This option specifies the SM\'s PRIORITY.
 This will effect the handover cases, where master
-is chosen by priority and GUID.  Range goes from 0 
+is chosen by priority and GUID.  Range goes from 0
 (default and lowest priority) to 15 (highest).
 .TP
 \fB\-smkey\fR
@@ -276,7 +276,7 @@ Display this usage info then exit.
 .PP
 The following environment variables control opensm behavior:
 
-OSM_TMP_DIR - controls the directory in which the temporary files generated by 
+OSM_TMP_DIR - controls the directory in which the temporary files generated by
 opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and
 opensm.mcfdbs. By default, this directory is /var/log.
 
@@ -350,11 +350,11 @@ defined in the IBTA specification (for example, mtu=4 for 2048).
 
 PortGUIDs list:
 
- PortGUID         - GUID of partition member EndPort. Hexadecimal 
-                    numbers should start from 0x, decimal numbers 
+ PortGUID         - GUID of partition member EndPort. Hexadecimal
+                    numbers should start from 0x, decimal numbers
                     are accepted too.
- full or limited  - indicates full or limited membership for this 
-                    port.  When omitted (or unrecognized) limited 
+ full or limited  - indicates full or limited membership for this
+                    port.  When omitted (or unrecognized) limited
                     membership is assumed.
 
 There are two useful keywords for PortGUID definition:
@@ -419,7 +419,7 @@ list of these parameters:
                   template
                   Both VL arbitration templates are pairs of
                   VL and weight
- qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is 
+ qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                   a list of VLs corresponding to SLs 0-15 (Note
                   that VL15 used here means drop this SL)
 
@@ -462,7 +462,7 @@ node, but it is constrained to ranking rules. This algorithm should be chosen
 if the subnet is not a pure Fat Tree, and deadlock may occur due to a
 loop in the subnet.
 
-3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing 
+3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing
 for congestion-free "shift" communication pattern.
 It should be chosen if a subnet is a symmetrical Fat Trees of various types,
 not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio.
@@ -660,7 +660,7 @@ Activation through OpenSM
 
 Use '-R ftree' option to activate the fat-tree algorithm.
 
-Note: LMC > 0 is not supported by fat-tree routing. If this is 
+Note: LMC > 0 is not supported by fat-tree routing. If this is
 specified, the default routing algorithm is invoked instead.
 
 
@@ -673,7 +673,7 @@ agnostic deadlock-free routing within communication networks.
 When computing the routing function, LASH analyzes the network
 topology for the shortest-path routes between all pairs of sources /
 destinations and groups these paths into virtual layers in such a way
-as to avoid deadlock. 
+as to avoid deadlock.
 
 Note LASH analyzes routes and ensures deadlock freedom between switch
 pairs. The link from HCA between and switch does not need virtual
@@ -704,7 +704,7 @@ available.
 
 In general LASH is a very flexible algorithm. It can, for example,
 reduce to Dimension Order Routing in certain topologies, it is topology
-agnostic and fares well in the face of faults. 
+agnostic and fares well in the face of faults.
 
 It has been shown that for both regular and irregular topologies, LASH
 outperforms Up/Down. The reason for this is that LASH distributes the
@@ -729,7 +729,7 @@ To learn more about deadlock-free routing, see the article
 "Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
 by William J Dally and Charles L Seitz (1985).
 
-To learn more about the up/down algorithm, see the article 
+To learn more about the up/down algorithm, see the article
 "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks"
 by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the
 Universidad Politécnica de Valencia.
@@ -786,7 +786,7 @@ To activate file based routing module, use:
 
   opensm -R file -U /path/to/dump_file
 
-If the dump_file is not found or is in error, the default routing 
+If the dump_file is not found or is in error, the default routing
 algorithm is utilized.
 
 The ability to dump switch lid matrices (aka min hops tables) to file and
@@ -816,7 +816,7 @@ Both or one of options -U and -M can be specified together with \'-R file\'.
 Hal Rosenstock
 .RI < halr at voltaire.com >
 .TP
-Sasha Khapyorsky 
+Sasha Khapyorsky
 .RI < sashak at voltaire.com >
 .TP
 Eitan Zahavi
-- 
1.5.1.4
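
The "Notes on the guid list file" text carried in this patch says that a valid file has one guid per line and that lines with an invalid format are discarded. A minimal Python sketch of that reading rule follows. It is not OpenSM's parser: the accepted format (a 0x prefix followed by 1 to 16 hex digits), the function name parse_guid_lines and the sample guids are all assumptions made for illustration.

```python
import re

# Assumption: a guid is written as "0x" followed by 1 to 16 hex digits.
# The documentation only says "one guid in each line", so this pattern
# is a guess at a reasonable format, not the format OpenSM enforces.
GUID_RE = re.compile(r"^0x[0-9a-fA-F]{1,16}$")


def parse_guid_lines(lines):
    """Return the guids found in `lines` (one per line) as integers.

    Lines that do not match the assumed format are silently discarded,
    mirroring the documented handling of invalid lines.
    """
    guids = []
    for line in lines:
        text = line.strip()
        if GUID_RE.match(text):
            guids.append(int(text, 16))
    return guids


# Hypothetical file content: two well-formed guids, one bad line, one blank.
sample = [
    "0x0002c90200000001\n",
    "not-a-guid\n",
    "\n",
    "0x0002c90200000002\n",
]
roots = parse_guid_lines(sample)
print(len(roots), "guids accepted")
```

Keeping the check in one regular expression makes it easy to adjust once the format accepted by the real parser is confirmed against the OpenSM sources.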


From kliteyn at dev.mellanox.co.il  Mon Jul  9 01:32:49 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 09 Jul 2007 11:32:49 +0300
Subject: [ofa-general] [PATCH 2/2] osm: updating doc with root and compute
 nodes options for fat-tree
Message-ID: <4691F2B1.4000803@dev.mellanox.co.il>

Hi Hal.

Updating the doc and the osm manpage to describe the
recent enhancements to fat-tree routing.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/doc/current-routing.txt |   28 ++++++++++++++++++++++------
 opensm/man/opensm.8            |   33 ++++++++++++++++++++++++++-------
 2 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
index 9852ef0..76f91ba 100644
--- a/opensm/doc/current-routing.txt
+++ b/opensm/doc/current-routing.txt
@@ -174,11 +174,14 @@ Fat-tree Routing Algorithm
 Purpose:
 
 The fat-tree algorithm optimizes routing for "shift" communication pattern.
-It should be chosen if a subnet is a symmetrical fat-tree of various types.
+It should be chosen if a subnet is a symmetrical or almost symmetrical
+fat-tree of various types.
 It supports not just K-ary-N-Trees, by handling for non-constant K,
 cases where not all leafs (CAs) are present, any CBB ratio.
 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
-Fat-tree algorithm supports topologies that comply with the following rules:
+
+If the root guid file is not provided ('-a' or '--root_guid_file' options),
+the topology has to be pure fat-tree that complies with the following rules:
   - Tree rank should be between two and eight (inclusively)
   - Switches of the same rank should have the same number
     of UP-going port groups*, unless they are root switches,
@@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules:
     of ports in each UP-going port group.
   - Switches of the same rank should have the same number
     of ports in each DOWN-going port group.
-*ports that are connected to the same remote switch are referenced as
+  - All the CAs have to be at the same tree level (rank).
+
+If the root guid file is provided, the topology doesn't have to be pure
+fat-tree, and it should only comply with the following rules:
+  - Tree rank should be between two and eight (inclusively)
+  - All the Compute Nodes** have to be at the same tree level (rank).
+    Note that non-compute node CAs are allowed here to be at different
+    tree ranks.
+
+* ports that are connected to the same remote switch are referenced as
 'port group'.
+** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file'
+OpenSM options.
 
 Note that although fat-tree algorithm supports trees with non-integer CBB
 ratio, the routing will not be as balanced as in case of integer CBB ratio.
 In addition to this, although the algorithm allows leaf switches to have any
 number of CAs, the closer the tree is to be fully populated, the more effective
 the "shift" communication pattern will be.
+In general, even if the root list is provided, the closer the topology to a
+pure and symmetrical fat-tree, the more optimal the routing will be.
 
-The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
-same directory where the OpenSM log resides. This ordering file provides the
-CA order that may be used to create efficient communication pattern, that
+The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
+in the same directory where the OpenSM log resides. This ordering file provides
+the CN order that may be used to create efficient communication pattern, that
 will match the routing tables.
 
 
diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
index 5f34cd1..5472faf 100644
--- a/opensm/man/opensm.8
+++ b/opensm/man/opensm.8
@@ -603,7 +603,7 @@ UPDN Algorithm Usage
 Activation through OpenSM
 
 Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
-Use '-a <guid_list_file>' for adding an UPDN guid file that contains the
+Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
 root nodes for ranking.
 If the `-a' option is not used, OpenSM uses its auto-detect root nodes
 algorithm.
@@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node.
 Fat-tree Routing Algorithm
 
 The fat-tree algorithm optimizes routing for "shift" communication pattern.
-It should be chosen if a subnet is a symmetrical fat-tree of various types.
+It should be chosen if a subnet is a symmetrical or almost symmetrical
+fat-tree of various types.
 It supports not just K-ary-N-Trees, by handling for non-constant K,
 cases where not all leafs (CAs) are present, any CBB ratio.
 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
 
-The Fat-tree algorithm supports topologies that comply with the following rules:
+If the root guid file is not provided ('-a' or '--root_guid_file' options),
+the topology has to be pure fat-tree that complies with the following rules:
   - Tree rank should be between two and eight (inclusively)
   - Switches of the same rank should have the same number
     of UP-going port groups*, unless they are root switches,
@@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules:
     of ports in each UP-going port group.
   - Switches of the same rank should have the same number
     of ports in each DOWN-going port group.
+  - All the CAs have to be at the same tree level (rank).
 
-Note: ports that are connected to the same remote switch are referenced as
+If the root guid file is provided, the topology doesn't have to be pure
+fat-tree, and it should only comply with the following rules:
+  - Tree rank should be between two and eight (inclusively)
+  - All the Compute Nodes** have to be at the same tree level (rank).
+    Note that non-compute node CAs are allowed here to be at different
+    tree ranks.
+
+* ports that are connected to the same remote switch are referenced as
 \'port group\'.
 
+** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
+OpenSM options.
+
 Topologies that do not comply cause a fallback to min hop routing.
 Note that this can also occur on link failures which cause the topology
 to no longer be "pure" fat-tree.
@@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio.
 In addition to this, although the algorithm allows leaf switches to have any
 number of CAs, the closer the tree is to be fully populated, the more
 effective the "shift" communication pattern will be.
+In general, even if the root list is provided, the closer the topology to a
+pure and symmetrical fat-tree, the more optimal the routing will be.
 
-The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
-same directory where the OpenSM log resides. This ordering file provides the
-CA order that may be used to create efficient communication pattern, that
+The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
+in the same directory where the OpenSM log resides. This ordering file provides
+the CN order that may be used to create efficient communication pattern, that
 will match the routing tables.
 
 Activation through OpenSM
 
 Use '-R ftree' option to activate the fat-tree algorithm.
+Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
+is not used, routing algorithm will detect roots automatically.
+Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
+is not used, all the CAs are considered as compute nodes.
 
 Note: LMC > 0 is not supported by fat-tree routing. If this is
 specified, the default routing algorithm is invoked instead.
-- 
1.5.1.4
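
When a root guid file is provided, the text added by this patch leaves only two rules: the tree rank must be between two and eight inclusive, and all compute nodes must sit at the same tree rank. The Python sketch below checks those two rules on already-collected data. It is an illustration only and not how OpenSM ranks a fabric: the function name check_ftree_rules, the dictionary layout and the guids are hypothetical.

```python
def check_ftree_rules(tree_rank, cn_ranks):
    """Check the two rules that apply when a root guid file is given.

    tree_rank -- rank of the tree (an integer)
    cn_ranks  -- dict mapping compute node guid -> rank of that node
                 (hypothetical layout, chosen for this example)

    Returns a list of human-readable problems; an empty list means
    both rules hold.
    """
    problems = []
    if not 2 <= tree_rank <= 8:
        problems.append("tree rank must be between two and eight (inclusive)")
    if len(set(cn_ranks.values())) > 1:
        problems.append("compute nodes are at different tree ranks")
    return problems


# Hypothetical fabric: three compute nodes, one of them one level off.
cn_ranks = {
    0x0002c90200000011: 3,
    0x0002c90200000012: 3,
    0x0002c90200000013: 2,
}
for problem in check_ftree_rules(3, cn_ranks):
    print("rule violated:", problem)
```

Non-compute-node CAs are deliberately left out of cn_ranks, since the documentation allows them to be at different tree ranks. According to the man page text, a topology that does not comply falls back to min hop routing.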


From vlad at lists.openfabrics.org  Mon Jul  9 02:00:50 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon,  9 Jul 2007 02:00:50 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070709-0200 daily build status
Message-ID: <20070709090100.16430E60844@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/~vlad/ofed_1_2/.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:

Failed:
Build failed on i686 with 2.6.15-23-server
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on i686 with linux-2.6.12
Build failed on i686 with linux-2.6.18
Build failed on i686 with linux-2.6.17
Build failed on i686 with linux-2.6.22-rc7
Build failed on i686 with linux-2.6.19
Build failed on i686 with linux-2.6.21.1
Build failed on i686 with linux-2.6.13
Build failed on i686 with linux-2.6.14
Build failed on i686 with linux-2.6.16
Build failed on i686 with linux-2.6.15
Build failed on powerpc with linux-2.6.19
Log:
Build failed on x86_64 with linux-2.6.20
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-8.el5
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.12
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-42.ELsmp
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-55.ELsmp
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.18-8.el5
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.16
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.5-7.244-smp
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.21.1
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-34.ELsmp
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.13
Log:
Build failed on x86_64 with linux-2.6.15
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.19
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.12
Log:
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.16.21-0.8-smp
Build failed on x86_64 with linux-2.6.19
Log:
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.17
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


Build failed on x86_64 with linux-2.6.16.43-0.3-smp
Build failed on ppc64 with linux-2.6.14
Log:
Log:
----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.13
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.13
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.16.21-0.8-default
Log:
Build failed on ppc64 with linux-2.6.16
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.16
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.18
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.18-1.2798.fc6
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.17
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.15
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.14
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.13
Log:
Build failed on powerpc with linux-2.6.17
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.12
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.15
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.21.1
Log:
Build failed on ia64 with linux-2.6.14
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.15
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on powerpc with linux-2.6.14
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.16
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.19
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ia64 with linux-2.6.18
Log:
Build failed on ia64 with linux-2.6.17
Build failed on ppc64 with linux-2.6.18
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------
Build failed on ppc64 with linux-2.6.12
Log:
    --with-cxgb3-mod    make CONFIG_INFINIBAND_CXGB3=m [no]
    --without-cxgb3-mod    [yes]

    --with-cxgb3_debug-mod    make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
    --without-cxgb3_debug-mod    [yes]

    --help - print out options


----------------------------------------------------------------------------------


From xhejtman at ics.muni.cz  Mon Jul  9 02:08:02 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 11:08:02 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adad4z2184r.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
Message-ID: <20070709090802.GA3885@ics.muni.cz>

On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote:
> Is the memory given to a domU always physically contiguous?  If not,
> what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
> and allocate 256 KB or something like that.  Let's assume that the
> domU kernel has enough guest contiguous pages to satisfy the
> allocation -- is there any guarantee that the pages are really
> physically contiguous?
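
For reference, the size in the question above follows from the order
argument: an order-6 request covers 2^6 = 64 pages. The sketch below does
that arithmetic in plain userspace C. The 4096-byte page size is an
assumption (other architectures may use larger pages), and order_to_bytes()
is a made-up helper name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by an order-N allocation: 2^order pages of page_size bytes. */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
	assert(order < 20);	/* keep the shift well inside size_t */
	return ((size_t)1 << order) * page_size;
}
```

With order 6 and 4096-byte pages this gives 64 * 4096 = 262144 bytes, which
is the 256 KB mentioned above.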

According to Xen-dev, alloc_pages does *not* guarantee contiguous pages. They
say that pci_alloc_consistent should be used instead. The question is whether
a non-Xen kernel *usually* allocates contiguous pages, which would explain why
it has worked so far, and whether this should be fixed in the mainline driver.

I will run some tests (and also try to figure out how to change alloc_pages to
pci_alloc_consistent) to verify whether the pages are contiguous.

Anyway, thanks a lot!!

-- 
Lukáš Hejtmánek


From muli at il.ibm.com  Mon Jul  9 02:12:10 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 9 Jul 2007 12:12:10 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709090802.GA3885@ics.muni.cz>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709090802.GA3885@ics.muni.cz>
Message-ID: <20070709091210.GP3182@rhun.haifa.ibm.com>

On Mon, Jul 09, 2007 at 11:08:02AM +0200, Lukas Hejtmanek wrote:
> On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote:
> > Is the memory given to a domU always physically contiguous?  If not,
> > what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
> > and allocate 256 KB or something like that.  Let's assume that the
> > domU kernel has enough guest contiguous pages to satisfy the
> > allocation -- is there any guarantee that the pages are really
> > physically contiguous?
> 
> according to Xen-dev alloc_pages does *not* guarantee contiguous
> pages. They say that the pci_alloc_consistent should be used
> instead. The question is whether non-Xen kernel *usually* allocates
> contiguous pages and so far it has been working and whether it
> should be fixed in the mainline of the driver.
> 
> I do some tests (and also try to figure out how to change
> alloc_pages to pci_alloc_consistent) to verify contiguous pages.

You missed an important bit of Keir's response---it's perfectly fine
to use alloc_pages provided you then use the dma_map_single API, which
for Xen dom0 will take care of bounce-buffering to a
machine-contiguous buffer if necessary. I am not sure if the same
holds for a domU kernel.
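
To make the "if necessary" part concrete, here is a toy userspace model,
not Xen or kernel code. It assumes each guest page is backed by some machine
frame number and that a bounce buffer is needed only when those frames do
not form one ascending run. The names machine_contiguous() and
needs_bounce() are invented for this sketch, and a real mapping layer may
bounce for other reasons as well (address limits, for instance).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: mfn[i] is the machine frame backing guest page i.
 * Returns true when the frames form one ascending run, i.e. the buffer
 * is contiguous from the device's point of view.
 */
static bool machine_contiguous(const unsigned long *mfn, size_t n)
{
	assert(mfn != NULL || n == 0);
	for (size_t i = 1; i < n; i++)
		if (mfn[i] != mfn[i - 1] + 1)
			return false;
	return true;
}

/* In this model a bounce buffer is needed exactly when the run is broken. */
static bool needs_bounce(const unsigned long *mfn, size_t n)
{
	return !machine_contiguous(mfn, n);
}
```

A buffer backed by frames 100, 101, 102, 103 would pass straight through,
while one backed by 100, 205, 101, 102 would be copied to a contiguous
bounce buffer in this model.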

Cheers,
Muli


From xhejtman at ics.muni.cz  Mon Jul  9 02:22:57 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 11:22:57 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709091210.GP3182@rhun.haifa.ibm.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709090802.GA3885@ics.muni.cz>
	<20070709091210.GP3182@rhun.haifa.ibm.com>
Message-ID: <20070709092257.GB3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 12:12:10PM +0300, Muli Ben-Yehuda wrote:
> You missed an important bit of Keir's response---it's perfectly fine
> to use alloc_pages provided you then use the dma_map_single API, which
> for Xen dom0 will take care of bounce-buffering to a
> machine-contiguous buffer if necessary. I am not sure if the same
> holds for a domU kernel.

I'm not familiar with this stuff, but dma_map_single is invoked via
pci_map_page, isn't it?

So alloc_pages followed by pci_map_page is OK. But mthca_memfree.c calls
alloc_pages and then uses pci_map_sg. Is that still OK?

-- 
Lukáš Hejtmánek


From muli at il.ibm.com  Mon Jul  9 02:30:59 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 9 Jul 2007 12:30:59 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709092257.GB3885@ics.muni.cz>
References: <adavecy4ysc.fsf@cisco.com> <20070705193136.GQ3885@ics.muni.cz>
	<adasl8232zv.fsf@cisco.com> <20070707085303.GS3885@ics.muni.cz>
	<ada3azz3ihr.fsf@cisco.com> <20070708001531.GT3885@ics.muni.cz>
	<adad4z2184r.fsf@cisco.com> <20070709090802.GA3885@ics.muni.cz>
	<20070709091210.GP3182@rhun.haifa.ibm.com>
	<20070709092257.GB3885@ics.muni.cz>
Message-ID: <20070709093059.GS3182@rhun.haifa.ibm.com>

On Mon, Jul 09, 2007 at 11:22:57AM +0200, Lukas Hejtmanek wrote:
> On Mon, Jul 09, 2007 at 12:12:10PM +0300, Muli Ben-Yehuda wrote:
> > You missed an important bit of Keir's response---it's perfectly fine
> > to use alloc_pages provided you then use the dma_map_single API, which
> > for Xen dom0 will take care of bounce-buffering to a
> > machine-contiguous buffer if necessary. I am not sure if the same
> > holds for a domU kernel.
> 
> I'm not familiar with this stuff but dma_map_single is invoked via
> pci_map_page, isn't it?

Depends on the specifics, but in general dma_map_single and
pci_map_page are both implemented in terms of the DMA-API.

> so alloc_pages and then pci_map_page is ok. But in mthca_memfree.c
> is alloc_pages and then pci_map_sg is used. Is it still OK?

Yes, same thing.

Cheers,
Muli


From halr at voltaire.com  Mon Jul  9 04:01:23 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 07:01:23 -0400
Subject: [ofa-general] Re: [PATCH 2/2] osm: updating doc with root and
	compute nodes options for fat-tree
In-Reply-To: <4691F2B1.4000803@dev.mellanox.co.il>
References: <4691F2B1.4000803@dev.mellanox.co.il>
Message-ID: <1183978786.25217.366108.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-07-09 at 04:32, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Updating doc and osm manpage with the 
> recent enhancement of fat-tree routing.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From fenkes at de.ibm.com  Mon Jul  9 06:02:21 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:02:21 +0200
Subject: [ofa-general] [PATCH 00/13] IB/ehca: eHCA2 enablement & some fixes
Message-ID: <200707091502.22407.fenkes@de.ibm.com>

This patch series enables the eHCA device driver to support new functions of
the eHCA2 chip. In addition, there are some bug fixes, code optimizations
and general new features included. Another set of patches will follow.

The patches, in detail, are:

[01/13] fixes a wrong parameter description
[02/13] adds HW capabilities autodetection
[03/13] restructures the QP code, preparing for Shared Receive Queues (SRQ)
[04/13] adds SRQ support
[05/13] adds support for UD low latency QPs
[06/13] sets a flag that needs to be set on eHCA2
[07/13] adds RDMA atomic attributes to the data returned by query_qp()
[08/13] straightens out lock flag naming and adds static initializers
[09/13] refactors synchronization between completions and destroy_cq()
[10/13] changes the global idr spinlocks into rwlocks
[11/13] returns the QP pointer in poll_cq() instead of NULL
[12/13] adds notifications in case the SM LID etc. changes
[13/13] adds a slight latency improvement

The patches should apply cleanly, in order, against Roland's git. Please
review the changes and apply the patches for 2.6.23 if they are okay.

Regards,
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com


From fenkes at de.ibm.com  Mon Jul  9 06:20:55 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:20:55 +0200
Subject: [ofa-general] [PATCH 01/13] IB/ehca: change scaling_code parameter
	description to match default value
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091520.56294.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index c3f99f3..fea199f 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -94,7 +94,7 @@ MODULE_PARM_DESC(poll_all_eqs,
 MODULE_PARM_DESC(static_rate,
 		 "set permanent static rate (default: disabled)");
 MODULE_PARM_DESC(scaling_code,
-		 "set scaling code (0: disabled, 1: enabled/default)");
+		 "set scaling code (0: disabled/default, 1: enabled)");
 
 spinlock_t ehca_qp_idr_lock;
 spinlock_t ehca_cq_idr_lock;
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:21:45 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:21:45 +0200
Subject: [ofa-general] [PATCH 02/13] IB/ehca: HW level,
	HW caps and MTU autodetection
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091521.46883.fenkes@de.ibm.com>

In preparation for support of new eHCA2 features, change adapter probing:
 - Hardware level is changed to encode major and minor chip version
 - Hardware capabilities are queried from the firmware
 - The maximum MTU is queried from the firmware instead of assuming a
   fixed value
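
As an illustration of the new hardware level encoding, the version checks
in ehca_sense_attributes() can be restated as a standalone userspace
function. The values are the ones used in the diff; the helper names
ehca_hw_level() and hw_level_major() are invented for this sketch, and
reading the high nibble as the major chip version is an interpretation of
those values rather than something the patch states explicitly.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the version checks in the patch.
 * The returned values look like <major chip version><level> in hex.
 */
static int ehca_hw_level(unsigned int hcaaver, unsigned int revid)
{
	if (hcaaver == 1 && revid == 0)
		return 0x11;
	if (hcaaver == 1 && revid == 1)
		return 0x12;
	if (hcaaver == 1 && revid == 2)
		return 0x13;
	if (hcaaver == 2 && revid == 0)
		return 0x21;
	if (hcaaver == 2 && revid == 0x10)
		return 0x22;
	return 0x22;	/* unknown version: the patch warns and assumes 0x22 */
}

/* Assumed reading of the encoding: the high nibble is the major version. */
static unsigned int hw_level_major(int hw_level)
{
	assert(hw_level >= 0);
	return ((unsigned int)hw_level >> 4) & 0xf;
}
```

Under that reading, a check such as hw_level_major(level) >= 2 would be one
way to tell eHCA2 from eHCA1, though the driver itself may compare
hw_level directly.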

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_av.c      |    6 ++-
 drivers/infiniband/hw/ehca/ehca_classes.h |    2 +
 drivers/infiniband/hw/ehca/ehca_hca.c     |   27 +++++++++++-
 drivers/infiniband/hw/ehca/ehca_main.c    |   62 ++++++++++++++++++++++++++---
 drivers/infiniband/hw/ehca/hipz_hw.h      |   18 ++++++++
 5 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
index 0d6e2c4..3cd6bf3 100644
--- a/drivers/infiniband/hw/ehca/ehca_av.c
+++ b/drivers/infiniband/hw/ehca/ehca_av.c
@@ -118,7 +118,7 @@ struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 		}
 		memcpy(&av->av.grh.word_1, &gid, sizeof(gid));
 	}
-	av->av.pmtu = EHCA_MAX_MTU;
+	av->av.pmtu = shca->max_mtu;
 
 	/* dgid comes in grh.word_3 */
 	memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid,
@@ -137,6 +137,8 @@ int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
 	struct ehca_av *av;
 	struct ehca_ud_av new_ehca_av;
 	struct ehca_pd *my_pd = container_of(ah->pd, struct ehca_pd, ib_pd);
+	struct ehca_shca *shca = container_of(ah->pd->device, struct ehca_shca,
+					      ib_device);
 	u32 cur_pid = current->tgid;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
@@ -192,7 +194,7 @@ int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
 		memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid));
 	}
 
-	new_ehca_av.pmtu = EHCA_MAX_MTU;
+	new_ehca_av.pmtu = shca->max_mtu;
 
 	memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid,
 	       sizeof(ah_attr->grh.dgid));
diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 1d286d3..35d948f 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -107,6 +107,8 @@ struct ehca_shca {
 	struct ehca_pd *pd;
 	struct h_galpas galpas;
 	struct mutex modify_mutex;
+	u64 hca_cap;
+	int max_mtu;
 };
 
 struct ehca_pd {
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 32b55a4..b310de5 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -45,11 +45,25 @@
 
 int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 {
-	int ret = 0;
+	int i, ret = 0;
 	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca,
 					      ib_device);
 	struct hipz_query_hca *rblock;
 
+	static const u32 cap_mapping[] = {
+		IB_DEVICE_RESIZE_MAX_WR,      HCA_CAP_WQE_RESIZE,
+		IB_DEVICE_BAD_PKEY_CNTR,      HCA_CAP_BAD_P_KEY_CTR,
+		IB_DEVICE_BAD_QKEY_CNTR,      HCA_CAP_Q_KEY_VIOL_CTR,
+		IB_DEVICE_RAW_MULTI,          HCA_CAP_RAW_PACKET_MCAST,
+		IB_DEVICE_AUTO_PATH_MIG,      HCA_CAP_AUTO_PATH_MIG,
+		IB_DEVICE_CHANGE_PHY_PORT,    HCA_CAP_SQD_RTS_PORT_CHANGE,
+		IB_DEVICE_UD_AV_PORT_ENFORCE, HCA_CAP_AH_PORT_NR_CHECK,
+		IB_DEVICE_CURR_QP_STATE_MOD,  HCA_CAP_CUR_QP_STATE_MOD,
+		IB_DEVICE_SHUTDOWN_PORT,      HCA_CAP_SHUTDOWN_PORT,
+		IB_DEVICE_INIT_TYPE,          HCA_CAP_INIT_TYPE,
+		IB_DEVICE_PORT_ACTIVE_EVENT,  HCA_CAP_PORT_ACTIVE_EVENT,
+	};
+
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
 		ehca_err(&shca->ib_device, "Can't allocate rblock memory.");
@@ -96,6 +110,13 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->max_total_mcast_qp_attach
 		= min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX);
 
+	/* translate device capabilities */
+	props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID |
+		IB_DEVICE_RC_RNR_NAK_GEN | IB_DEVICE_N_NOTIFY_CQ;
+	for (i = 0; i < ARRAY_SIZE(cap_mapping); i += 2)
+		if (rblock->hca_cap_indicators & cap_mapping[i + 1])
+			props->device_cap_flags |= cap_mapping[i];
+
 query_device1:
 	ehca_free_fw_ctrlblock(rblock);
 
@@ -261,7 +282,7 @@ int ehca_modify_port(struct ib_device *ibdev,
 	}
 
 	if (mutex_lock_interruptible(&shca->modify_mutex))
-                return -ERESTARTSYS;
+		return -ERESTARTSYS;
 
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
@@ -290,7 +311,7 @@ modify_port2:
 	ehca_free_fw_ctrlblock(rblock);
 
 modify_port1:
-        mutex_unlock(&shca->modify_mutex);
+	mutex_unlock(&shca->modify_mutex);
 
 	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index fea199f..befbb9c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -205,11 +205,35 @@ static void ehca_destroy_slab_caches(void)
 #define EHCA_HCAAVER  EHCA_BMASK_IBM(32,39)
 #define EHCA_REVID    EHCA_BMASK_IBM(40,63)
 
+static struct cap_descr {
+	u64 mask;
+	char *descr;
+} hca_cap_descr[] = {
+	{ HCA_CAP_AH_PORT_NR_CHECK, "HCA_CAP_AH_PORT_NR_CHECK" },
+	{ HCA_CAP_ATOMIC, "HCA_CAP_ATOMIC" },
+	{ HCA_CAP_AUTO_PATH_MIG, "HCA_CAP_AUTO_PATH_MIG" },
+	{ HCA_CAP_BAD_P_KEY_CTR, "HCA_CAP_BAD_P_KEY_CTR" },
+	{ HCA_CAP_SQD_RTS_PORT_CHANGE, "HCA_CAP_SQD_RTS_PORT_CHANGE" },
+	{ HCA_CAP_CUR_QP_STATE_MOD, "HCA_CAP_CUR_QP_STATE_MOD" },
+	{ HCA_CAP_INIT_TYPE, "HCA_CAP_INIT_TYPE" },
+	{ HCA_CAP_PORT_ACTIVE_EVENT, "HCA_CAP_PORT_ACTIVE_EVENT" },
+	{ HCA_CAP_Q_KEY_VIOL_CTR, "HCA_CAP_Q_KEY_VIOL_CTR" },
+	{ HCA_CAP_WQE_RESIZE, "HCA_CAP_WQE_RESIZE" },
+	{ HCA_CAP_RAW_PACKET_MCAST, "HCA_CAP_RAW_PACKET_MCAST" },
+	{ HCA_CAP_SHUTDOWN_PORT, "HCA_CAP_SHUTDOWN_PORT" },
+	{ HCA_CAP_RC_LL_QP, "HCA_CAP_RC_LL_QP" },
+	{ HCA_CAP_SRQ, "HCA_CAP_SRQ" },
+	{ HCA_CAP_UD_LL_QP, "HCA_CAP_UD_LL_QP" },
+	{ HCA_CAP_RESIZE_MR, "HCA_CAP_RESIZE_MR" },
+	{ HCA_CAP_MINI_QP, "HCA_CAP_MINI_QP" },
+};
+
 int ehca_sense_attributes(struct ehca_shca *shca)
 {
-	int ret = 0;
+	int i, ret = 0;
 	u64 h_ret;
 	struct hipz_query_hca *rblock;
+	struct hipz_query_port *port;
 
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
@@ -222,7 +246,7 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		ehca_gen_err("Cannot query device properties. h_ret=%lx",
 			     h_ret);
 		ret = -EPERM;
-		goto num_ports1;
+		goto sense_attributes1;
 	}
 
 	if (ehca_nr_ports == 1)
@@ -242,18 +266,44 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		ehca_gen_dbg(" ... hardware version=%x:%x", hcaaver, revid);
 
 		if ((hcaaver == 1) && (revid == 0))
-			shca->hw_level = 0;
+			shca->hw_level = 0x11;
 		else if ((hcaaver == 1) && (revid == 1))
-			shca->hw_level = 1;
+			shca->hw_level = 0x12;
 		else if ((hcaaver == 1) && (revid == 2))
-			shca->hw_level = 2;
+			shca->hw_level = 0x13;
+		else if ((hcaaver == 2) && (revid == 0))
+			shca->hw_level = 0x21;
+		else if ((hcaaver == 2) && (revid == 0x10))
+			shca->hw_level = 0x22;
+		else {
+			ehca_gen_warn("unknown hardware version"
+				      " - assuming default level");
+			shca->hw_level = 0x22;
+		}
 	}
 	ehca_gen_dbg(" ... hardware level=%x", shca->hw_level);
 
 	shca->sport[0].rate = IB_RATE_30_GBPS;
 	shca->sport[1].rate = IB_RATE_30_GBPS;
 
-num_ports1:
+	shca->hca_cap = rblock->hca_cap_indicators;
+	ehca_gen_dbg(" ... HCA capabilities:");
+	for (i = 0; i < ARRAY_SIZE(hca_cap_descr); i++)
+		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
+			ehca_gen_dbg("   %s", hca_cap_descr[i].descr);
+
+	port = (struct hipz_query_port*)rblock;
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
+	if (h_ret != H_SUCCESS) {
+		ehca_gen_err("Cannot query port properties. h_ret=%lx",
+			     h_ret);
+		ret = -EPERM;
+		goto sense_attributes1;
+	}
+
+	shca->max_mtu = port->max_mtu;
+
+sense_attributes1:
 	ehca_free_fw_ctrlblock(rblock);
 	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h
index fad9136..9fe8367 100644
--- a/drivers/infiniband/hw/ehca/hipz_hw.h
+++ b/drivers/infiniband/hw/ehca/hipz_hw.h
@@ -360,6 +360,24 @@ struct hipz_query_hca {
 	u32 max_neq;
 } __attribute__ ((packed));
 
+#define HCA_CAP_AH_PORT_NR_CHECK      EHCA_BMASK_IBM(0,0)
+#define HCA_CAP_ATOMIC                EHCA_BMASK_IBM(1,1)
+#define HCA_CAP_AUTO_PATH_MIG         EHCA_BMASK_IBM(2,2)
+#define HCA_CAP_BAD_P_KEY_CTR         EHCA_BMASK_IBM(3,3)
+#define HCA_CAP_SQD_RTS_PORT_CHANGE   EHCA_BMASK_IBM(4,4)
+#define HCA_CAP_CUR_QP_STATE_MOD      EHCA_BMASK_IBM(5,5)
+#define HCA_CAP_INIT_TYPE             EHCA_BMASK_IBM(6,6)
+#define HCA_CAP_PORT_ACTIVE_EVENT     EHCA_BMASK_IBM(7,7)
+#define HCA_CAP_Q_KEY_VIOL_CTR        EHCA_BMASK_IBM(8,8)
+#define HCA_CAP_WQE_RESIZE            EHCA_BMASK_IBM(9,9)
+#define HCA_CAP_RAW_PACKET_MCAST      EHCA_BMASK_IBM(10,10)
+#define HCA_CAP_SHUTDOWN_PORT         EHCA_BMASK_IBM(11,11)
+#define HCA_CAP_RC_LL_QP              EHCA_BMASK_IBM(12,12)
+#define HCA_CAP_SRQ                   EHCA_BMASK_IBM(13,13)
+#define HCA_CAP_UD_LL_QP              EHCA_BMASK_IBM(16,16)
+#define HCA_CAP_RESIZE_MR             EHCA_BMASK_IBM(17,17)
+#define HCA_CAP_MINI_QP               EHCA_BMASK_IBM(18,18)
+
 /* query port response block */
 struct hipz_query_port {
 	u32 state;
-- 
1.5.2
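
The capability translation added to ehca_query_device() above walks a flat
table of (IB flag, firmware capability) pairs. A minimal userspace sketch of
that pattern follows. The bit values are placeholders chosen for the
example, not the real IB_DEVICE_* or HCA_CAP_* constants, and
translate_caps() is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values, NOT the real IB_DEVICE_* / HCA_CAP_* constants. */
enum { FW_CAP_A = 1u << 0, FW_CAP_B = 1u << 1, FW_CAP_C = 1u << 2 };
enum { IB_FLAG_A = 1u << 8, IB_FLAG_B = 1u << 9, IB_FLAG_C = 1u << 10 };

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Flat table of (IB flag, firmware capability) pairs, as in the patch. */
static const uint32_t cap_mapping[] = {
	IB_FLAG_A, FW_CAP_A,
	IB_FLAG_B, FW_CAP_B,
	IB_FLAG_C, FW_CAP_C,
};

/* Set every IB flag whose firmware capability bit is present. */
static uint32_t translate_caps(uint64_t fw_caps, uint32_t base_flags)
{
	uint32_t flags = base_flags;
	size_t i;

	assert(ARRAY_SIZE(cap_mapping) % 2 == 0);	/* entries come in pairs */
	for (i = 0; i < ARRAY_SIZE(cap_mapping); i += 2)
		if (fw_caps & cap_mapping[i + 1])
			flags |= cap_mapping[i];
	return flags;
}
```

The flat array keeps the table compact, at the cost of relying on the
even/odd index convention. An array of two-member structs would make the
pairing explicit.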


From fenkes at de.ibm.com  Mon Jul  9 06:23:15 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:23:15 +0200
Subject: [ofa-general] [PATCH 03/13] IB/ehca: QP code restructuring in
	preparation for SRQ
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091523.16498.fenkes@de.ibm.com>

- Replace init_qp_queues() by a shorter init_qp_queue(), eliminating
  duplicate code.

- hipz_h_alloc_resource_qp() doesn't need a pointer to struct ehca_qp any
  longer. All input and output data is transferred through the parms
  parameter.

- Change the interface to also support SRQ.
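
For readers following the ehca_create_qp() hunk below, the way the QP type
value is split into an LLQP marker, completion flags, and the base IB QP
type can be sketched in userspace like this. The masks 0x80, 0x60, and 0x1F
are taken from the diff; the macro names LLQP_FLAG and QP_TYPE_MASK, the
struct, and decode_qp_type() are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define LLQP_FLAG      0x80	/* marks a low-latency QP (value from the patch) */
#define LLQP_COMP_MASK 0x60	/* LLQP_SEND_COMP | LLQP_RECV_COMP */
#define QP_TYPE_MASK   0x1F	/* base IB QP type */

struct decoded_qp_type {	/* hypothetical container for the three parts */
	bool is_llqp;
	int ll_comp_flags;
	int base_type;
};

static struct decoded_qp_type decode_qp_type(int qp_type)
{
	struct decoded_qp_type d = { false, 0, 0 };

	assert(qp_type >= 0);
	if (qp_type & LLQP_FLAG) {
		/* completion flags are only recorded for LLQPs */
		d.is_llqp = true;
		d.ll_comp_flags = qp_type & LLQP_COMP_MASK;
	}
	d.base_type = qp_type & QP_TYPE_MASK;
	return d;
}
```

As in the patch, the completion bits are ignored unless the LLQP marker is
set, and the caller's value is decoded into locals rather than modified in
place.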

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |   46 +++++-
 drivers/infiniband/hw/ehca/ehca_qp.c      |  254 +++++++++++++----------------
 drivers/infiniband/hw/ehca/hcp_if.c       |   35 ++---
 drivers/infiniband/hw/ehca/hcp_if.h       |    1 -
 4 files changed, 166 insertions(+), 170 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 35d948f..6e75db6 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -322,14 +322,49 @@ struct ehca_alloc_cq_parms {
 	struct ipz_eq_handle eq_handle;
 };
 
+enum ehca_service_type {
+	ST_RC  = 0,
+	ST_UC  = 1,
+	ST_RD  = 2,
+	ST_UD  = 3,
+};
+
+enum ehca_ext_qp_type {
+	EQPT_NORMAL    = 0,
+	EQPT_LLQP      = 1,
+	EQPT_SRQBASE   = 2,
+	EQPT_SRQ       = 3,
+};
+
+enum ehca_ll_comp_flags {
+	LLQP_SEND_COMP = 0x20,
+	LLQP_RECV_COMP = 0x40,
+	LLQP_COMP_MASK = 0x60,
+};
+
 struct ehca_alloc_qp_parms {
-	int servicetype;
+/* input parameters */
+	enum ehca_service_type servicetype;
 	int sigtype;
-	int daqp_ctrl;
-	int max_send_sge;
-	int max_recv_sge;
+	enum ehca_ext_qp_type ext_type;
+	enum ehca_ll_comp_flags ll_comp_flags;
+
+	int max_send_wr, max_recv_wr;
+	int max_send_sge, max_recv_sge;
 	int ud_av_l_key_ctl;
 
+	u32 token;
+	struct ipz_eq_handle eq_handle;
+	struct ipz_pd pd;
+	struct ipz_cq_handle send_cq_handle, recv_cq_handle;
+
+	u32 srq_qpn, srq_token, srq_limit;
+
+/* output parameters */
+	u32 real_qp_num;
+	struct ipz_qp_handle qp_handle;
+	struct h_galpas galpas;
+
 	u16 act_nr_send_wqes;
 	u16 act_nr_recv_wqes;
 	u8  act_nr_recv_sges;
@@ -337,9 +372,6 @@ struct ehca_alloc_qp_parms {
 
 	u32 nr_rq_pages;
 	u32 nr_sq_pages;
-
-	struct ipz_eq_handle ipz_eq_handle;
-	struct ipz_pd pd;
 };
 
 int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index b5bc787..ec1d555 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -234,13 +234,6 @@ static inline enum ib_qp_statetrans get_modqp_statetrans(int ib_fromstate,
 	return index;
 }
 
-enum ehca_service_type {
-	ST_RC = 0,
-	ST_UC = 1,
-	ST_RD = 2,
-	ST_UD = 3
-};
-
 /*
  * ibqptype2servicetype returns hcp service type corresponding to given
  * ib qp type used by create_qp()
@@ -268,15 +261,16 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype)
 }
 
 /*
- * init_qp_queues initializes/constructs r/squeue and registers queue pages.
+ * init_qp_queue initializes/constructs r/squeue and registers queue pages.
  */
-static inline int init_qp_queues(struct ehca_shca *shca,
-				 struct ehca_qp *my_qp,
-				 int nr_sq_pages,
-				 int nr_rq_pages,
-				 int swqe_size,
-				 int rwqe_size,
-				 int nr_send_sges, int nr_receive_sges)
+static inline int init_qp_queue(struct ehca_shca *shca,
+				struct ehca_qp *my_qp,
+				struct ipz_queue *queue,
+				int q_type,
+				u64 expected_hret,
+				int nr_q_pages,
+				int wqe_size,
+				int nr_sges)
 {
 	int ret, cnt, ipz_rc;
 	void *vpage;
@@ -284,104 +278,63 @@ static inline int init_qp_queues(struct ehca_shca *shca,
 	struct ib_device *ib_dev = &shca->ib_device;
 	struct ipz_adapter_handle ipz_hca_handle = shca->ipz_hca_handle;
 
-	ipz_rc = ipz_queue_ctor(&my_qp->ipz_squeue,
-				nr_sq_pages,
-				EHCA_PAGESIZE, swqe_size, nr_send_sges);
+	if (!nr_q_pages)
+		return 0;
+
+	ipz_rc = ipz_queue_ctor(queue, nr_q_pages, EHCA_PAGESIZE,
+				wqe_size, nr_sges);
 	if (!ipz_rc) {
-		ehca_err(ib_dev,"Cannot allocate page for squeue. ipz_rc=%x",
+		ehca_err(ib_dev,"Cannot allocate page for queue. ipz_rc=%x",
 			 ipz_rc);
 		return -EBUSY;
 	}
 
-	ipz_rc = ipz_queue_ctor(&my_qp->ipz_rqueue,
-				nr_rq_pages,
-				EHCA_PAGESIZE, rwqe_size, nr_receive_sges);
-	if (!ipz_rc) {
-		ehca_err(ib_dev, "Cannot allocate page for rqueue. ipz_rc=%x",
-			 ipz_rc);
-		ret = -EBUSY;
-		goto init_qp_queues0;
-	}
-	/* register SQ pages */
-	for (cnt = 0; cnt < nr_sq_pages; cnt++) {
-		vpage = ipz_qpageit_get_inc(&my_qp->ipz_squeue);
+	/* register queue pages */
+	for (cnt = 0; cnt < nr_q_pages; cnt++) {
+		vpage = ipz_qpageit_get_inc(queue);
 		if (!vpage) {
-			ehca_err(ib_dev, "SQ ipz_qpageit_get_inc() "
+			ehca_err(ib_dev, "ipz_qpageit_get_inc() "
 				 "failed p_vpage= %p", vpage);
 			ret = -EINVAL;
-			goto init_qp_queues1;
+			goto init_qp_queue1;
 		}
 		rpage = virt_to_abs(vpage);
 
 		h_ret = hipz_h_register_rpage_qp(ipz_hca_handle,
 						 my_qp->ipz_qp_handle,
-						 &my_qp->pf, 0, 0,
+						 NULL, 0, q_type,
 						 rpage, 1,
 						 my_qp->galpas.kernel);
-		if (h_ret < H_SUCCESS) {
-			ehca_err(ib_dev, "SQ hipz_qp_register_rpage()"
-				 " failed rc=%lx", h_ret);
-			ret = ehca2ib_return_code(h_ret);
-			goto init_qp_queues1;
-		}
-	}
-
-	ipz_qeit_reset(&my_qp->ipz_squeue);
-
-	/* register RQ pages */
-	for (cnt = 0; cnt < nr_rq_pages; cnt++) {
-		vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue);
-		if (!vpage) {
-			ehca_err(ib_dev, "RQ ipz_qpageit_get_inc() "
-				 "failed p_vpage = %p", vpage);
-			ret = -EINVAL;
-			goto init_qp_queues1;
-		}
-
-		rpage = virt_to_abs(vpage);
-
-		h_ret = hipz_h_register_rpage_qp(ipz_hca_handle,
-						 my_qp->ipz_qp_handle,
-						 &my_qp->pf, 0, 1,
-						 rpage, 1,my_qp->galpas.kernel);
-		if (h_ret < H_SUCCESS) {
-			ehca_err(ib_dev, "RQ hipz_qp_register_rpage() failed "
-				 "rc=%lx", h_ret);
-			ret = ehca2ib_return_code(h_ret);
-			goto init_qp_queues1;
-		}
-		if (cnt == (nr_rq_pages - 1)) {	/* last page! */
-			if (h_ret != H_SUCCESS) {
-				ehca_err(ib_dev, "RQ hipz_qp_register_rpage() "
+		if (cnt == (nr_q_pages - 1)) {	/* last page! */
+			if (h_ret != expected_hret) {
+				ehca_err(ib_dev, "hipz_qp_register_rpage() "
 					 "h_ret= %lx ", h_ret);
 				ret = ehca2ib_return_code(h_ret);
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 			vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue);
 			if (vpage) {
 				ehca_err(ib_dev, "ipz_qpageit_get_inc() "
 					 "should not succeed vpage=%p", vpage);
 				ret = -EINVAL;
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 		} else {
 			if (h_ret != H_PAGE_REGISTERED) {
-				ehca_err(ib_dev, "RQ hipz_qp_register_rpage() "
+				ehca_err(ib_dev, "hipz_qp_register_rpage() "
 					 "h_ret= %lx ", h_ret);
 				ret = ehca2ib_return_code(h_ret);
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 		}
 	}
 
-	ipz_qeit_reset(&my_qp->ipz_rqueue);
+	ipz_qeit_reset(queue);
 
 	return 0;
 
-init_qp_queues1:
-	ipz_queue_dtor(&my_qp->ipz_rqueue);
-init_qp_queues0:
-	ipz_queue_dtor(&my_qp->ipz_squeue);
+init_qp_queue1:
+	ipz_queue_dtor(queue);
 	return ret;
 }
 
@@ -397,14 +350,17 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 					      ib_device);
 	struct ib_ucontext *context = NULL;
 	u64 h_ret;
-	int max_send_sge, max_recv_sge, ret;
+	int is_llqp = 0, has_srq = 0;
+	int qp_type, max_send_sge, max_recv_sge, ret;
 
 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
 	u32 swqe_size = 0, rwqe_size = 0;
-	u8 daqp_completion, isdaqp;
 	unsigned long flags;
 
+	memset(&parms, 0, sizeof(parms));
+	qp_type = init_attr->qp_type;
+
 	if (init_attr->sq_sig_type != IB_SIGNAL_REQ_WR &&
 		init_attr->sq_sig_type != IB_SIGNAL_ALL_WR) {
 		ehca_err(pd->device, "init_attr->sg_sig_type=%x not allowed",
@@ -412,38 +368,47 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		return ERR_PTR(-EINVAL);
 	}
 
-	/* save daqp completion bits */
-	daqp_completion = init_attr->qp_type & 0x60;
-	/* save daqp bit */
-	isdaqp = (init_attr->qp_type & 0x80) ? 1 : 0;
-	init_attr->qp_type = init_attr->qp_type & 0x1F;
+	/* save LLQP info */
+	if (qp_type & 0x80) {
+		is_llqp = 1;
+		parms.ext_type = EQPT_LLQP;
+		parms.ll_comp_flags = qp_type & LLQP_COMP_MASK;
+	}
+	qp_type &= 0x1F;
+
+	/* check for SRQ */
+	has_srq = !!(init_attr->srq);
+	if (is_llqp && has_srq) {
+		ehca_err(pd->device, "LLQPs can't have an SRQ");
+		return ERR_PTR(-EINVAL);
+	}
 
-	if (init_attr->qp_type != IB_QPT_UD &&
-	    init_attr->qp_type != IB_QPT_SMI &&
-	    init_attr->qp_type != IB_QPT_GSI &&
-	    init_attr->qp_type != IB_QPT_UC &&
-	    init_attr->qp_type != IB_QPT_RC) {
-		ehca_err(pd->device, "wrong QP Type=%x", init_attr->qp_type);
+	/* check QP type */
+	if (qp_type != IB_QPT_UD &&
+	    qp_type != IB_QPT_UC &&
+	    qp_type != IB_QPT_RC &&
+	    qp_type != IB_QPT_SMI &&
+	    qp_type != IB_QPT_GSI) {
+		ehca_err(pd->device, "wrong QP Type=%x", qp_type);
 		return ERR_PTR(-EINVAL);
 	}
-	if ((init_attr->qp_type != IB_QPT_RC && init_attr->qp_type != IB_QPT_UD)
-	    && isdaqp) {
-		ehca_err(pd->device, "unsupported LL QP Type=%x",
-			 init_attr->qp_type);
+
+	if (is_llqp && (qp_type != IB_QPT_RC && qp_type != IB_QPT_UD)) {
+		ehca_err(pd->device, "unsupported LL QP Type=%x", qp_type);
 		return ERR_PTR(-EINVAL);
-	} else if (init_attr->qp_type == IB_QPT_RC && isdaqp &&
+	} else if (is_llqp && qp_type == IB_QPT_RC &&
 		   (init_attr->cap.max_send_wr > 255 ||
 		    init_attr->cap.max_recv_wr > 255 )) {
-		       ehca_err(pd->device, "Invalid Number of max_sq_wr =%x "
-				"or max_rq_wr=%x for QP Type=%x",
-				init_attr->cap.max_send_wr,
-				init_attr->cap.max_recv_wr,init_attr->qp_type);
-		       return ERR_PTR(-EINVAL);
-	} else if (init_attr->qp_type == IB_QPT_UD && isdaqp &&
-		  init_attr->cap.max_send_wr > 255) {
+		ehca_err(pd->device, "Invalid Number of max_sq_wr=%x "
+			 "or max_rq_wr=%x for RC LLQP",
+			 init_attr->cap.max_send_wr,
+			 init_attr->cap.max_recv_wr);
+		return ERR_PTR(-EINVAL);
+	} else if (is_llqp && qp_type == IB_QPT_UD &&
+		 init_attr->cap.max_send_wr > 255) {
 		ehca_err(pd->device,
 			 "Invalid Number of max_send_wr=%x for UD QP_TYPE=%x",
-			 init_attr->cap.max_send_wr, init_attr->qp_type);
+			 init_attr->cap.max_send_wr, qp_type);
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -456,7 +421,6 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	memset (&parms, 0, sizeof(struct ehca_alloc_qp_parms));
 	spin_lock_init(&my_qp->spinlock_s);
 	spin_lock_init(&my_qp->spinlock_r);
 
@@ -465,8 +429,6 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	my_qp->send_cq =
 		container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
 
-	my_qp->init_attr = *init_attr;
-
 	do {
 		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
 			ret = -ENOMEM;
@@ -486,10 +448,10 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		goto create_qp_exit0;
 	}
 
-	parms.servicetype = ibqptype2servicetype(init_attr->qp_type);
+	parms.servicetype = ibqptype2servicetype(qp_type);
 	if (parms.servicetype < 0) {
 		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid qp_type=%x", init_attr->qp_type);
+		ehca_err(pd->device, "Invalid qp_type=%x", qp_type);
 		goto create_qp_exit0;
 	}
 
@@ -501,21 +463,23 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	/* UD_AV CIRCUMVENTION */
 	max_send_sge = init_attr->cap.max_send_sge;
 	max_recv_sge = init_attr->cap.max_recv_sge;
-	if (IB_QPT_UD == init_attr->qp_type ||
-	    IB_QPT_GSI == init_attr->qp_type ||
-	    IB_QPT_SMI == init_attr->qp_type) {
+	if (parms.servicetype == ST_UD) {
 		max_send_sge += 2;
 		max_recv_sge += 2;
 	}
 
-	parms.ipz_eq_handle = shca->eq.ipz_eq_handle;
-	parms.daqp_ctrl = isdaqp | daqp_completion;
+	parms.token = my_qp->token;
+	parms.eq_handle = shca->eq.ipz_eq_handle;
 	parms.pd = my_pd->fw_pd;
-	parms.max_recv_sge = max_recv_sge;
-	parms.max_send_sge = max_send_sge;
+	parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle;
+	parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle;
 
-	h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, my_qp, &parms);
+	parms.max_send_wr = init_attr->cap.max_send_wr;
+	parms.max_recv_wr = init_attr->cap.max_recv_wr;
+	parms.max_send_sge = max_send_sge;
+	parms.max_recv_sge = max_recv_sge;
 
+	h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms);
 	if (h_ret != H_SUCCESS) {
 		ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%lx",
 			 h_ret);
@@ -523,16 +487,18 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		goto create_qp_exit1;
 	}
 
-	my_qp->ib_qp.qp_num = my_qp->real_qp_num;
+	my_qp->ib_qp.qp_num = my_qp->real_qp_num = parms.real_qp_num;
+	my_qp->ipz_qp_handle = parms.qp_handle;
+	my_qp->galpas = parms.galpas;
 
-	switch (init_attr->qp_type) {
+	switch (qp_type) {
 	case IB_QPT_RC:
-	        if (isdaqp == 0) {
+	        if (!is_llqp) {
 			swqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
 					     (parms.act_nr_send_sges)]);
 			rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
 					     (parms.act_nr_recv_sges)]);
-		} else { /* for daqp we need to use msg size, not wqe size */
+		} else { /* for LLQP we need to use msg size, not wqe size */
 		        swqe_size = da_rc_msg_size[max_send_sge];
 			rwqe_size = da_rc_msg_size[max_recv_sge];
 			parms.act_nr_send_sges = 1;
@@ -552,7 +518,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		/* UD circumvention */
 		parms.act_nr_recv_sges -= 2;
 		parms.act_nr_send_sges -= 2;
-		if (isdaqp) {
+		if (is_llqp) {
 		        swqe_size = da_ud_sq_msg_size[max_send_sge];
 			rwqe_size = da_rc_msg_size[max_recv_sge];
 			parms.act_nr_send_sges = 1;
@@ -564,14 +530,12 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 					     u.ud_av.sg_list[parms.act_nr_recv_sges]);
 		}
 
-		if (IB_QPT_GSI == init_attr->qp_type ||
-		    IB_QPT_SMI == init_attr->qp_type) {
+		if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) {
 			parms.act_nr_send_wqes = init_attr->cap.max_send_wr;
 			parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr;
 			parms.act_nr_send_sges = init_attr->cap.max_send_sge;
 			parms.act_nr_recv_sges = init_attr->cap.max_recv_sge;
-			my_qp->ib_qp.qp_num =
-				(init_attr->qp_type == IB_QPT_SMI) ? 0 : 1;
+			my_qp->ib_qp.qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1;
 		}
 
 		break;
@@ -580,26 +544,33 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		break;
 	}
 
-	/* initializes r/squeue and registers queue pages */
-	ret = init_qp_queues(shca, my_qp,
-			     parms.nr_sq_pages, parms.nr_rq_pages,
-			     swqe_size, rwqe_size,
-			     parms.act_nr_send_sges, parms.act_nr_recv_sges);
+	/* initialize r/squeue and register queue pages */
+	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_squeue, 0,
+			    has_srq ? H_SUCCESS : H_PAGE_REGISTERED,
+			    parms.nr_sq_pages, swqe_size,
+			    parms.act_nr_send_sges);
 	if (ret) {
 		ehca_err(pd->device,
-			 "Couldn't initialize r/squeue and pages ret=%x", ret);
+			 "Couldn't initialize squeue and pages ret=%x", ret);
 		goto create_qp_exit2;
 	}
 
+	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_rqueue, 1, H_SUCCESS,
+			    parms.nr_rq_pages, rwqe_size,
+			    parms.act_nr_recv_sges);
+	if (ret) {
+		ehca_err(pd->device,
+			 "Couldn't initialize rqueue and pages ret=%x", ret);
+		goto create_qp_exit3;
+	}
+
 	my_qp->ib_qp.pd = &my_pd->ib_pd;
 	my_qp->ib_qp.device = my_pd->ib_pd.device;
 
 	my_qp->ib_qp.recv_cq = init_attr->recv_cq;
 	my_qp->ib_qp.send_cq = init_attr->send_cq;
 
-	my_qp->ib_qp.qp_type = init_attr->qp_type;
-
-	my_qp->qp_type = init_attr->qp_type;
+	my_qp->ib_qp.qp_type = my_qp->qp_type = qp_type;
 	my_qp->ib_qp.srq = init_attr->srq;
 
 	my_qp->ib_qp.qp_context = init_attr->qp_context;
@@ -610,15 +581,16 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	init_attr->cap.max_recv_wr = parms.act_nr_recv_wqes;
 	init_attr->cap.max_send_sge = parms.act_nr_send_sges;
 	init_attr->cap.max_send_wr = parms.act_nr_send_wqes;
+	my_qp->init_attr = *init_attr;
 
 	/* NOTE: define_apq0() not supported yet */
-	if (init_attr->qp_type == IB_QPT_GSI) {
+	if (qp_type == IB_QPT_GSI) {
 		h_ret = ehca_define_sqp(shca, my_qp, init_attr);
 		if (h_ret != H_SUCCESS) {
 			ehca_err(pd->device, "ehca_define_sqp() failed rc=%lx",
 				 h_ret);
 			ret = ehca2ib_return_code(h_ret);
-			goto create_qp_exit3;
+			goto create_qp_exit4;
 		}
 	}
 	if (init_attr->send_cq) {
@@ -628,7 +600,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		if (ret) {
 			ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x",
 				 ret);
-			goto create_qp_exit3;
+			goto create_qp_exit4;
 		}
 		my_qp->send_cq = cq;
 	}
@@ -659,14 +631,16 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
 			ehca_err(pd->device, "Copy to udata failed");
 			ret = -EINVAL;
-			goto create_qp_exit3;
+			goto create_qp_exit4;
 		}
 	}
 
 	return &my_qp->ib_qp;
 
-create_qp_exit3:
+create_qp_exit4:
 	ipz_queue_dtor(&my_qp->ipz_rqueue);
+
+create_qp_exit3:
 	ipz_queue_dtor(&my_qp->ipz_squeue);
 
 create_qp_exit2:
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 5766ae3..7efc4a2 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -74,11 +74,6 @@
 #define H_MP_SHUTDOWN                   EHCA_BMASK_IBM(48, 48)
 #define H_MP_RESET_QKEY_CTR             EHCA_BMASK_IBM(49, 49)
 
-/* direct access qp controls */
-#define DAQP_CTRL_ENABLE    0x01
-#define DAQP_CTRL_SEND_COMP 0x20
-#define DAQP_CTRL_RECV_COMP 0x40
-
 static u32 get_longbusy_msecs(int longbusy_rc)
 {
 	switch (longbusy_rc) {
@@ -284,36 +279,31 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle,
 }
 
 u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
-			     struct ehca_qp *qp,
 			     struct ehca_alloc_qp_parms *parms)
 {
 	u64 ret;
 	u64 allocate_controls;
 	u64 max_r10_reg;
 	u64 outs[PLPAR_HCALL9_BUFSIZE];
-	u16 max_nr_receive_wqes = qp->init_attr.cap.max_recv_wr + 1;
-	u16 max_nr_send_wqes = qp->init_attr.cap.max_send_wr + 1;
-	int daqp_ctrl = parms->daqp_ctrl;
 
 	allocate_controls =
-		EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS,
-			       (daqp_ctrl & DAQP_CTRL_ENABLE) ? 1 : 0)
+		EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS, parms->ext_type)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, parms->servicetype)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, parms->sigtype)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING,
-				 (daqp_ctrl & DAQP_CTRL_RECV_COMP) ? 1 : 0)
+				 !!(parms->ll_comp_flags & LLQP_RECV_COMP))
 		| EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING,
-				 (daqp_ctrl & DAQP_CTRL_SEND_COMP) ? 1 : 0)
+				 !!(parms->ll_comp_flags & LLQP_SEND_COMP))
 		| EHCA_BMASK_SET(H_ALL_RES_QP_UD_AV_LKEY_CTRL,
 				 parms->ud_av_l_key_ctl)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_RESOURCE_TYPE, 1);
 
 	max_r10_reg =
 		EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR,
-			       max_nr_send_wqes)
+			       parms->max_send_wr + 1)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR,
-				 max_nr_receive_wqes)
+				 parms->max_recv_wr + 1)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE,
 				 parms->max_send_sge)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE,
@@ -322,15 +312,16 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 	ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs,
 				adapter_handle.handle,	           /* r4  */
 				allocate_controls,	           /* r5  */
-				qp->send_cq->ipz_cq_handle.handle,
-				qp->recv_cq->ipz_cq_handle.handle,
-				parms->ipz_eq_handle.handle,
-				((u64)qp->token << 32) | parms->pd.value,
+				parms->send_cq_handle.handle,
+				parms->recv_cq_handle.handle,
+				parms->eq_handle.handle,
+				((u64)parms->token << 32) | parms->pd.value,
 				max_r10_reg,	                   /* r10 */
 				parms->ud_av_l_key_ctl,            /* r11 */
 				0);
-	qp->ipz_qp_handle.handle = outs[0];
-	qp->real_qp_num = (u32)outs[1];
+
+	parms->qp_handle.handle = outs[0];
+	parms->real_qp_num = (u32)outs[1];
 	parms->act_nr_send_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]);
 	parms->act_nr_recv_wqes =
@@ -345,7 +336,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 		(u32)EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, outs[4]);
 
 	if (ret == H_SUCCESS)
-		hcp_galpas_ctor(&qp->galpas, outs[6], outs[6]);
+		hcp_galpas_ctor(&parms->galpas, outs[6], outs[6]);
 
 	if (ret == H_NOT_ENOUGH_RESOURCES)
 		ehca_gen_err("Not enough resources. ret=%lx", ret);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h
index 2869f7d..60ce02b 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.h
+++ b/drivers/infiniband/hw/ehca/hcp_if.h
@@ -78,7 +78,6 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle,
  * initialize resources, create empty QPPTs (2 rings).
  */
 u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
-			     struct ehca_qp *qp,
 			     struct ehca_alloc_qp_parms *parms);
 
 u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle,
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:25:10 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:25:10 +0200
Subject: [ofa-general] [PATCH 04/13] IB/ehca: add Shared Receive Queue
	support
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091525.11777.fenkes@de.ibm.com>

Support SRQs on eHCA2. Since an SRQ is a QP for eHCA2, a lot of code
(structures, create, destroy, post_recv) can be shared between QP and SRQ.
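
To illustrate the sharing idea, here is a minimal standalone sketch, not the
driver code itself. The types struct demo_qp and the demo_* helpers are
hypothetical stand-ins invented for this example; only the union layout, the
ext_type values and the SRQ/SQ/RQ predicates mirror what the patch below adds
to ehca_classes.h. Because ib_qp and ib_srq sit in one union at the start of
the object, either verbs handle can be mapped back to the same driver object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the verbs structures; the real definitions
 * live in <rdma/ib_verbs.h> and carry many more fields. */
struct ib_qp  { unsigned int qp_num; };
struct ib_srq { void *srq_context; };

/* Same values as enum ehca_ext_qp_type in the patch. */
enum demo_ext_qp_type {
	EQPT_NORMAL  = 0,
	EQPT_LLQP    = 1,
	EQPT_SRQBASE = 2,	/* QP attached to an SRQ: no receive queue of its own */
	EQPT_SRQ     = 3,	/* the SRQ itself: no send queue */
};

/* One driver object serves as QP or SRQ (illustrative layout only). */
struct demo_qp {
	union {
		struct ib_qp ib_qp;
		struct ib_srq ib_srq;
	};
	enum demo_ext_qp_type ext_type;
};

/* Both handles start at the same offset, so one object backs either view. */
static_assert(offsetof(struct demo_qp, ib_qp) ==
	      offsetof(struct demo_qp, ib_srq),
	      "ib_qp and ib_srq must overlay each other");

/* Userspace equivalent of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static inline struct demo_qp *demo_from_ib_qp(struct ib_qp *qp)
{
	return demo_container_of(qp, struct demo_qp, ib_qp);
}

static inline struct demo_qp *demo_from_ib_srq(struct ib_srq *srq)
{
	return demo_container_of(srq, struct demo_qp, ib_srq);
}

/* Function forms of the IS_SRQ/HAS_SQ/HAS_RQ macros from the patch. */
static inline int demo_is_srq(const struct demo_qp *qp)
{
	return qp->ext_type == EQPT_SRQ;
}

static inline int demo_has_sq(const struct demo_qp *qp)
{
	return qp->ext_type != EQPT_SRQ;
}

static inline int demo_has_rq(const struct demo_qp *qp)
{
	return qp->ext_type != EQPT_SRQBASE;
}
```

Shared code paths can then test demo_has_sq()/demo_has_rq() before touching a
queue instead of being duplicated per object type, which is roughly how the
create and destroy paths below guard their queue setup and teardown.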

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h         |   26 +-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |    4 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |   15 +
 drivers/infiniband/hw/ehca/ehca_main.c            |   16 +-
 drivers/infiniband/hw/ehca/ehca_qp.c              |  451 +++++++++++++++++----
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   47 ++-
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |    4 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |   23 +-
 drivers/infiniband/hw/ehca/hipz_hw.h              |    1 +
 9 files changed, 480 insertions(+), 107 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 6e75db6..9d689ae 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -5,6 +5,7 @@
  *
  *  Authors: Heiko J Schick <schickhj at de.ibm.com>
  *           Christoph Raisch <raisch at de.ibm.com>
+ *           Joachim Fenkes <fenkes at de.ibm.com>
  *
  *  Copyright (c) 2005 IBM Corporation
  *
@@ -117,9 +118,20 @@ struct ehca_pd {
 	u32 ownpid;
 };
 
+enum ehca_ext_qp_type {
+	EQPT_NORMAL    = 0,
+	EQPT_LLQP      = 1,
+	EQPT_SRQBASE   = 2,
+	EQPT_SRQ       = 3,
+};
+
 struct ehca_qp {
-	struct ib_qp ib_qp;
+	union {
+		struct ib_qp ib_qp;
+		struct ib_srq ib_srq;
+	};
 	u32 qp_type;
+	enum ehca_ext_qp_type ext_type;
 	struct ipz_queue ipz_squeue;
 	struct ipz_queue ipz_rqueue;
 	struct h_galpas galpas;
@@ -142,6 +154,10 @@ struct ehca_qp {
 	u32 mm_count_galpa;
 };
 
+#define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ)
+#define HAS_SQ(qp) (qp->ext_type != EQPT_SRQ)
+#define HAS_RQ(qp) (qp->ext_type != EQPT_SRQBASE)
+
 /* must be power of 2 */
 #define QP_HASHTAB_LEN 8
 
@@ -307,6 +323,7 @@ struct ehca_create_qp_resp {
 	u32 qp_num;
 	u32 token;
 	u32 qp_type;
+	u32 ext_type;
 	u32 qkey;
 	/* qp_num assigned by ehca: sqp0/1 may have got different numbers */
 	u32 real_qp_num;
@@ -329,13 +346,6 @@ enum ehca_service_type {
 	ST_UD  = 3,
 };
 
-enum ehca_ext_qp_type {
-	EQPT_NORMAL    = 0,
-	EQPT_LLQP      = 1,
-	EQPT_SRQBASE   = 2,
-	EQPT_SRQ       = 3,
-};
-
 enum ehca_ll_comp_flags {
 	LLQP_SEND_COMP = 0x20,
 	LLQP_RECV_COMP = 0x40,
diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
index 5665f21..fb3df5c 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
@@ -228,8 +228,8 @@ struct hcp_modify_qp_control_block {
 #define MQPCB_QP_NUMBER                         EHCA_BMASK_IBM(8,31)
 #define MQPCB_MASK_QP_ENABLE                    EHCA_BMASK_IBM(48,48)
 #define MQPCB_QP_ENABLE                         EHCA_BMASK_IBM(31,31)
-#define MQPCB_MASK_CURR_SQR_LIMIT               EHCA_BMASK_IBM(49,49)
-#define MQPCB_CURR_SQR_LIMIT                    EHCA_BMASK_IBM(15,31)
+#define MQPCB_MASK_CURR_SRQ_LIMIT               EHCA_BMASK_IBM(49,49)
+#define MQPCB_CURR_SRQ_LIMIT                    EHCA_BMASK_IBM(16,31)
 #define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG       EHCA_BMASK_IBM(50,50)
 #define MQPCB_MASK_SHARED_RQ_HNDL               EHCA_BMASK_IBM(51,51)
 
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 37e7fe0..fd84a80 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -154,6 +154,21 @@ int ehca_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr,
 int ehca_post_recv(struct ib_qp *qp, struct ib_recv_wr *recv_wr,
 		   struct ib_recv_wr **bad_recv_wr);
 
+int ehca_post_srq_recv(struct ib_srq *srq,
+		       struct ib_recv_wr *recv_wr,
+		       struct ib_recv_wr **bad_recv_wr);
+
+struct ib_srq *ehca_create_srq(struct ib_pd *pd,
+			       struct ib_srq_init_attr *init_attr,
+			       struct ib_udata *udata);
+
+int ehca_modify_srq(struct ib_srq *srq, struct ib_srq_attr *attr,
+		    enum ib_srq_attr_mask attr_mask, struct ib_udata *udata);
+
+int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr);
+
+int ehca_destroy_srq(struct ib_srq *srq);
+
 u64 ehca_define_sqp(struct ehca_shca *shca, struct ehca_qp *ibqp,
 		    struct ib_qp_init_attr *qp_init_attr);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index befbb9c..9bd749c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -343,7 +343,7 @@ int ehca_init_device(struct ehca_shca *shca)
 	strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX);
 	shca->ib_device.owner               = THIS_MODULE;
 
-	shca->ib_device.uverbs_abi_ver	    = 6;
+	shca->ib_device.uverbs_abi_ver	    = 7;
 	shca->ib_device.uverbs_cmd_mask	    =
 		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
 		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
@@ -411,6 +411,20 @@ int ehca_init_device(struct ehca_shca *shca)
 	/* shca->ib_device.process_mad	    = ehca_process_mad;	    */
 	shca->ib_device.mmap		    = ehca_mmap;
 
+	if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) {
+		shca->ib_device.uverbs_cmd_mask |=
+			(1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+			(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+			(1ull << IB_USER_VERBS_CMD_QUERY_SRQ) |
+			(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
+
+		shca->ib_device.create_srq          = ehca_create_srq;
+		shca->ib_device.modify_srq          = ehca_modify_srq;
+		shca->ib_device.query_srq           = ehca_query_srq;
+		shca->ib_device.destroy_srq         = ehca_destroy_srq;
+		shca->ib_device.post_srq_recv       = ehca_post_srq_recv;
+	}
+
 	return ret;
 }
 
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index ec1d555..9486a44 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -3,7 +3,9 @@
  *
  *  QP functions
  *
- *  Authors: Waleri Fomin <fomin at de.ibm.com>
+ *  Authors: Joachim Fenkes <fenkes at de.ibm.com>
+ *           Stefan Roscher <stefan.roscher at de.ibm.com>
+ *           Waleri Fomin <fomin at de.ibm.com>
  *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
  *           Reinhard Ernst <rernst at de.ibm.com>
  *           Heiko J Schick <schickhj at de.ibm.com>
@@ -261,6 +263,19 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype)
 }
 
 /*
+ * init userspace queue info from ipz_queue data
+ */
+static inline void queue2resp(struct ipzu_queue_resp *resp,
+			      struct ipz_queue *queue)
+{
+	resp->qe_size = queue->qe_size;
+	resp->act_nr_of_sg = queue->act_nr_of_sg;
+	resp->queue_length = queue->queue_length;
+	resp->pagesize = queue->pagesize;
+	resp->toggle_state = queue->toggle_state;
+}
+
+/*
  * init_qp_queue initializes/constructs r/squeue and registers queue pages.
  */
 static inline int init_qp_queue(struct ehca_shca *shca,
@@ -338,11 +353,17 @@ init_qp_queue1:
 	return ret;
 }
 
-struct ib_qp *ehca_create_qp(struct ib_pd *pd,
-			     struct ib_qp_init_attr *init_attr,
-			     struct ib_udata *udata)
+/*
+ * Create an ib_qp struct that is either a QP or an SRQ, depending on
+ * the value of the is_srq parameter. If init_attr and srq_init_attr share
+ * fields, the field out of init_attr is used.
+ */
+struct ehca_qp *internal_create_qp(struct ib_pd *pd,
+				   struct ib_qp_init_attr *init_attr,
+				   struct ib_srq_init_attr *srq_init_attr,
+				   struct ib_udata *udata, int is_srq)
 {
-	static int da_rc_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 };
+	static int da_rc_msg_size[] = { 128, 256, 512, 1024, 2048, 4096 };
 	static int da_ud_sq_msg_size[]={ 128, 384, 896, 1920, 3968 };
 	struct ehca_qp *my_qp;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
@@ -355,7 +376,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 
 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
-	u32 swqe_size = 0, rwqe_size = 0;
+	u32 swqe_size = 0, rwqe_size = 0, ib_qp_num;
 	unsigned long flags;
 
 	memset(&parms, 0, sizeof(parms));
@@ -376,13 +397,34 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	}
 	qp_type &= 0x1F;
 
-	/* check for SRQ */
-	has_srq = !!(init_attr->srq);
+	/* handle SRQ base QPs */
+	if (init_attr->srq) {
+		struct ehca_qp *my_srq =
+			container_of(init_attr->srq, struct ehca_qp, ib_srq);
+
+		has_srq = 1;
+		parms.ext_type = EQPT_SRQBASE;
+		parms.srq_qpn = my_srq->real_qp_num;
+		parms.srq_token = my_srq->token;
+	}
+
 	if (is_llqp && has_srq) {
 		ehca_err(pd->device, "LLQPs can't have an SRQ");
 		return ERR_PTR(-EINVAL);
 	}
 
+	/* handle SRQs */
+	if (is_srq) {
+		parms.ext_type = EQPT_SRQ;
+		parms.srq_limit = srq_init_attr->attr.srq_limit;
+		if (init_attr->cap.max_recv_sge > 3) {
+			ehca_err(pd->device, "no more than three SGEs "
+				 "supported for SRQ  pd=%p  max_sge=%x",
+				 pd, init_attr->cap.max_recv_sge);
+			return ERR_PTR(-EINVAL);
+		}
+	}
+
 	/* check QP type */
 	if (qp_type != IB_QPT_UD &&
 	    qp_type != IB_QPT_UC &&
@@ -423,11 +465,15 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 
 	spin_lock_init(&my_qp->spinlock_s);
 	spin_lock_init(&my_qp->spinlock_r);
+	my_qp->qp_type = qp_type;
+	my_qp->ext_type = parms.ext_type;
 
-	my_qp->recv_cq =
-		container_of(init_attr->recv_cq, struct ehca_cq, ib_cq);
-	my_qp->send_cq =
-		container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
+	if (init_attr->recv_cq)
+		my_qp->recv_cq =
+			container_of(init_attr->recv_cq, struct ehca_cq, ib_cq);
+	if (init_attr->send_cq)
+		my_qp->send_cq =
+			container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
 
 	do {
 		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
@@ -471,8 +517,10 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	parms.token = my_qp->token;
 	parms.eq_handle = shca->eq.ipz_eq_handle;
 	parms.pd = my_pd->fw_pd;
-	parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle;
-	parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle;
+	if (my_qp->send_cq)
+		parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle;
+	if (my_qp->recv_cq)
+		parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle;
 
 	parms.max_send_wr = init_attr->cap.max_send_wr;
 	parms.max_recv_wr = init_attr->cap.max_recv_wr;
@@ -487,7 +535,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		goto create_qp_exit1;
 	}
 
-	my_qp->ib_qp.qp_num = my_qp->real_qp_num = parms.real_qp_num;
+	ib_qp_num = my_qp->real_qp_num = parms.real_qp_num;
 	my_qp->ipz_qp_handle = parms.qp_handle;
 	my_qp->galpas = parms.galpas;
 
@@ -535,7 +583,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr;
 			parms.act_nr_send_sges = init_attr->cap.max_send_sge;
 			parms.act_nr_recv_sges = init_attr->cap.max_recv_sge;
-			my_qp->ib_qp.qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1;
+			ib_qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1;
 		}
 
 		break;
@@ -545,36 +593,51 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	}
 
 	/* initialize r/squeue and register queue pages */
-	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_squeue, 0,
-			    has_srq ? H_SUCCESS : H_PAGE_REGISTERED,
-			    parms.nr_sq_pages, swqe_size,
-			    parms.act_nr_send_sges);
-	if (ret) {
-		ehca_err(pd->device,
-			 "Couldn't initialize squeue and pages ret=%x", ret);
-		goto create_qp_exit2;
+	if (HAS_SQ(my_qp)) {
+		ret = init_qp_queue(
+			shca, my_qp, &my_qp->ipz_squeue, 0,
+			HAS_RQ(my_qp) ? H_PAGE_REGISTERED : H_SUCCESS,
+			parms.nr_sq_pages, swqe_size,
+			parms.act_nr_send_sges);
+		if (ret) {
+			ehca_err(pd->device, "Couldn't initialize squeue "
+				 "and pages  ret=%x", ret);
+			goto create_qp_exit2;
+		}
 	}
 
-	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_rqueue, 1, H_SUCCESS,
-			    parms.nr_rq_pages, rwqe_size,
-			    parms.act_nr_recv_sges);
-	if (ret) {
-		ehca_err(pd->device,
-			 "Couldn't initialize rqueue and pages ret=%x", ret);
-		goto create_qp_exit3;
+	if (HAS_RQ(my_qp)) {
+		ret = init_qp_queue(
+			shca, my_qp, &my_qp->ipz_rqueue, 1,
+			H_SUCCESS, parms.nr_rq_pages, rwqe_size,
+			parms.act_nr_recv_sges);
+		if (ret) {
+			ehca_err(pd->device, "Couldn't initialize rqueue "
+				 "and pages ret=%x", ret);
+			goto create_qp_exit3;
+		}
 	}
 
-	my_qp->ib_qp.pd = &my_pd->ib_pd;
-	my_qp->ib_qp.device = my_pd->ib_pd.device;
+	if (is_srq) {
+		my_qp->ib_srq.pd = &my_pd->ib_pd;
+		my_qp->ib_srq.device = my_pd->ib_pd.device;
 
-	my_qp->ib_qp.recv_cq = init_attr->recv_cq;
-	my_qp->ib_qp.send_cq = init_attr->send_cq;
+		my_qp->ib_srq.srq_context = init_attr->qp_context;
+		my_qp->ib_srq.event_handler = init_attr->event_handler;
+	} else {
+		my_qp->ib_qp.qp_num = ib_qp_num;
+		my_qp->ib_qp.pd = &my_pd->ib_pd;
+		my_qp->ib_qp.device = my_pd->ib_pd.device;
+
+		my_qp->ib_qp.recv_cq = init_attr->recv_cq;
+		my_qp->ib_qp.send_cq = init_attr->send_cq;
 
-	my_qp->ib_qp.qp_type = my_qp->qp_type = qp_type;
-	my_qp->ib_qp.srq = init_attr->srq;
+		my_qp->ib_qp.qp_type = qp_type;
+		my_qp->ib_qp.srq = init_attr->srq;
 
-	my_qp->ib_qp.qp_context = init_attr->qp_context;
-	my_qp->ib_qp.event_handler = init_attr->event_handler;
+		my_qp->ib_qp.qp_context = init_attr->qp_context;
+		my_qp->ib_qp.event_handler = init_attr->event_handler;
+	}
 
 	init_attr->cap.max_inline_data = 0; /* not supported yet */
 	init_attr->cap.max_recv_sge = parms.act_nr_recv_sges;
@@ -593,41 +656,32 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			goto create_qp_exit4;
 		}
 	}
-	if (init_attr->send_cq) {
-		struct ehca_cq *cq = container_of(init_attr->send_cq,
-						  struct ehca_cq, ib_cq);
-		ret = ehca_cq_assign_qp(cq, my_qp);
+
+	if (my_qp->send_cq) {
+		ret = ehca_cq_assign_qp(my_qp->send_cq, my_qp);
 		if (ret) {
 			ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x",
 				 ret);
 			goto create_qp_exit4;
 		}
-		my_qp->send_cq = cq;
 	}
+
 	/* copy queues, galpa data to user space */
 	if (context && udata) {
-		struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue;
-		struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue;
 		struct ehca_create_qp_resp resp;
 		memset(&resp, 0, sizeof(resp));
 
 		resp.qp_num = my_qp->real_qp_num;
 		resp.token = my_qp->token;
 		resp.qp_type = my_qp->qp_type;
+		resp.ext_type = my_qp->ext_type;
 		resp.qkey = my_qp->qkey;
 		resp.real_qp_num = my_qp->real_qp_num;
-		/* rqueue properties */
-		resp.ipz_rqueue.qe_size = ipz_rqueue->qe_size;
-		resp.ipz_rqueue.act_nr_of_sg = ipz_rqueue->act_nr_of_sg;
-		resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length;
-		resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize;
-		resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state;
-		/* squeue properties */
-		resp.ipz_squeue.qe_size = ipz_squeue->qe_size;
-		resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg;
-		resp.ipz_squeue.queue_length = ipz_squeue->queue_length;
-		resp.ipz_squeue.pagesize = ipz_squeue->pagesize;
-		resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state;
+		if (HAS_SQ(my_qp))
+			queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue);
+		if (HAS_RQ(my_qp))
+			queue2resp(&resp.ipz_rqueue, &my_qp->ipz_rqueue);
+
 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
 			ehca_err(pd->device, "Copy to udata failed");
 			ret = -EINVAL;
@@ -635,13 +689,15 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		}
 	}
 
-	return &my_qp->ib_qp;
+	return my_qp;
 
 create_qp_exit4:
-	ipz_queue_dtor(&my_qp->ipz_rqueue);
+	if (HAS_RQ(my_qp))
+		ipz_queue_dtor(&my_qp->ipz_rqueue);
 
 create_qp_exit3:
-	ipz_queue_dtor(&my_qp->ipz_squeue);
+	if (HAS_SQ(my_qp))
+		ipz_queue_dtor(&my_qp->ipz_squeue);
 
 create_qp_exit2:
 	hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
@@ -656,6 +712,114 @@ create_qp_exit0:
 	return ERR_PTR(ret);
 }
 
+struct ib_qp *ehca_create_qp(struct ib_pd *pd,
+			     struct ib_qp_init_attr *qp_init_attr,
+			     struct ib_udata *udata)
+{
+	struct ehca_qp *ret;
+
+	ret = internal_create_qp(pd, qp_init_attr, NULL, udata, 0);
+	return IS_ERR(ret) ? (struct ib_qp*)ret : &ret->ib_qp;
+}
+
+int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
+			struct ib_uobject *uobject);
+
+struct ib_srq *ehca_create_srq(struct ib_pd *pd,
+			       struct ib_srq_init_attr *srq_init_attr,
+			       struct ib_udata *udata)
+{
+	struct ib_qp_init_attr qp_init_attr;
+	struct ehca_qp *my_qp;
+	struct ib_srq *ret;
+	struct ehca_shca *shca = container_of(pd->device, struct ehca_shca,
+					      ib_device);
+	struct hcp_modify_qp_control_block *mqpcb;
+	u64 hret, update_mask;
+
+	/* For common attributes, internal_create_qp() takes its info
+	 * out of qp_init_attr, so copy all common attrs there.
+	 */
+	memset(&qp_init_attr, 0, sizeof(qp_init_attr));
+	qp_init_attr.event_handler = srq_init_attr->event_handler;
+	qp_init_attr.qp_context = srq_init_attr->srq_context;
+	qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
+	qp_init_attr.qp_type = IB_QPT_RC;
+	qp_init_attr.cap.max_recv_wr = srq_init_attr->attr.max_wr;
+	qp_init_attr.cap.max_recv_sge = srq_init_attr->attr.max_sge;
+
+	my_qp = internal_create_qp(pd, &qp_init_attr, srq_init_attr, udata, 1);
+	if (IS_ERR(my_qp))
+		return (struct ib_srq*)my_qp;
+
+	/* copy back return values */
+	srq_init_attr->attr.max_wr = qp_init_attr.cap.max_recv_wr;
+	srq_init_attr->attr.max_sge = qp_init_attr.cap.max_recv_sge;
+
+	/* drive SRQ into RTR state */
+	mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
+	if (!mqpcb) {
+		ehca_err(pd->device, "Could not get zeroed page for mqpcb "
+			 "ehca_qp=%p qp_num=%x ", my_qp, my_qp->real_qp_num);
+		ret = ERR_PTR(-ENOMEM);
+		goto create_srq1;
+	}
+
+	mqpcb->qp_state = EHCA_QPS_INIT;
+	mqpcb->prim_phys_port = 1;
+	update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1);
+	hret = hipz_h_modify_qp(shca->ipz_hca_handle,
+				my_qp->ipz_qp_handle,
+				&my_qp->pf,
+				update_mask,
+				mqpcb, my_qp->galpas.kernel);
+	if (hret != H_SUCCESS) {
+		ehca_err(pd->device, "Could not modify SRQ to INIT "
+			 "ehca_qp=%p qp_num=%x hret=%lx",
+			 my_qp, my_qp->real_qp_num, hret);
+		goto create_srq2;
+	}
+
+	mqpcb->qp_enable = 1;
+	update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_ENABLE, 1);
+	hret = hipz_h_modify_qp(shca->ipz_hca_handle,
+				my_qp->ipz_qp_handle,
+				&my_qp->pf,
+				update_mask,
+				mqpcb, my_qp->galpas.kernel);
+	if (hret != H_SUCCESS) {
+		ehca_err(pd->device, "Could not enable SRQ "
+			 "ehca_qp=%p qp_num=%x hret=%lx",
+			 my_qp, my_qp->real_qp_num, hret);
+		goto create_srq2;
+	}
+
+	mqpcb->qp_state  = EHCA_QPS_RTR;
+	update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1);
+	hret = hipz_h_modify_qp(shca->ipz_hca_handle,
+				my_qp->ipz_qp_handle,
+				&my_qp->pf,
+				update_mask,
+				mqpcb, my_qp->galpas.kernel);
+	if (hret != H_SUCCESS) {
+		ehca_err(pd->device, "Could not modify SRQ to RTR "
+			 "ehca_qp=%p qp_num=%x hret=%lx",
+			 my_qp, my_qp->real_qp_num, hret);
+		goto create_srq2;
+	}
+	ehca_free_fw_ctrlblock(mqpcb);
+	return &my_qp->ib_srq;
+
+create_srq2:
+	ret = ERR_PTR(ehca2ib_return_code(hret));
+	ehca_free_fw_ctrlblock(mqpcb);
+
+create_srq1:
+	internal_destroy_qp(pd->device, my_qp, my_qp->ib_srq.uobject);
+
+	return ret;
+}
+
 /*
  * prepare_sqe_rts called by internal_modify_qp() at trans sqe -> rts
  * set purge bit of bad wqe and subsequent wqes to avoid reentering sqe
@@ -1341,42 +1505,159 @@ query_qp_exit1:
 	return ret;
 }
 
-int ehca_destroy_qp(struct ib_qp *ibqp)
+int ehca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
+		    enum ib_srq_attr_mask attr_mask, struct ib_udata *udata)
 {
-	struct ehca_qp *my_qp = container_of(ibqp, struct ehca_qp, ib_qp);
-	struct ehca_shca *shca = container_of(ibqp->device, struct ehca_shca,
+	struct ehca_qp *my_qp =
+		container_of(ibsrq, struct ehca_qp, ib_srq);
+	struct ehca_pd *my_pd =
+		container_of(ibsrq->pd, struct ehca_pd, ib_pd);
+	struct ehca_shca *shca =
+		container_of(ibsrq->pd->device, struct ehca_shca, ib_device);
+	struct hcp_modify_qp_control_block *mqpcb;
+	u64 update_mask;
+	u64 h_ret;
+	int ret = 0;
+
+	u32 cur_pid = current->tgid;
+	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
+	    my_pd->ownpid != cur_pid) {
+		ehca_err(ibsrq->pd->device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_pd->ownpid);
+		return -EINVAL;
+	}
+
+	mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
+	if (!mqpcb) {
+		ehca_err(ibsrq->device, "Could not get zeroed page for mqpcb "
+			 "ehca_qp=%p qp_num=%x ", my_qp, my_qp->real_qp_num);
+		return -ENOMEM;
+	}
+
+	update_mask = 0;
+	if (attr_mask & IB_SRQ_LIMIT) {
+		attr_mask &= ~IB_SRQ_LIMIT;
+		update_mask |=
+			EHCA_BMASK_SET(MQPCB_MASK_CURR_SRQ_LIMIT, 1)
+			| EHCA_BMASK_SET(MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG, 1);
+		mqpcb->curr_srq_limit =
+			EHCA_BMASK_SET(MQPCB_CURR_SRQ_LIMIT, attr->srq_limit);
+		mqpcb->qp_aff_asyn_ev_log_reg =
+			EHCA_BMASK_SET(QPX_AAELOG_RESET_SRQ_LIMIT, 1);
+	}
+
+	/* by now, all bits in attr_mask should have been cleared */
+	if (attr_mask) {
+		ehca_err(ibsrq->device, "invalid attribute mask bits set  "
+			 "attr_mask=%x", attr_mask);
+		ret = -EINVAL;
+		goto modify_srq_exit0;
+	}
+
+	if (ehca_debug_level)
+		ehca_dmp(mqpcb, 4*70, "qp_num=%x", my_qp->real_qp_num);
+
+	h_ret = hipz_h_modify_qp(shca->ipz_hca_handle, my_qp->ipz_qp_handle,
+				 NULL, update_mask, mqpcb,
+				 my_qp->galpas.kernel);
+
+	if (h_ret != H_SUCCESS) {
+		ret = ehca2ib_return_code(h_ret);
+		ehca_err(ibsrq->device, "hipz_h_modify_qp() failed rc=%lx "
+			 "ehca_qp=%p qp_num=%x",
+			 h_ret, my_qp, my_qp->real_qp_num);
+	}
+
+modify_srq_exit0:
+	ehca_free_fw_ctrlblock(mqpcb);
+
+	return ret;
+}
+
+int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr)
+{
+	struct ehca_qp *my_qp = container_of(srq, struct ehca_qp, ib_srq);
+	struct ehca_pd *my_pd = container_of(srq->pd, struct ehca_pd, ib_pd);
+	struct ehca_shca *shca = container_of(srq->device, struct ehca_shca,
 					      ib_device);
+	struct ipz_adapter_handle adapter_handle = shca->ipz_hca_handle;
+	struct hcp_modify_qp_control_block *qpcb;
+	u32 cur_pid = current->tgid;
+	int ret = 0;
+	u64 h_ret;
+
+	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
+	    my_pd->ownpid != cur_pid) {
+		ehca_err(srq->device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_pd->ownpid);
+		return -EINVAL;
+	}
+
+	qpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
+	if (!qpcb) {
+		ehca_err(srq->device, "Out of memory for qpcb "
+			 "ehca_qp=%p qp_num=%x", my_qp, my_qp->real_qp_num);
+		return -ENOMEM;
+	}
+
+	h_ret = hipz_h_query_qp(adapter_handle, my_qp->ipz_qp_handle,
+				NULL, qpcb, my_qp->galpas.kernel);
+
+	if (h_ret != H_SUCCESS) {
+		ret = ehca2ib_return_code(h_ret);
+		ehca_err(srq->device, "hipz_h_query_qp() failed "
+			 "ehca_qp=%p qp_num=%x h_ret=%lx",
+			 my_qp, my_qp->real_qp_num, h_ret);
+		goto query_srq_exit1;
+	}
+
+	srq_attr->max_wr = qpcb->max_nr_outst_recv_wr - 1;
+	srq_attr->srq_limit = EHCA_BMASK_GET(
+		MQPCB_CURR_SRQ_LIMIT, qpcb->curr_srq_limit);
+
+	if (ehca_debug_level)
+		ehca_dmp(qpcb, 4*70, "qp_num=%x", my_qp->real_qp_num);
+
+query_srq_exit1:
+	ehca_free_fw_ctrlblock(qpcb);
+
+	return ret;
+}
+
+int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
+			struct ib_uobject *uobject)
+{
+	struct ehca_shca *shca = container_of(dev, struct ehca_shca, ib_device);
 	struct ehca_pd *my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd,
 					     ib_pd);
 	u32 cur_pid = current->tgid;
-	u32 qp_num = ibqp->qp_num;
+	u32 qp_num = my_qp->real_qp_num;
 	int ret;
 	u64 h_ret;
 	u8 port_num;
 	enum ib_qp_type	qp_type;
 	unsigned long flags;
 
-	if (ibqp->uobject) {
+	if (uobject) {
 		if (my_qp->mm_count_galpa ||
 		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
-			ehca_err(ibqp->device, "Resources still referenced in "
-				 "user space qp_num=%x", ibqp->qp_num);
+			ehca_err(dev, "Resources still referenced in "
+				 "user space qp_num=%x", qp_num);
 			return -EINVAL;
 		}
 		if (my_pd->ownpid != cur_pid) {
-			ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+			ehca_err(dev, "Invalid caller pid=%x ownpid=%x",
 				 cur_pid, my_pd->ownpid);
 			return -EINVAL;
 		}
 	}
 
 	if (my_qp->send_cq) {
-		ret = ehca_cq_unassign_qp(my_qp->send_cq,
-					      my_qp->real_qp_num);
+		ret = ehca_cq_unassign_qp(my_qp->send_cq, qp_num);
 		if (ret) {
-			ehca_err(ibqp->device, "Couldn't unassign qp from "
+			ehca_err(dev, "Couldn't unassign qp from "
 				 "send_cq ret=%x qp_num=%x cq_num=%x", ret,
-				 my_qp->ib_qp.qp_num, my_qp->send_cq->cq_number);
+				 qp_num, my_qp->send_cq->cq_number);
 			return ret;
 		}
 	}
@@ -1387,7 +1668,7 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
 
 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
 	if (h_ret != H_SUCCESS) {
-		ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx "
+		ehca_err(dev, "hipz_h_destroy_qp() failed rc=%lx "
 			 "ehca_qp=%p qp_num=%x", h_ret, my_qp, qp_num);
 		return ehca2ib_return_code(h_ret);
 	}
@@ -1398,7 +1679,7 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
 	/* no support for IB_QPT_SMI yet */
 	if (qp_type == IB_QPT_GSI) {
 		struct ib_event event;
-		ehca_info(ibqp->device, "device %s: port %x is inactive.",
+		ehca_info(dev, "device %s: port %x is inactive.",
 			  shca->ib_device.name, port_num);
 		event.device = &shca->ib_device;
 		event.event = IB_EVENT_PORT_ERR;
@@ -1407,12 +1688,28 @@ int ehca_destroy_qp(struct ib_qp *ibqp)
 		ib_dispatch_event(&event);
 	}
 
-	ipz_queue_dtor(&my_qp->ipz_rqueue);
-	ipz_queue_dtor(&my_qp->ipz_squeue);
+	if (HAS_RQ(my_qp))
+		ipz_queue_dtor(&my_qp->ipz_rqueue);
+	if (HAS_SQ(my_qp))
+		ipz_queue_dtor(&my_qp->ipz_squeue);
 	kmem_cache_free(qp_cache, my_qp);
 	return 0;
 }
 
+int ehca_destroy_qp(struct ib_qp *qp)
+{
+	return internal_destroy_qp(qp->device,
+				   container_of(qp, struct ehca_qp, ib_qp),
+				   qp->uobject);
+}
+
+int ehca_destroy_srq(struct ib_srq *srq)
+{
+	return internal_destroy_qp(srq->device,
+				   container_of(srq, struct ehca_qp, ib_srq),
+				   srq->uobject);
+}
+
 int ehca_init_qp_cache(void)
 {
 	qp_cache = kmem_cache_create("ehca_cache_qp",
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index 56c4527..b5664fa 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -3,8 +3,9 @@
  *
  *  post_send/recv, poll_cq, req_notify
  *
- *  Authors: Waleri Fomin <fomin at de.ibm.com>
- *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *  Authors: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *           Waleri Fomin <fomin at de.ibm.com>
+ *           Joachim Fenkes <fenkes at de.ibm.com>
  *           Reinhard Ernst <rernst at de.ibm.com>
  *
  *  Copyright (c) 2005 IBM Corporation
@@ -413,17 +414,23 @@ post_send_exit0:
 	return ret;
 }
 
-int ehca_post_recv(struct ib_qp *qp,
-		   struct ib_recv_wr *recv_wr,
-		   struct ib_recv_wr **bad_recv_wr)
+static int internal_post_recv(struct ehca_qp *my_qp,
+			      struct ib_device *dev,
+			      struct ib_recv_wr *recv_wr,
+			      struct ib_recv_wr **bad_recv_wr)
 {
-	struct ehca_qp *my_qp = container_of(qp, struct ehca_qp, ib_qp);
 	struct ib_recv_wr *cur_recv_wr;
 	struct ehca_wqe *wqe_p;
 	int wqe_cnt = 0;
 	int ret = 0;
 	unsigned long spl_flags;
 
+	if (unlikely(!HAS_RQ(my_qp))) {
+		ehca_err(dev, "QP has no RQ  ehca_qp=%p qp_num=%x ext_type=%d",
+			 my_qp, my_qp->real_qp_num, my_qp->ext_type);
+		return -ENODEV;
+	}
+
 	/* LOCK the QUEUE */
 	spin_lock_irqsave(&my_qp->spinlock_r, spl_flags);
 
@@ -439,8 +446,8 @@ int ehca_post_recv(struct ib_qp *qp,
 				*bad_recv_wr = cur_recv_wr;
 			if (wqe_cnt == 0) {
 				ret = -ENOMEM;
-				ehca_err(qp->device, "Too many posted WQEs "
-					 "qp_num=%x", qp->qp_num);
+				ehca_err(dev, "Too many posted WQEs "
+					 "qp_num=%x", my_qp->real_qp_num);
 			}
 			goto post_recv_exit0;
 		}
@@ -455,14 +462,14 @@ int ehca_post_recv(struct ib_qp *qp,
 			*bad_recv_wr = cur_recv_wr;
 			if (wqe_cnt == 0) {
 				ret = -EINVAL;
-				ehca_err(qp->device, "Could not write WQE "
-					 "qp_num=%x", qp->qp_num);
+				ehca_err(dev, "Could not write WQE "
+					 "qp_num=%x", my_qp->real_qp_num);
 			}
 			goto post_recv_exit0;
 		}
 		wqe_cnt++;
-		ehca_gen_dbg("ehca_qp=%p qp_num=%x wqe_cnt=%d",
-		     my_qp, qp->qp_num, wqe_cnt);
+		ehca_dbg(dev, "ehca_qp=%p qp_num=%x wqe_cnt=%d",
+			 my_qp, my_qp->real_qp_num, wqe_cnt);
 	} /* eof for cur_recv_wr */
 
 post_recv_exit0:
@@ -472,6 +479,22 @@ post_recv_exit0:
 	return ret;
 }
 
+int ehca_post_recv(struct ib_qp *qp,
+		   struct ib_recv_wr *recv_wr,
+		   struct ib_recv_wr **bad_recv_wr)
+{
+	return internal_post_recv(container_of(qp, struct ehca_qp, ib_qp),
+				  qp->device, recv_wr, bad_recv_wr);
+}
+
+int ehca_post_srq_recv(struct ib_srq *srq,
+		       struct ib_recv_wr *recv_wr,
+		       struct ib_recv_wr **bad_recv_wr)
+{
+	return internal_post_recv(container_of(srq, struct ehca_qp, ib_srq),
+				  srq->device, recv_wr, bad_recv_wr);
+}
+
 /*
  * ib_wc_opcode table converts ehca wc opcode to ib
  * Since we use zero to indicate invalid opcode, the actual ib opcode must
diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c
index 73db920..d8fe37d 100644
--- a/drivers/infiniband/hw/ehca/ehca_uverbs.c
+++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c
@@ -257,6 +257,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 	struct ehca_cq *cq;
 	struct ehca_qp *qp;
 	struct ehca_pd *pd;
+	struct ib_uobject *uobject;
 
 	switch (q_type) {
 	case  1: /* CQ */
@@ -304,7 +305,8 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 			return -ENOMEM;
 		}
 
-		if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context)
+		uobject = IS_SRQ(qp) ? qp->ib_srq.uobject : qp->ib_qp.uobject;
+		if (!uobject || uobject->context != context)
 			return -EINVAL;
 
 		ret = ehca_mmap_qp(vma, qp, rsrc_type);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 7efc4a2..b078377 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -5,6 +5,7 @@
  *
  *  Authors: Christoph Raisch <raisch at de.ibm.com>
  *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *           Joachim Fenkes <fenkes at de.ibm.com>
  *           Gerd Bayer <gerd.bayer at de.ibm.com>
  *           Waleri Fomin <fomin at de.ibm.com>
  *
@@ -62,6 +63,12 @@
 #define H_ALL_RES_QP_MAX_SEND_SGE       EHCA_BMASK_IBM(32, 39)
 #define H_ALL_RES_QP_MAX_RECV_SGE       EHCA_BMASK_IBM(40, 47)
 
+#define H_ALL_RES_QP_UD_AV_LKEY         EHCA_BMASK_IBM(32, 63)
+#define H_ALL_RES_QP_SRQ_QP_TOKEN       EHCA_BMASK_IBM(0, 31)
+#define H_ALL_RES_QP_SRQ_QP_HANDLE      EHCA_BMASK_IBM(0, 64)
+#define H_ALL_RES_QP_SRQ_LIMIT          EHCA_BMASK_IBM(48, 63)
+#define H_ALL_RES_QP_SRQ_QPN            EHCA_BMASK_IBM(40, 63)
+
 #define H_ALL_RES_QP_ACT_OUTST_SEND_WR  EHCA_BMASK_IBM(16, 31)
 #define H_ALL_RES_QP_ACT_OUTST_RECV_WR  EHCA_BMASK_IBM(48, 63)
 #define H_ALL_RES_QP_ACT_SEND_SGE       EHCA_BMASK_IBM(8, 15)
@@ -150,7 +157,7 @@ static long ehca_plpar_hcall9(unsigned long opcode,
 {
 	long ret;
 	int i, sleep_msecs, lock_is_set = 0;
-	unsigned long flags;
+	unsigned long flags = 0;
 
 	ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx "
 		     "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx",
@@ -282,8 +289,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 			     struct ehca_alloc_qp_parms *parms)
 {
 	u64 ret;
-	u64 allocate_controls;
-	u64 max_r10_reg;
+	u64 allocate_controls, max_r10_reg, r11, r12;
 	u64 outs[PLPAR_HCALL9_BUFSIZE];
 
 	allocate_controls =
@@ -309,6 +315,13 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE,
 				 parms->max_recv_sge);
 
+	r11 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QP_TOKEN, parms->srq_token);
+
+	if (parms->ext_type == EQPT_SRQ)
+		r12 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_LIMIT, parms->srq_limit);
+	else
+		r12 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QPN, parms->srq_qpn);
+
 	ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs,
 				adapter_handle.handle,	           /* r4  */
 				allocate_controls,	           /* r5  */
@@ -316,9 +329,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 				parms->recv_cq_handle.handle,
 				parms->eq_handle.handle,
 				((u64)parms->token << 32) | parms->pd.value,
-				max_r10_reg,	                   /* r10 */
-				parms->ud_av_l_key_ctl,            /* r11 */
-				0);
+				max_r10_reg, r11, r12);
 
 	parms->qp_handle.handle = outs[0];
 	parms->real_qp_num = (u32)outs[1];
diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h
index 9fe8367..d46b18c 100644
--- a/drivers/infiniband/hw/ehca/hipz_hw.h
+++ b/drivers/infiniband/hw/ehca/hipz_hw.h
@@ -163,6 +163,7 @@ struct hipz_qptemm {
 
 #define QPX_SQADDER EHCA_BMASK_IBM(48,63)
 #define QPX_RQADDER EHCA_BMASK_IBM(48,63)
+#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3,3)
 
 #define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x)
 
-- 
1.5.2
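
The H_ALL_RES_QP_* and QPX_* definitions in the patch above describe
register fields as IBM-style bit ranges, where bit 0 is the most
significant bit of a 64-bit word. As a rough illustration of that
numbering, here is a minimal userspace sketch. The helper names
ibm_field_mask(), ibm_field_set() and ibm_field_get() are hypothetical;
they are not the driver's EHCA_BMASK_* macros, whose definitions are not
shown in this thread.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only: a field covers IBM bits
 * from..to (inclusive) of a 64-bit word, with bit 0 being the MSB. */
static inline unsigned ibm_field_shift(unsigned to)
{
	return 63 - to;	/* distance of the field's last bit from the LSB */
}

static inline uint64_t ibm_field_mask(unsigned from, unsigned to)
{
	unsigned len = to - from + 1;

	/* a full 64-bit field is special-cased: shifting by 64 is undefined */
	return len >= 64 ? ~0ULL : ((1ULL << len) - 1);
}

/* Place val into the field from..to of an otherwise zero word. */
static inline uint64_t ibm_field_set(unsigned from, unsigned to, uint64_t val)
{
	return (val & ibm_field_mask(from, to)) << ibm_field_shift(to);
}

/* Extract the field from..to out of reg. */
static inline uint64_t ibm_field_get(unsigned from, unsigned to, uint64_t reg)
{
	return (reg >> ibm_field_shift(to)) & ibm_field_mask(from, to);
}
```

With these helpers, ibm_field_set(48, 63, limit) places a 16-bit value in
the low half-word, which is the range the patch assigns to
H_ALL_RES_QP_SRQ_LIMIT, and ibm_field_set(3, 3, 1) sets the single bit
used for QPX_AAELOG_RESET_SRQ_LIMIT.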


From amitk at mellanox.co.il  Mon Jul  9 06:27:35 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Mon, 9 Jul 2007 16:27:35 +0300
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
References: <1183640246.4377.436639.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>

Hi Hal,

In such a case, OpenSM should first check that the OPVL fields of the
two ports (the one that sends the traps and its peer) are identical.
If the OPVL fields are mismatched, the link watchdog mechanism
will retrain the logical link at a high rate.

Amit


-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Thursday, July 05, 2007 3:58 PM
To: general at lists.openfabrics.org
Cc: Eitan Zahavi; Yevgeny Kliteynik
Subject: [PATCH] OpenSM handling of "Babbling" Ports

A "babbling" port is a port which causes traps to be generated
frequently.
The traps may be generated directly by "this" port, or the peer port
may detect the issue, in which case the SMA on switch port 0 generates
the traps.
So far this has only been observed for trap 131, but it also applies to
traps 129 and 130, which are similar urgent traps.

Note that there appears to be a bug in Mellanox firmware, for at least
Anafa-2 and Tavor, which causes the max trap rate not to be adhered to.
Relief for this does not appear to be in sight in the short term.

Policy
When a babbling port is detected, OpenSM will disable the port or its
peer switch port (depending on which trap), which should terminate the
trap storm.

Detection
250 consecutive traps of this type will be used as the (initial)
threshold. The reason is to avoid prematurely detecting this condition
and disabling a port.

Recovery
The admin would re-enable the port when it is OK again. (This usually
involves rebooting the node causing the trap to be indicated.)
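
The detection and policy rules above can be sketched as a small counter.
This is only an illustration under assumptions: struct babble_state and
babble_should_disable() are hypothetical names, whereas the patch itself
compares an existing num_received value against 250.

```c
#include <stdbool.h>
#include <stdint.h>

#define BABBLE_THRESHOLD 250	/* initial threshold from the description */

/* Hypothetical per-port state: last trap seen and its run length. */
struct babble_state {
	uint16_t last_trap;	/* 129, 130 or 131 */
	unsigned consecutive;	/* consecutive traps of that type */
};

/* Record one trap; return true once the port should be disabled. */
static bool babble_should_disable(struct babble_state *st, uint16_t trap,
				  bool policy_enabled)
{
	if (trap < 129 || trap > 131) {
		st->consecutive = 0;	/* an unrelated trap breaks the run */
		return false;
	}
	if (trap == st->last_trap) {
		st->consecutive++;
	} else {
		st->last_trap = trap;	/* a different trap starts a new run */
		st->consecutive = 1;
	}
	/* the policy option gates the action, as babbling_port_policy does */
	return policy_enabled && st->consecutive >= BABBLE_THRESHOLD;
}
```

The threshold is kept as a single constant so that the "(initial)" value
can be tuned in one place.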

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index bedd63f..1150703 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
   boolean_t                honor_guid2lid_file;
   boolean_t                daemon;
   boolean_t                sm_inactive;
+  boolean_t                babbling_port_policy;
   osm_qos_options_t        qos_options;
   osm_qos_options_t        qos_ca_options;
   osm_qos_options_t        qos_sw0_options;
@@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
 *
 *	sm_inactive
 *		OpenSM will start with SM in not active state.
+*
+*	babbling_port_policy
+*		OpenSM will enforce its "babbling" port policy.
 *	
 *	perfmgr
 *		Enable or disable the performance manager
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 726b665..87b71e5 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -472,6 +472,7 @@ osm_subn_set_default_opt(
   p_opt->honor_guid2lid_file = FALSE;
   p_opt->daemon = FALSE;
   p_opt->sm_inactive = FALSE;
+  p_opt->babbling_port_policy = FALSE;
 #ifdef ENABLE_OSM_PERF_MGR
   p_opt->perfmgr = FALSE;
   p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
@@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
         "sm_inactive",
         p_key, p_val, &p_opts->sm_inactive);
 
+      __osm_subn_opts_unpack_boolean(
+        "babbling_port_policy",
+        p_key, p_val, &p_opts->babbling_port_policy);
+
 #ifdef ENABLE_OSM_PERF_MGR
       __osm_subn_opts_unpack_boolean(
         "perfmgr",
@@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
     "# Daemon mode\n"
     "daemon %s\n\n"
     "# SM Inactive\n"
-    "sm_inactive %s\n\n",
+    "sm_inactive %s\n\n"
+    "# Babbling Port Policy\n"
+    "babbling_port_policy %s\n\n",
     p_opts->daemon ? "TRUE" : "FALSE",
-    p_opts->sm_inactive ? "TRUE" : "FALSE"
+    p_opts->sm_inactive ? "TRUE" : "FALSE",
+    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
     );
 
 #ifdef ENABLE_OSM_PERF_MGR
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 5900c51..fbb6dac 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
         }
         else
         {
+          /* When babbling port policy option is enabled and
+             Threshold for disabling a "babbling" port is exceeded */
+          if ( p_rcv->p_subn->opt.babbling_port_policy &&
+               num_received >= 250 )
+          {
+            uint8_t               payload[IB_SMP_DATA_SIZE];
+            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
+            const ib_port_info_t* p_old_pi;
+            osm_madw_context_t    context;
+
+            /* If trap 131, might want to disable peer port if available */
+            /* but peer port has been observed not to respond to SM requests */
+
+            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                     "__osm_trap_rcv_process_request: ERR 3810: "
+                     " Disabling physical port lid:0x%02X num:%u\n",
+                     cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
+                     p_ntci->data_details.ntc_129_131.port_num
+                     );
+
+            p_old_pi = &p_physp->port_info;
+            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
+
+            /* Set port to disabled/down */
+            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
+            ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
+
+            context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
+            context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
+            context.pi_context.set_method = TRUE;
+            context.pi_context.update_master_sm_base_lid = FALSE;
+            context.pi_context.light_sweep = FALSE;
+            context.pi_context.active_transition = FALSE;
+
+            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
+                                   osm_physp_get_dr_path_ptr( p_physp ),
+                                   payload,
+                                   sizeof(payload),
+                                   IB_MAD_ATTR_PORT_INFO,
+                                   cl_hton32(osm_physp_get_port_num( p_physp )),
+                                   CL_DISP_MSGID_NONE,
+                                  &context );
+
+            if( status == IB_SUCCESS )
+            {
+               goto Exit;
+            }
+            else
+            {
+               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                        "__osm_trap_rcv_process_request: ERR 3811: "
+                        "Request to set PortInfo failed\n" );
+            }
+          }
+
           osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
                    "__osm_trap_rcv_process_request: "
                    "Marking unhealthy physical port by lid:0x%02X num:%u\n",


From fenkes at de.ibm.com  Mon Jul  9 06:26:31 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:26:31 +0200
Subject: [ofa-general] [PATCH 05/13] IB/ehca: Support UD low latency QPs
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091526.31709.fenkes@de.ibm.com>

From: Stefan Roscher <stefan.roscher at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |   84 +++++++++++++++++++++++-----------
 1 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 9486a44..ffd1ce9 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -275,6 +275,11 @@ static inline void queue2resp(struct ipzu_queue_resp *resp,
 	resp->toggle_state = queue->toggle_state;
 }
 
+static inline int ll_qp_msg_size(int nr_sge)
+{
+	return 128 << nr_sge;
+}
+
 /*
  * init_qp_queue initializes/constructs r/squeue and registers queue pages.
  */
@@ -363,8 +368,6 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 				   struct ib_srq_init_attr *srq_init_attr,
 				   struct ib_udata *udata, int is_srq)
 {
-	static int da_rc_msg_size[] = { 128, 256, 512, 1024, 2048, 4096 };
-	static int da_ud_sq_msg_size[]={ 128, 384, 896, 1920, 3968 };
 	struct ehca_qp *my_qp;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
 	struct ehca_shca *shca = container_of(pd->device, struct ehca_shca,
@@ -396,6 +399,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 		parms.ll_comp_flags = qp_type & LLQP_COMP_MASK;
 	}
 	qp_type &= 0x1F;
+	init_attr->qp_type &= 0x1F;
 
 	/* handle SRQ base QPs */
 	if (init_attr->srq) {
@@ -435,23 +439,49 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (is_llqp && (qp_type != IB_QPT_RC && qp_type != IB_QPT_UD)) {
-		ehca_err(pd->device, "unsupported LL QP Type=%x", qp_type);
-		return ERR_PTR(-EINVAL);
-	} else if (is_llqp && qp_type == IB_QPT_RC &&
-		   (init_attr->cap.max_send_wr > 255 ||
-		    init_attr->cap.max_recv_wr > 255 )) {
-		ehca_err(pd->device, "Invalid Number of max_sq_wr=%x "
-			 "or max_rq_wr=%x for RC LLQP",
-			 init_attr->cap.max_send_wr,
-			 init_attr->cap.max_recv_wr);
-		return ERR_PTR(-EINVAL);
-	} else if (is_llqp && qp_type == IB_QPT_UD &&
-		 init_attr->cap.max_send_wr > 255) {
-		ehca_err(pd->device,
-			 "Invalid Number of max_send_wr=%x for UD QP_TYPE=%x",
-			 init_attr->cap.max_send_wr, qp_type);
-		return ERR_PTR(-EINVAL);
+	if (is_llqp) {
+		switch (qp_type) {
+		case IB_QPT_RC:
+			if ((init_attr->cap.max_send_wr > 255) ||
+			    (init_attr->cap.max_recv_wr > 255)) {
+				ehca_err(pd->device,
+					 "Invalid Number of max_sq_wr=%x "
+					 "or max_rq_wr=%x for RC LLQP",
+					 init_attr->cap.max_send_wr,
+					 init_attr->cap.max_recv_wr);
+				return ERR_PTR(-EINVAL);
+			}
+			break;
+		case IB_QPT_UD:
+			if (!EHCA_BMASK_GET(HCA_CAP_UD_LL_QP, shca->hca_cap)) {
+				ehca_err(pd->device, "UD LLQP not supported "
+					 "by this adapter");
+				return ERR_PTR(-ENOSYS);
+			}
+			if (!(init_attr->cap.max_send_sge <= 5
+			    && init_attr->cap.max_send_sge >= 1
+			    && init_attr->cap.max_recv_sge <= 5
+			    && init_attr->cap.max_recv_sge >= 1)) {
+				ehca_err(pd->device,
+					 "Invalid Number of max_send_sge=%x "
+					 "or max_recv_sge=%x for UD LLQP",
+					 init_attr->cap.max_send_sge,
+					 init_attr->cap.max_recv_sge);
+				return ERR_PTR(-EINVAL);
+			} else if (init_attr->cap.max_send_wr > 255) {
+				ehca_err(pd->device,
+					 "Invalid Number of "
+					 "max_send_wr=%x for UD QP_TYPE=%x",
+					 init_attr->cap.max_send_wr, qp_type);
+				return ERR_PTR(-EINVAL);
+			}
+			break;
+		default:
+			ehca_err(pd->device, "unsupported LL QP Type=%x",
+				 qp_type);
+			return ERR_PTR(-EINVAL);
+			break;
+		}
 	}
 
 	if (pd->uobject && udata)
@@ -509,7 +539,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 	/* UD_AV CIRCUMVENTION */
 	max_send_sge = init_attr->cap.max_send_sge;
 	max_recv_sge = init_attr->cap.max_recv_sge;
-	if (parms.servicetype == ST_UD) {
+	if (parms.servicetype == ST_UD && !is_llqp) {
 		max_send_sge += 2;
 		max_recv_sge += 2;
 	}
@@ -547,8 +577,8 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 			rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
 					     (parms.act_nr_recv_sges)]);
 		} else { /* for LLQP we need to use msg size, not wqe size */
-		        swqe_size = da_rc_msg_size[max_send_sge];
-			rwqe_size = da_rc_msg_size[max_recv_sge];
+			swqe_size = ll_qp_msg_size(max_send_sge);
+			rwqe_size = ll_qp_msg_size(max_recv_sge);
 			parms.act_nr_send_sges = 1;
 			parms.act_nr_recv_sges = 1;
 		}
@@ -563,15 +593,15 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 	case IB_QPT_UD:
 	case IB_QPT_GSI:
 	case IB_QPT_SMI:
-		/* UD circumvention */
-		parms.act_nr_recv_sges -= 2;
-		parms.act_nr_send_sges -= 2;
 		if (is_llqp) {
-		        swqe_size = da_ud_sq_msg_size[max_send_sge];
-			rwqe_size = da_rc_msg_size[max_recv_sge];
+			swqe_size = ll_qp_msg_size(parms.act_nr_send_sges);
+			rwqe_size = ll_qp_msg_size(parms.act_nr_recv_sges);
 			parms.act_nr_send_sges = 1;
 			parms.act_nr_recv_sges = 1;
 		} else {
+			/* UD circumvention */
+			parms.act_nr_send_sges -= 2;
+			parms.act_nr_recv_sges -= 2;
 			swqe_size = offsetof(struct ehca_wqe,
 					     u.ud_av.sg_list[parms.act_nr_send_sges]);
 			rwqe_size = offsetof(struct ehca_wqe,
-- 
1.5.2
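
Patch 05 above replaces the da_rc_msg_size[] lookup table with the shift
in ll_qp_msg_size(). The following userspace check is a sketch rather
than driver code (ll_qp_msg_size_matches_table() is a hypothetical
helper); it confirms that the shift reproduces the six entries of the
removed RC table. The removed da_ud_sq_msg_size[] table held different
values (128, 384, 896, ...), so UD LL send sizes may change with this
patch.

```c
/* Copy of the helper added by the patch. */
static inline int ll_qp_msg_size(int nr_sge)
{
	return 128 << nr_sge;
}

/* Copy of the RC table that the patch removes. */
static const int da_rc_msg_size[] = { 128, 256, 512, 1024, 2048, 4096 };

/* Return 1 if the shift reproduces every entry of the old RC table. */
static int ll_qp_msg_size_matches_table(void)
{
	int n = sizeof(da_rc_msg_size) / sizeof(da_rc_msg_size[0]);
	int i;

	for (i = 0; i < n; i++)
		if (ll_qp_msg_size(i) != da_rc_msg_size[i])
			return 0;
	return 1;
}
```

Unlike the table, the shift has no array bound, so the only limit on
nr_sge is whatever range check the caller has already applied.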


From fenkes at de.ibm.com  Mon Jul  9 06:27:13 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:27:13 +0200
Subject: [ofa-general] [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all
	non-LL UD QPs on eHCA2
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091527.14272.fenkes@de.ibm.com>

From: Stefan Roscher <stefan.roscher at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index ffd1ce9..cbb8b5b 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -1054,6 +1054,17 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 		 "ehca_qp=%p qp_num=%x <VALID STATE CHANGE> qp_state_xsit=%x",
 		 my_qp, ibqp->qp_num, statetrans);
 
+	/* eHCA2 rev2 and higher require the SEND_GRH_FLAG to be set
+	 * in non-LL UD QPs.
+	 */
+	if ((my_qp->qp_type == IB_QPT_UD) &&
+	    (my_qp->ext_type != EQPT_LLQP) &&
+	    (statetrans == IB_QPST_INIT2RTR) &&
+	    (shca->hw_level >= 0x22)) {
+		update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1);
+		mqpcb->send_grh_flag = 1;
+	}
+
 	/* sqe -> rts: set purge bit of bad wqe before actual trans */
 	if ((my_qp->qp_type == IB_QPT_UD ||
 	     my_qp->qp_type == IB_QPT_GSI ||
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:29:03 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:29:03 +0200
Subject: [ofa-general] [PATCH 08/13] IB/ehca: Lock renaming,
	static initializers
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091529.04073.fenkes@de.ibm.com>

- Renamed all spinlock flags to "flags", matching the vast majority of kernel
  code.
- Moved hcall_lock into the only module it's used in.
- Replaced spin_lock_init() and friends with static initializers for
  global variables.
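
The last item can be illustrated outside the kernel. The sketch below
uses POSIX threads as a stand-in, which is an assumption made to keep the
example runnable in userspace; the patch itself uses DEFINE_SPINLOCK,
DEFINE_IDR and LIST_HEAD. The names demo_list_lock, demo_count and
demo_register() are hypothetical. A statically initialized lock is valid
before any init function runs, so no init call has to be ordered ahead of
its first use.

```c
#include <pthread.h>

/* Statically initialized: valid as soon as the program is loaded. */
static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_count;	/* stand-in for the data the lock protects */

/* No separate setup step is needed before the first call. */
static int demo_register(void)
{
	int n;

	pthread_mutex_lock(&demo_list_lock);
	n = ++demo_count;
	pthread_mutex_unlock(&demo_list_lock);
	return n;
}
```

This mirrors why the spin_lock_init() and INIT_LIST_HEAD() calls could be
dropped from ehca_module_init() in the diff below.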

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |    1 -
 drivers/infiniband/hw/ehca/ehca_cq.c      |   12 ++++++------
 drivers/infiniband/hw/ehca/ehca_main.c    |   18 ++++--------------
 drivers/infiniband/hw/ehca/ehca_qp.c      |    6 +++---
 drivers/infiniband/hw/ehca/ehca_reqs.c    |   24 ++++++++++++------------
 drivers/infiniband/hw/ehca/hcp_if.c       |    2 ++
 6 files changed, 27 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 9d689ae..3550047 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -295,7 +295,6 @@ void ehca_cleanup_mrmw_cache(void);
 
 extern spinlock_t ehca_qp_idr_lock;
 extern spinlock_t ehca_cq_idr_lock;
-extern spinlock_t hcall_lock;
 extern struct idr ehca_qp_idr;
 extern struct idr ehca_cq_idr;
 
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 67f0670..94bad27 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -56,11 +56,11 @@ int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp)
 {
 	unsigned int qp_num = qp->real_qp_num;
 	unsigned int key = qp_num & (QP_HASHTAB_LEN-1);
-	unsigned long spl_flags;
+	unsigned long flags;
 
-	spin_lock_irqsave(&cq->spinlock, spl_flags);
+	spin_lock_irqsave(&cq->spinlock, flags);
 	hlist_add_head(&qp->list_entries, &cq->qp_hashtab[key]);
-	spin_unlock_irqrestore(&cq->spinlock, spl_flags);
+	spin_unlock_irqrestore(&cq->spinlock, flags);
 
 	ehca_dbg(cq->ib_cq.device, "cq_num=%x real_qp_num=%x",
 		 cq->cq_number, qp_num);
@@ -74,9 +74,9 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num)
 	unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1);
 	struct hlist_node *iter;
 	struct ehca_qp *qp;
-	unsigned long spl_flags;
+	unsigned long flags;
 
-	spin_lock_irqsave(&cq->spinlock, spl_flags);
+	spin_lock_irqsave(&cq->spinlock, flags);
 	hlist_for_each(iter, &cq->qp_hashtab[key]) {
 		qp = hlist_entry(iter, struct ehca_qp, list_entries);
 		if (qp->real_qp_num == real_qp_num) {
@@ -88,7 +88,7 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num)
 			break;
 		}
 	}
-	spin_unlock_irqrestore(&cq->spinlock, spl_flags);
+	spin_unlock_irqrestore(&cq->spinlock, flags);
 	if (ret)
 		ehca_err(cq->ib_cq.device,
 			 "qp not found cq_num=%x real_qp_num=%x",
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 9bd749c..77db890 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -96,15 +96,13 @@ MODULE_PARM_DESC(static_rate,
 MODULE_PARM_DESC(scaling_code,
 		 "set scaling code (0: disabled/default, 1: enabled)");
 
-spinlock_t ehca_qp_idr_lock;
-spinlock_t ehca_cq_idr_lock;
-spinlock_t hcall_lock;
+DEFINE_SPINLOCK(ehca_qp_idr_lock);
+DEFINE_SPINLOCK(ehca_cq_idr_lock);
 DEFINE_IDR(ehca_qp_idr);
 DEFINE_IDR(ehca_cq_idr);
 
-
-static struct list_head shca_list; /* list of all registered ehcas */
-static spinlock_t shca_list_lock;
+static LIST_HEAD(shca_list); /* list of all registered ehcas */
+static DEFINE_SPINLOCK(shca_list_lock);
 
 static struct timer_list poll_eqs_timer;
 
@@ -864,14 +862,6 @@ int __init ehca_module_init(void)
 
 	printk(KERN_INFO "eHCA Infiniband Device Driver "
 	       "(Rel.: SVNEHCA_0023)\n");
-	idr_init(&ehca_qp_idr);
-	idr_init(&ehca_cq_idr);
-	spin_lock_init(&ehca_qp_idr_lock);
-	spin_lock_init(&ehca_cq_idr_lock);
-	spin_lock_init(&hcall_lock);
-
-	INIT_LIST_HEAD(&shca_list);
-	spin_lock_init(&shca_list_lock);
 
 	if ((ret = ehca_create_comp_pool())) {
 		ehca_gen_err("Cannot create comp pool.");
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 989f75e..ac4ff26 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -933,7 +933,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 	u64 h_ret;
 	int bad_wqe_cnt = 0;
 	int squeue_locked = 0;
-	unsigned long spl_flags = 0;
+	unsigned long flags = 0;
 
 	/* do query_qp to obtain current attr values */
 	mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
@@ -1074,7 +1074,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 		if (!ibqp->uobject) {
 			struct ehca_wqe *wqe;
 			/* lock send queue */
-			spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
+			spin_lock_irqsave(&my_qp->spinlock_s, flags);
 			squeue_locked = 1;
 			/* mark next free wqe */
 			wqe = (struct ehca_wqe*)
@@ -1360,7 +1360,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 
 modify_qp_exit2:
 	if (squeue_locked) { /* this means: sqe -> rts */
-		spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags);
+		spin_unlock_irqrestore(&my_qp->spinlock_s, flags);
 		my_qp->sqerr_purgeflag = 1;
 	}
 
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b5664fa..73f0c06 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -363,10 +363,10 @@ int ehca_post_send(struct ib_qp *qp,
 	struct ehca_wqe *wqe_p;
 	int wqe_cnt = 0;
 	int ret = 0;
-	unsigned long spl_flags;
+	unsigned long flags;
 
 	/* LOCK the QUEUE */
-	spin_lock_irqsave(&my_qp->spinlock_s, spl_flags);
+	spin_lock_irqsave(&my_qp->spinlock_s, flags);
 
 	/* loop processes list of send reqs */
 	for (cur_send_wr = send_wr; cur_send_wr != NULL;
@@ -408,7 +408,7 @@ int ehca_post_send(struct ib_qp *qp,
 
 post_send_exit0:
 	/* UNLOCK the QUEUE */
-	spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags);
+	spin_unlock_irqrestore(&my_qp->spinlock_s, flags);
 	iosync(); /* serialize GAL register access */
 	hipz_update_sqa(my_qp, wqe_cnt);
 	return ret;
@@ -423,7 +423,7 @@ static int internal_post_recv(struct ehca_qp *my_qp,
 	struct ehca_wqe *wqe_p;
 	int wqe_cnt = 0;
 	int ret = 0;
-	unsigned long spl_flags;
+	unsigned long flags;
 
 	if (unlikely(!HAS_RQ(my_qp))) {
 		ehca_err(dev, "QP has no RQ  ehca_qp=%p qp_num=%x ext_type=%d",
@@ -432,7 +432,7 @@ static int internal_post_recv(struct ehca_qp *my_qp,
 	}
 
 	/* LOCK the QUEUE */
-	spin_lock_irqsave(&my_qp->spinlock_r, spl_flags);
+	spin_lock_irqsave(&my_qp->spinlock_r, flags);
 
 	/* loop processes list of send reqs */
 	for (cur_recv_wr = recv_wr; cur_recv_wr != NULL;
@@ -473,7 +473,7 @@ static int internal_post_recv(struct ehca_qp *my_qp,
 	} /* eof for cur_recv_wr */
 
 post_recv_exit0:
-	spin_unlock_irqrestore(&my_qp->spinlock_r, spl_flags);
+	spin_unlock_irqrestore(&my_qp->spinlock_r, flags);
 	iosync(); /* serialize GAL register access */
 	hipz_update_rqa(my_qp, wqe_cnt);
 	return ret;
@@ -536,7 +536,7 @@ poll_cq_one_read_cqe:
 	if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) {
 		struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number);
 		int purgeflag;
-		unsigned long spl_flags;
+		unsigned long flags;
 		if (!qp) {
 			ehca_err(cq->device, "cq_num=%x qp_num=%x "
 				 "could not find qp -> ignore cqe",
@@ -546,9 +546,9 @@ poll_cq_one_read_cqe:
 			/* ignore this purged cqe */
 			goto poll_cq_one_read_cqe;
 		}
-		spin_lock_irqsave(&qp->spinlock_s, spl_flags);
+		spin_lock_irqsave(&qp->spinlock_s, flags);
 		purgeflag = qp->sqerr_purgeflag;
-		spin_unlock_irqrestore(&qp->spinlock_s, spl_flags);
+		spin_unlock_irqrestore(&qp->spinlock_s, flags);
 
 		if (purgeflag) {
 			ehca_dbg(cq->device, "Got CQE with purged bit qp_num=%x "
@@ -633,7 +633,7 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc)
 	int nr;
 	struct ib_wc *current_wc = wc;
 	int ret = 0;
-	unsigned long spl_flags;
+	unsigned long flags;
 
 	if (num_entries < 1) {
 		ehca_err(cq->device, "Invalid num_entries=%d ehca_cq=%p "
@@ -642,14 +642,14 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc)
 		goto poll_cq_exit0;
 	}
 
-	spin_lock_irqsave(&my_cq->spinlock, spl_flags);
+	spin_lock_irqsave(&my_cq->spinlock, flags);
 	for (nr = 0; nr < num_entries; nr++) {
 		ret = ehca_poll_cq_one(cq, current_wc);
 		if (ret)
 			break;
 		current_wc++;
 	} /* eof for nr */
-	spin_unlock_irqrestore(&my_cq->spinlock, spl_flags);
+	spin_unlock_irqrestore(&my_cq->spinlock, flags);
 	if (ret == -EAGAIN  || !ret)
 		ret = nr;
 
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index b078377..5b927a6 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -81,6 +81,8 @@
 #define H_MP_SHUTDOWN                   EHCA_BMASK_IBM(48, 48)
 #define H_MP_RESET_QKEY_CTR             EHCA_BMASK_IBM(49, 49)
 
+DEFINE_SPINLOCK(hcall_lock);
+
 static u32 get_longbusy_msecs(int longbusy_rc)
 {
 	switch (longbusy_rc) {
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:28:18 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:28:18 +0200
Subject: [ofa-general] [PATCH 07/13] IB/ehca: Report RDMA atomic attributes
	in query_qp()
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091528.19168.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index cbb8b5b..989f75e 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -1491,6 +1491,9 @@ int ehca_query_qp(struct ib_qp *qp,
 	qp_attr->alt_port_num = qpcb->alt_phys_port;
 	qp_attr->alt_timeout = qpcb->timeout_al;
 
+	qp_attr->max_dest_rd_atomic = qpcb->rdma_nr_atomic_resp_res;
+	qp_attr->max_rd_atomic = qpcb->rdma_atomic_outst_dest_qp;
+
 	/* primary av */
 	qp_attr->ah_attr.sl = qpcb->service_level;
 
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:30:39 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:30:39 +0200
Subject: [ofa-general] [PATCH 09/13] IB/ehca: Refactor synchronization
	between completions and destroy_cq using atomic_t
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091530.40581.fenkes@de.ibm.com>

- ehca_cq.nr_events is made an atomic_t, eliminating a lot of locking.
- The CQ is removed from the CQ idr first now to make sure no more
  completions are scheduled on that CQ. The "wait for all completions to
  end" code becomes much simpler this way.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
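The ordering described in the commit message above (unregister the CQ first, then wait for the event counter to drain) can be sketched in userspace C11. This is only an illustration under stated assumptions: every `demo_` name is hypothetical, a plain array stands in for the CQ idr, a flag stands in for wake_up(), and the idr lock that the driver holds around the lookup is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct ehca_cq. */
struct demo_cq {
	atomic_int nr_events; /* events currently in flight */
	bool wake_requested;  /* stands in for wake_up(&cq->wait_completion) */
};

#define DEMO_TABLE_LEN 4
/* Stands in for ehca_cq_idr; indexed by token. */
static struct demo_cq *demo_table[DEMO_TABLE_LEN];

/*
 * Event side, start: look the CQ up and count the event only if the CQ
 * is still registered. The driver does this under ehca_cq_idr_lock.
 */
static struct demo_cq *demo_event_begin(unsigned int token)
{
	struct demo_cq *cq = token < DEMO_TABLE_LEN ? demo_table[token] : NULL;

	if (cq)
		atomic_fetch_add(&cq->nr_events, 1);
	return cq;
}

/*
 * Event side, end: drop the count. atomic_fetch_sub() returns the
 * previous value, so 1 means this caller brought the counter to zero,
 * which is the case atomic_dec_and_test() reports in the kernel.
 */
static void demo_event_end(struct demo_cq *cq)
{
	if (atomic_fetch_sub(&cq->nr_events, 1) == 1)
		cq->wake_requested = true;
}

/* Destroy side, step 1: unregister so no new event can find the CQ. */
static void demo_unregister(unsigned int token)
{
	assert(token < DEMO_TABLE_LEN);
	demo_table[token] = NULL;
}

/* Destroy side, step 2: the condition wait_event() would sleep on. */
static bool demo_cq_idle(struct demo_cq *cq)
{
	return atomic_load(&cq->nr_events) == 0;
}
```

Once demo_unregister() has run, demo_event_begin() can no longer raise the counter, so the counter can only fall; that is why a single wait on the idle condition is enough and the old recheck loop is not needed.
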
 drivers/infiniband/hw/ehca/ehca_classes.h |    4 +-
 drivers/infiniband/hw/ehca/ehca_cq.c      |   26 +++++++-------------
 drivers/infiniband/hw/ehca/ehca_irq.c     |   36 +++++++++++++---------------
 drivers/infiniband/hw/ehca/ehca_irq.h     |    1 -
 drivers/infiniband/hw/ehca/ehca_tools.h   |    1 +
 5 files changed, 29 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 3550047..8580f2a 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -174,8 +174,8 @@ struct ehca_cq {
 	spinlock_t cb_lock;
 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
 	struct list_head entry;
-	u32 nr_callbacks; /* #events assigned to cpu by scaling code */
-	u32 nr_events;    /* #events seen */
+	u32 nr_callbacks;   /* #events assigned to cpu by scaling code */
+	atomic_t nr_events; /* #events seen */
 	wait_queue_head_t wait_completion;
 	spinlock_t task_lock;
 	u32 ownpid;
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 94bad27..3729997 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -146,6 +146,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	spin_lock_init(&my_cq->spinlock);
 	spin_lock_init(&my_cq->cb_lock);
 	spin_lock_init(&my_cq->task_lock);
+	atomic_set(&my_cq->nr_events, 0);
 	init_waitqueue_head(&my_cq->wait_completion);
 	my_cq->ownpid = current->tgid;
 
@@ -303,16 +304,6 @@ create_cq_exit1:
 	return cq;
 }
 
-static int get_cq_nr_events(struct ehca_cq *my_cq)
-{
-	int ret;
-	unsigned long flags;
-	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
-	ret = my_cq->nr_events;
-	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-	return ret;
-}
-
 int ehca_destroy_cq(struct ib_cq *cq)
 {
 	u64 h_ret;
@@ -339,17 +330,18 @@ int ehca_destroy_cq(struct ib_cq *cq)
 		}
 	}
 
+	/*
+	 * remove the CQ from the idr first to make sure
+	 * no more interrupt tasklets will touch this CQ
+	 */
 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
-	while (my_cq->nr_events) {
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-		wait_event(my_cq->wait_completion, !get_cq_nr_events(my_cq));
-		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
-		/* recheck nr_events to assure no cqe has just arrived */
-	}
-
 	idr_remove(&ehca_cq_idr, my_cq->token);
 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
+	/* now wait until all pending events have completed */
+	wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events));
+
+	/* nobody's using our CQ any longer -- we can destroy it */
 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
 	if (h_ret == H_R_STATE) {
 		/* cq in err: read err data and destroy it forcibly */
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 100329b..3e790a3 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -5,6 +5,8 @@
  *
  *  Authors: Heiko J Schick <schickhj at de.ibm.com>
  *           Khadija Souissi <souissi at de.ibm.com>
+ *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *           Joachim Fenkes <fenkes at de.ibm.com>
  *
  *  Copyright (c) 2005 IBM Corporation
  *
@@ -212,6 +214,8 @@ static void cq_event_callback(struct ehca_shca *shca,
 
 	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
 	cq = idr_find(&ehca_cq_idr, token);
+	if (cq)
+		atomic_inc(&cq->nr_events);
 	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
 	if (!cq)
@@ -219,6 +223,9 @@ static void cq_event_callback(struct ehca_shca *shca,
 
 	ehca_error_data(shca, cq, cq->ipz_cq_handle.handle);
 
+	if (atomic_dec_and_test(&cq->nr_events))
+		wake_up(&cq->wait_completion);
+
 	return;
 }
 
@@ -414,25 +421,22 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe)
 		token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value);
 		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
 		cq = idr_find(&ehca_cq_idr, token);
+		if (cq)
+			atomic_inc(&cq->nr_events);
+		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 		if (cq == NULL) {
-			spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 			ehca_err(&shca->ib_device,
 				 "Invalid eqe for non-existing cq token=%x",
 				 token);
 			return;
 		}
 		reset_eq_pending(cq);
-		cq->nr_events++;
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 		if (ehca_scaling_code)
 			queue_comp_task(cq);
 		else {
 			comp_event_callback(cq);
-			spin_lock_irqsave(&ehca_cq_idr_lock, flags);
-			cq->nr_events--;
-			if (!cq->nr_events)
+			if (atomic_dec_and_test(&cq->nr_events))
 				wake_up(&cq->wait_completion);
-			spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 		}
 	} else {
 		ehca_dbg(&shca->ib_device, "Got non completion event");
@@ -478,15 +482,15 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 			token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value);
 			spin_lock(&ehca_cq_idr_lock);
 			eqe_cache[eqe_cnt].cq = idr_find(&ehca_cq_idr, token);
+			if (eqe_cache[eqe_cnt].cq)
+				atomic_inc(&eqe_cache[eqe_cnt].cq->nr_events);
+			spin_unlock(&ehca_cq_idr_lock);
 			if (!eqe_cache[eqe_cnt].cq) {
-				spin_unlock(&ehca_cq_idr_lock);
 				ehca_err(&shca->ib_device,
 					 "Invalid eqe for non-existing cq "
 					 "token=%x", token);
 				continue;
 			}
-			eqe_cache[eqe_cnt].cq->nr_events++;
-			spin_unlock(&ehca_cq_idr_lock);
 		} else
 			eqe_cache[eqe_cnt].cq = NULL;
 		eqe_cnt++;
@@ -517,11 +521,8 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 			else {
 				struct ehca_cq *cq = eq->eqe_cache[i].cq;
 				comp_event_callback(cq);
-				spin_lock(&ehca_cq_idr_lock);
-				cq->nr_events--;
-				if (!cq->nr_events)
+				if (atomic_dec_and_test(&cq->nr_events))
 					wake_up(&cq->wait_completion);
-				spin_unlock(&ehca_cq_idr_lock);
 			}
 		} else {
 			ehca_dbg(&shca->ib_device, "Got non completion event");
@@ -621,13 +622,10 @@ static void run_comp_task(struct ehca_cpu_comp_task* cct)
 	while (!list_empty(&cct->cq_list)) {
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 		spin_unlock_irqrestore(&cct->task_lock, flags);
-		comp_event_callback(cq);
 
-		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
-		cq->nr_events--;
-		if (!cq->nr_events)
+		comp_event_callback(cq);
+		if (atomic_dec_and_test(&cq->nr_events))
 			wake_up(&cq->wait_completion);
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
 		spin_lock_irqsave(&cct->task_lock, flags);
 		spin_lock(&cq->task_lock);
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h
index 6ed06ee..3346cb0 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.h
+++ b/drivers/infiniband/hw/ehca/ehca_irq.h
@@ -47,7 +47,6 @@ struct ehca_shca;
 
 #include <linux/interrupt.h>
 #include <linux/types.h>
-#include <asm/atomic.h>
 
 int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h
index 973c4b5..03b185f 100644
--- a/drivers/infiniband/hw/ehca/ehca_tools.h
+++ b/drivers/infiniband/hw/ehca/ehca_tools.h
@@ -59,6 +59,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 
+#include <asm/atomic.h>
 #include <asm/abs_addr.h>
 #include <asm/ibmebus.h>
 #include <asm/io.h>
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:31:10 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:31:10 +0200
Subject: [ofa-general] [PATCH 10/13] IB/ehca: Change idr spinlocks into
	rwlocks
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091531.11544.fenkes@de.ibm.com>

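Interrupt handlers only ever look up entries in the idrs, so with rwlocks
they take the lock as readers and no longer contend with each other.
Because there are no IRQ writers, readers also no longer need to disable
IRQs around idr_find.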

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |    4 ++--
 drivers/infiniband/hw/ehca/ehca_cq.c      |   12 ++++++------
 drivers/infiniband/hw/ehca/ehca_irq.c     |   19 ++++++++-----------
 drivers/infiniband/hw/ehca/ehca_main.c    |    4 ++--
 drivers/infiniband/hw/ehca/ehca_qp.c      |   12 ++++++------
 drivers/infiniband/hw/ehca/ehca_uverbs.c  |    9 ++++-----
 6 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 8580f2a..f1e0db2 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -293,8 +293,8 @@ void ehca_cleanup_av_cache(void);
 int ehca_init_mrmw_cache(void);
 void ehca_cleanup_mrmw_cache(void);
 
-extern spinlock_t ehca_qp_idr_lock;
-extern spinlock_t ehca_cq_idr_lock;
+extern rwlock_t ehca_qp_idr_lock;
+extern rwlock_t ehca_cq_idr_lock;
 extern struct idr ehca_qp_idr;
 extern struct idr ehca_cq_idr;
 
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 3729997..01d4a14 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -163,9 +163,9 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			goto create_cq_exit1;
 		}
 
-		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+		write_lock_irqsave(&ehca_cq_idr_lock, flags);
 		ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token);
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+		write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
 	} while (ret == -EAGAIN);
 
@@ -294,9 +294,9 @@ create_cq_exit3:
 			 "cq_num=%x h_ret=%lx", my_cq, my_cq->cq_number, h_ret);
 
 create_cq_exit2:
-	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
 	idr_remove(&ehca_cq_idr, my_cq->token);
-	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
 create_cq_exit1:
 	kmem_cache_free(cq_cache, my_cq);
@@ -334,9 +334,9 @@ int ehca_destroy_cq(struct ib_cq *cq)
 	 * remove the CQ from the idr first to make sure
 	 * no more interrupt tasklets will touch this CQ
 	 */
-	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
 	idr_remove(&ehca_cq_idr, my_cq->token);
-	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
 	/* now wait until all pending events have completed */
 	wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events));
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 3e790a3..02b73c8 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -180,12 +180,11 @@ static void qp_event_callback(struct ehca_shca *shca,
 {
 	struct ib_event event;
 	struct ehca_qp *qp;
-	unsigned long flags;
 	u32 token = EHCA_BMASK_GET(EQE_QP_TOKEN, eqe);
 
-	spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+	read_lock(&ehca_qp_idr_lock);
 	qp = idr_find(&ehca_qp_idr, token);
-	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	read_unlock(&ehca_qp_idr_lock);
 
 
 	if (!qp)
@@ -209,14 +208,13 @@ static void cq_event_callback(struct ehca_shca *shca,
 			      u64 eqe)
 {
 	struct ehca_cq *cq;
-	unsigned long flags;
 	u32 token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe);
 
-	spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+	read_lock(&ehca_cq_idr_lock);
 	cq = idr_find(&ehca_cq_idr, token);
 	if (cq)
 		atomic_inc(&cq->nr_events);
-	spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	read_unlock(&ehca_cq_idr_lock);
 
 	if (!cq)
 		return;
@@ -411,7 +409,6 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe)
 {
 	u64 eqe_value;
 	u32 token;
-	unsigned long flags;
 	struct ehca_cq *cq;
 
 	eqe_value = eqe->entry;
@@ -419,11 +416,11 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe)
 	if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) {
 		ehca_dbg(&shca->ib_device, "Got completion event");
 		token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value);
-		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+		read_lock(&ehca_cq_idr_lock);
 		cq = idr_find(&ehca_cq_idr, token);
 		if (cq)
 			atomic_inc(&cq->nr_events);
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+		read_unlock(&ehca_cq_idr_lock);
 		if (cq == NULL) {
 			ehca_err(&shca->ib_device,
 				 "Invalid eqe for non-existing cq token=%x",
@@ -480,11 +477,11 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 		eqe_value = eqe_cache[eqe_cnt].eqe->entry;
 		if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) {
 			token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value);
-			spin_lock(&ehca_cq_idr_lock);
+			read_lock(&ehca_cq_idr_lock);
 			eqe_cache[eqe_cnt].cq = idr_find(&ehca_cq_idr, token);
 			if (eqe_cache[eqe_cnt].cq)
 				atomic_inc(&eqe_cache[eqe_cnt].cq->nr_events);
-			spin_unlock(&ehca_cq_idr_lock);
+			read_unlock(&ehca_cq_idr_lock);
 			if (!eqe_cache[eqe_cnt].cq) {
 				ehca_err(&shca->ib_device,
 					 "Invalid eqe for non-existing cq "
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 77db890..e58e821 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -96,8 +96,8 @@ MODULE_PARM_DESC(static_rate,
 MODULE_PARM_DESC(scaling_code,
 		 "set scaling code (0: disabled/default, 1: enabled)");
 
-DEFINE_SPINLOCK(ehca_qp_idr_lock);
-DEFINE_SPINLOCK(ehca_cq_idr_lock);
+DEFINE_RWLOCK(ehca_qp_idr_lock);
+DEFINE_RWLOCK(ehca_cq_idr_lock);
 DEFINE_IDR(ehca_qp_idr);
 DEFINE_IDR(ehca_cq_idr);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index ac4ff26..7452ef4 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -512,9 +512,9 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 			goto create_qp_exit0;
 		}
 
-		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+		write_lock_irqsave(&ehca_qp_idr_lock, flags);
 		ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token);
-		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+		write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
 	} while (ret == -EAGAIN);
 
@@ -733,9 +733,9 @@ create_qp_exit2:
 	hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
 
 create_qp_exit1:
-	spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);
 	idr_remove(&ehca_qp_idr, my_qp->token);
-	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
 create_qp_exit0:
 	kmem_cache_free(qp_cache, my_qp);
@@ -1706,9 +1706,9 @@ int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
 		}
 	}
 
-	spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);
 	idr_remove(&ehca_qp_idr, my_qp->token);
-	spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
 	if (h_ret != H_SUCCESS) {
diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c
index d8fe37d..3031b3b 100644
--- a/drivers/infiniband/hw/ehca/ehca_uverbs.c
+++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c
@@ -253,7 +253,6 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
 	u32 cur_pid = current->tgid;
 	u32 ret;
-	unsigned long flags;
 	struct ehca_cq *cq;
 	struct ehca_qp *qp;
 	struct ehca_pd *pd;
@@ -261,9 +260,9 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 
 	switch (q_type) {
 	case  1: /* CQ */
-		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+		read_lock(&ehca_cq_idr_lock);
 		cq = idr_find(&ehca_cq_idr, idr_handle);
-		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+		read_unlock(&ehca_cq_idr_lock);
 
 		/* make sure this mmap really belongs to the authorized user */
 		if (!cq)
@@ -289,9 +288,9 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 		break;
 
 	case 2: /* QP */
-		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+		read_lock(&ehca_qp_idr_lock);
 		qp = idr_find(&ehca_qp_idr, idr_handle);
-		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+		read_unlock(&ehca_qp_idr_lock);
 
 		/* make sure this mmap really belongs to the authorized user */
 		if (!qp)
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:31:53 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:31:53 +0200
Subject: [ofa-general] [PATCH 11/13] IB/ehca: return QP pointer in poll_cq(),
	add two unlikely() statements
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091531.54219.fenkes@de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_reqs.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index 73f0c06..fd3ba22 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -517,6 +517,7 @@ static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc)
 	int ret = 0;
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	struct ehca_cqe *cqe;
+	struct ehca_qp *my_qp;
 	int cqe_count = 0;
 
 poll_cq_one_read_cqe:
@@ -568,7 +569,7 @@ poll_cq_one_read_cqe:
 	}
 
 	/* tracing cqe */
-	if (ehca_debug_level) {
+	if (unlikely(ehca_debug_level)) {
 		ehca_dbg(cq->device,
 			 "Received COMPLETION ehca_cq=%p cq_num=%x -----",
 			 my_cq, my_cq->cq_number);
@@ -602,7 +603,11 @@ poll_cq_one_read_cqe:
 	} else
 		wc->status = IB_WC_SUCCESS;
 
-	wc->qp = NULL;
+	read_lock(&ehca_qp_idr_lock);
+	my_qp = idr_find(&ehca_qp_idr, cqe->qp_token);
+	wc->qp = &my_qp->ib_qp;
+	read_unlock(&ehca_qp_idr_lock);
+
 	wc->byte_len = cqe->nr_bytes_transferred;
 	wc->pkey_index = cqe->pkey_index;
 	wc->slid = cqe->rlid;
@@ -612,7 +617,7 @@ poll_cq_one_read_cqe:
 	wc->imm_data = cpu_to_be32(cqe->immediate_data);
 	wc->sl = cqe->service_level;
 
-	if (wc->status != IB_WC_SUCCESS)
+	if (unlikely(wc->status != IB_WC_SUCCESS))
 		ehca_dbg(cq->device,
 			 "ehca_cq=%p cq_num=%x WARNING unsuccessful cqe "
 			 "OPType=%x status=%x qp_num=%x src_qp=%x wr_id=%lx "
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:32:22 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:32:22 +0200
Subject: [ofa-general] [PATCH 12/13] IB/ehca: notify consumers of LID/PKEY/SM
	changes after nondisruptive events
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091532.23189.fenkes@de.ibm.com>

When firmware reports a nondisruptive port configuration change event,
previous versions of the eHCA driver didn't forward the event to consumers
like IPoIB. Add code that determines the type of configuration change by
comparing old and new port attributes and reports it.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
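The comparison of old and new port attributes described above can be sketched as a pure function that returns a bitmask of detected changes. The `demo_` names and the bitmask are assumptions made for illustration; the patch itself dispatches one ib_event per detected change. The clamp on the P_Key table length is also an addition of this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_PKEYS 16

/* Mirrors the fields of struct ehca_sma_attr with userspace types. */
struct demo_sma_attr {
	uint16_t lid, lmc, sm_sl, sm_lid;
	uint16_t pkey_tbl_len, pkeys[DEMO_MAX_PKEYS];
};

/* Hypothetical change flags; the driver reports IB_EVENT_* values instead. */
enum demo_change {
	DEMO_SM_CHANGE   = 1 << 0,
	DEMO_LID_CHANGE  = 1 << 1,
	DEMO_PKEY_CHANGE = 1 << 2,
};

static unsigned int demo_diff_attr(const struct demo_sma_attr *old_attr,
				   const struct demo_sma_attr *new_attr)
{
	unsigned int changes = 0;
	size_t n;

	assert(old_attr && new_attr);

	/* Assumption of this sketch: never compare beyond the array. */
	n = new_attr->pkey_tbl_len;
	if (n > DEMO_MAX_PKEYS)
		n = DEMO_MAX_PKEYS;

	/* Subnet manager moved or changed its service level. */
	if (new_attr->sm_sl != old_attr->sm_sl ||
	    new_attr->sm_lid != old_attr->sm_lid)
		changes |= DEMO_SM_CHANGE;

	/* The port's own address changed. */
	if (new_attr->lid != old_attr->lid ||
	    new_attr->lmc != old_attr->lmc)
		changes |= DEMO_LID_CHANGE;

	/* The table length or any valid entry changed. */
	if (new_attr->pkey_tbl_len != old_attr->pkey_tbl_len ||
	    memcmp(new_attr->pkeys, old_attr->pkeys, sizeof(uint16_t) * n))
		changes |= DEMO_PKEY_CHANGE;

	return changes;
}
```

A caller would save the new attributes after each comparison, as the patch does with saved_attr, so that the next event is compared against the most recent state.
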
 drivers/infiniband/hw/ehca/ehca_classes.h |    6 ++
 drivers/infiniband/hw/ehca/ehca_hca.c     |   34 +++++++++++
 drivers/infiniband/hw/ehca/ehca_irq.c     |   89 +++++++++++++++++++----------
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 +
 4 files changed, 101 insertions(+), 31 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index f1e0db2..daf823e 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -87,11 +87,17 @@ struct ehca_eq {
 	struct ehca_eqe_cache_entry eqe_cache[EHCA_EQE_CACHE_SIZE];
 };
 
+struct ehca_sma_attr {
+	u16 lid, lmc, sm_sl, sm_lid;
+	u16 pkey_tbl_len, pkeys[16];
+};
+
 struct ehca_sport {
 	struct ib_cq *ibcq_aqp1;
 	struct ib_qp *ibqp_aqp1;
 	enum ib_rate  rate;
 	enum ib_port_state port_state;
+	struct ehca_sma_attr saved_attr;
 };
 
 struct ehca_shca {
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index b310de5..bbd3c6a 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -193,6 +193,40 @@ query_port1:
 	return ret;
 }
 
+int ehca_query_sma_attr(struct ehca_shca *shca,
+			u8 port, struct ehca_sma_attr *attr)
+{
+	int ret = 0;
+	struct hipz_query_port *rblock;
+
+	rblock = ehca_alloc_fw_ctrlblock(GFP_ATOMIC);
+	if (!rblock) {
+		ehca_err(&shca->ib_device, "Can't allocate rblock memory.");
+		return -ENOMEM;
+	}
+
+	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+		ehca_err(&shca->ib_device, "Can't query port properties");
+		ret = -EINVAL;
+		goto query_sma_attr1;
+	}
+
+	memset(attr, 0, sizeof(struct ehca_sma_attr));
+
+	attr->lid    = rblock->lid;
+	attr->lmc    = rblock->lmc;
+	attr->sm_sl  = rblock->sm_sl;
+	attr->sm_lid = rblock->sm_lid;
+
+	attr->pkey_tbl_len = rblock->pkey_tbl_len;
+	memcpy(attr->pkeys, rblock->pkey_entries, sizeof(attr->pkeys));
+
+query_sma_attr1:
+	ehca_free_fw_ctrlblock(rblock);
+
+	return ret;
+}
+
 int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey)
 {
 	int ret = 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 02b73c8..96eba38 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -61,6 +61,7 @@
 #define NEQE_EVENT_CODE        EHCA_BMASK_IBM(2,7)
 #define NEQE_PORT_NUMBER       EHCA_BMASK_IBM(8,15)
 #define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16)
+#define NEQE_DISRUPTIVE        EHCA_BMASK_IBM(16,16)
 
 #define ERROR_DATA_LENGTH      EHCA_BMASK_IBM(52,63)
 #define ERROR_DATA_TYPE        EHCA_BMASK_IBM(0,7)
@@ -286,30 +287,61 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe)
 	return;
 }
 
-static void parse_ec(struct ehca_shca *shca, u64 eqe)
+static void dispatch_port_event(struct ehca_shca *shca, int port_num,
+				enum ib_event_type type, const char *msg)
 {
 	struct ib_event event;
+
+	ehca_info(&shca->ib_device, "port %d %s.", port_num, msg);
+	event.device = &shca->ib_device;
+	event.event = type;
+	event.element.port_num = port_num;
+	ib_dispatch_event(&event);
+}
+
+static void notify_port_conf_change(struct ehca_shca *shca, int port_num)
+{
+	struct ehca_sma_attr  new_attr;
+	struct ehca_sma_attr *old_attr = &shca->sport[port_num - 1].saved_attr;
+
+	ehca_query_sma_attr(shca, port_num, &new_attr);
+
+	if (new_attr.sm_sl  != old_attr->sm_sl ||
+	    new_attr.sm_lid != old_attr->sm_lid)
+		dispatch_port_event(shca, port_num, IB_EVENT_SM_CHANGE,
+				    "SM changed");
+
+	if (new_attr.lid != old_attr->lid ||
+	    new_attr.lmc != old_attr->lmc)
+		dispatch_port_event(shca, port_num, IB_EVENT_LID_CHANGE,
+				    "LID changed");
+
+	if (new_attr.pkey_tbl_len != old_attr->pkey_tbl_len ||
+	    memcmp(new_attr.pkeys, old_attr->pkeys,
+		   sizeof(u16) * new_attr.pkey_tbl_len))
+		dispatch_port_event(shca, port_num, IB_EVENT_PKEY_CHANGE,
+				    "P_Key changed");
+
+	*old_attr = new_attr;
+}
+
+static void parse_ec(struct ehca_shca *shca, u64 eqe)
+{
 	u8 ec   = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe);
 	u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe);
 
 	switch (ec) {
 	case 0x30: /* port availability change */
 		if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) {
-			ehca_info(&shca->ib_device,
-				  "port %x is active.", port);
-			event.device = &shca->ib_device;
-			event.event = IB_EVENT_PORT_ACTIVE;
-			event.element.port_num = port;
 			shca->sport[port - 1].port_state = IB_PORT_ACTIVE;
-			ib_dispatch_event(&event);
+			dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE,
+					    "is active");
+			ehca_query_sma_attr(shca, port,
+					    &shca->sport[port - 1].saved_attr);
 		} else {
-			ehca_info(&shca->ib_device,
-				  "port %x is inactive.", port);
-			event.device = &shca->ib_device;
-			event.event = IB_EVENT_PORT_ERR;
-			event.element.port_num = port;
 			shca->sport[port - 1].port_state = IB_PORT_DOWN;
-			ib_dispatch_event(&event);
+			dispatch_port_event(shca, port, IB_EVENT_PORT_ERR,
+					    "is inactive");
 		}
 		break;
 	case 0x31:
@@ -317,24 +349,19 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe)
 		 * disruptive change is caused by
 		 * LID, PKEY or SM change
 		 */
-		ehca_warn(&shca->ib_device,
-			  "disruptive port %x configuration change", port);
-
-		ehca_info(&shca->ib_device,
-			  "port %x is inactive.", port);
-		event.device = &shca->ib_device;
-		event.event = IB_EVENT_PORT_ERR;
-		event.element.port_num = port;
-		shca->sport[port - 1].port_state = IB_PORT_DOWN;
-		ib_dispatch_event(&event);
-
-		ehca_info(&shca->ib_device,
-			  "port %x is active.", port);
-		event.device = &shca->ib_device;
-		event.event = IB_EVENT_PORT_ACTIVE;
-		event.element.port_num = port;
-		shca->sport[port - 1].port_state = IB_PORT_ACTIVE;
-		ib_dispatch_event(&event);
+		if (EHCA_BMASK_GET(NEQE_DISRUPTIVE, eqe)) {
+			ehca_warn(&shca->ib_device, "disruptive port "
+				  "%d configuration change", port);
+
+			shca->sport[port - 1].port_state = IB_PORT_DOWN;
+			dispatch_port_event(shca, port, IB_EVENT_PORT_ERR,
+					    "is inactive");
+
+			shca->sport[port - 1].port_state = IB_PORT_ACTIVE;
+			dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE,
+					    "is active");
+		} else
+			notify_port_conf_change(shca, port);
 		break;
 	case 0x32: /* adapter malfunction */
 		ehca_err(&shca->ib_device, "Adapter malfunction.");
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index fd84a80..77aeca6 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -49,6 +49,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props);
 int ehca_query_port(struct ib_device *ibdev, u8 port,
 		    struct ib_port_attr *props);
 
+int ehca_query_sma_attr(struct ehca_shca *shca, u8 port,
+			struct ehca_sma_attr *attr);
+
 int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey);
 
 int ehca_query_gid(struct ib_device *ibdev, u8 port, int index,
-- 
1.5.2


From fenkes at de.ibm.com  Mon Jul  9 06:33:52 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:33:52 +0200
Subject: [ofa-general] [PATCH 13/13] IB/ehca: Improve latency by unlocking
	the SQ/RQ after triggering the hardware
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091533.53383.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_reqs.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index fd3ba22..61da65e 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -407,10 +407,9 @@ int ehca_post_send(struct ib_qp *qp,
 	} /* eof for cur_send_wr */
 
 post_send_exit0:
-	/* UNLOCK the QUEUE */
-	spin_unlock_irqrestore(&my_qp->spinlock_s, flags);
 	iosync(); /* serialize GAL register access */
 	hipz_update_sqa(my_qp, wqe_cnt);
+	spin_unlock_irqrestore(&my_qp->spinlock_s, flags);
 	return ret;
 }
 
@@ -473,9 +472,9 @@ static int internal_post_recv(struct ehca_qp *my_qp,
 	} /* eof for cur_recv_wr */
 
 post_recv_exit0:
-	spin_unlock_irqrestore(&my_qp->spinlock_r, flags);
 	iosync(); /* serialize GAL register access */
 	hipz_update_rqa(my_qp, wqe_cnt);
+	spin_unlock_irqrestore(&my_qp->spinlock_r, flags);
 	return ret;
 }
 
-- 
1.5.2


From xhejtman at ics.muni.cz  Mon Jul  9 06:37:43 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 15:37:43 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adad4z2184r.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
Message-ID: <20070709133743.GK3885@ics.muni.cz>

On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote:
> Is the memory given to a domU always physically contiguous?  If not,
> what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
> and allocate 256 KB or something like that.  Let's assume that the
> domU kernel has enough guest contiguous pages to satisfy the
> allocation -- is there any guarantee that the pages are really
> physically contiguous?

The driver in Dom0 started to work fine; I do not know why. In DomU, using some
debug prints, I found that dma_coherent memory is OK (contiguous pages), while
alloc_pages returns contiguous pages, but in reverse order:

ib_mthca 0000:08:00.0: Alloc pages starts
ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026693000, virt ffff880098b02000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026692000, virt ffff880098b03000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026691000, virt ffff880098b04000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026690000, virt ffff880098b05000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668f000, virt ffff880098b06000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668e000, virt ffff880098b07000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668d000, virt ffff880098b08000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668c000, virt ffff880098b09000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668b000, virt ffff880098b0a000
ib_mthca 0000:08:00.0: Page phys. addr 000000002668a000, virt ffff880098b0b000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026689000, virt ffff880098b0c000
ib_mthca 0000:08:00.0: Page phys. addr 0000000026688000, virt ffff880098b0d000
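
To check a log like the one above mechanically, here is a minimal userspace
sketch in C. It is an illustration only, not part of the mthca driver:
classify_run, enum run_order and PAGE_SIZE_BYTES are hypothetical names, and
the 4 KB page size is an assumption read off the address step in the log. It
reports whether a list of page addresses is contiguous in ascending order,
contiguous in descending order, or not contiguous at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KB pages, matching the 0x1000 step seen in the log. */
#define PAGE_SIZE_BYTES 0x1000ULL

enum run_order { RUN_ASCENDING, RUN_DESCENDING, RUN_NOT_CONTIGUOUS };

/*
 * Classify a run of page addresses.
 * A run of zero or one page is reported as RUN_ASCENDING.
 */
static enum run_order classify_run(const uint64_t *phys, size_t n)
{
    int ascending = 1;
    int descending = 1;
    size_t i;

    for (i = 1; i < n; i++) {
        /* Each page must follow its predecessor by exactly one page. */
        if (phys[i] != phys[i - 1] + PAGE_SIZE_BYTES)
            ascending = 0;
        /* Or precede it by exactly one page for a reversed run. */
        if (phys[i - 1] != phys[i] + PAGE_SIZE_BYTES)
            descending = 0;
    }

    if (ascending)
        return RUN_ASCENDING;
    if (descending)
        return RUN_DESCENDING;
    return RUN_NOT_CONTIGUOUS;
}
```

Fed the addresses from the log (0x26695000, 0x26694000, and so on), the
function classifies the run as RUN_DESCENDING.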


-- 
Lukáš Hejtmánek


From halr at voltaire.com  Mon Jul  9 06:42:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 09:42:53 -0400
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
References: <1183640246.4377.436639.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
Message-ID: <1183988571.25217.377395.camel@hal.voltaire.com>

Hi Amit,

On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> Hi Hal,
> 
> In such case OpenSM should first check that the OPVL fields of the ports
> (the one that sends the traps and its peer) are identical,
> If you have a mismatch in the OPVL field, the link watchdog mechanism
> will retrain the logical link in high rate

If OpVLs is set after the link is active, it only takes "effect" when the
link is bounced (not if it stays active).

Also and more significantly, in terms of the specific issue, the peer
SMA is often non responsive or shortly becomes non responsive so the
peer OpVLs cannot readily be verified post this being detected.

-- Hal

> Amit
> 
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Thursday, July 05, 2007 3:58 PM
> To: general at lists.openfabrics.org
> Cc: Eitan Zahavi; Yevgeny Kliteynik
> Subject: [PATCH] OpenSM handling of "Babbling" Ports
> 
> A "babbling" port is a port which causes traps to be generated
> frequently.
> It may directly be "this" port which generates the traps or the peer
> port detecting the issue and that the SMA on switch port 0 generates the
> traps.
> This has only currently been observed for trap 131 but will also apply
> for traps 129 and 130 as well which are other urgent and similar traps.
> 
> Note that there appears to be a bug in Mellanox firmware for both
> Anafa-2 and Tavor at a minimum which causes the max trap rate not to be
> adhered to and relief for this does not appear to be in short term
> sight.
> 
> Policy
> When a babbling port is detected, OpenSM will disable the port or its
> peer switch port (depending on which trap) which should terminate the
> trap storm.
> 
> Detection
> 250 consecutive traps of this type will be used as the (initial)
> threshold. The reason for this is so as to not prematurely detect this
> and disable a port.
> 
> Recovery
> Admin would reenable port when OK again. (This usually involves
> rebooting the node causing the trap to be indicated.)
> 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index bedd63f..1150703 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
>    boolean_t                honor_guid2lid_file;
>    boolean_t                daemon;
>    boolean_t                sm_inactive;
> +  boolean_t                babbling_port_policy;
>    osm_qos_options_t        qos_options;
>    osm_qos_options_t        qos_ca_options;
>    osm_qos_options_t        qos_sw0_options;
> @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
>  *
>  *	sm_inactive
>  *		OpenSM will start with SM in not active state.
> +*
> +*	babbling_port_policy
> +*		OpenSM will enforce its "babbling" port policy.
>  *	
>  *	perfmgr
>  *		Enable or disable the performance manager
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 726b665..87b71e5 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -472,6 +472,7 @@ osm_subn_set_default_opt(
>    p_opt->honor_guid2lid_file = FALSE;
>    p_opt->daemon = FALSE;
>    p_opt->sm_inactive = FALSE;
> +  p_opt->babbling_port_policy = FALSE;
>  #ifdef ENABLE_OSM_PERF_MGR
>    p_opt->perfmgr = FALSE;
>    p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
> @@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
>          "sm_inactive",
>          p_key, p_val, &p_opts->sm_inactive);
>  
> +      __osm_subn_opts_unpack_boolean(
> +        "babbling_port_policy",
> +        p_key, p_val, &p_opts->babbling_port_policy);
> +
>  #ifdef ENABLE_OSM_PERF_MGR
>        __osm_subn_opts_unpack_boolean(
>          "perfmgr",
> @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
>      "# Daemon mode\n"
>      "daemon %s\n\n"
>      "# SM Inactive\n"
> -    "sm_inactive %s\n\n",
> +    "sm_inactive %s\n\n"
> +    "# Babbling Port Policy\n"
> +    "babbling_port_policy %s\n\n",
>      p_opts->daemon ? "TRUE" : "FALSE",
> -    p_opts->sm_inactive ? "TRUE" : "FALSE"
> +    p_opts->sm_inactive ? "TRUE" : "FALSE",
> +    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
>      );
>  
>  #ifdef ENABLE_OSM_PERF_MGR
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index 5900c51..fbb6dac 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
> @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
>          }
>          else
>          {
> +          /* When babbling port policy option is enabled and
> +             Threshold for disabling a "babbling" port is exceeded */
> +          if ( p_rcv->p_subn->opt.babbling_port_policy &&
> +               num_received >= 250 )
> +          {
> +            uint8_t               payload[IB_SMP_DATA_SIZE];
> +            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> +            const ib_port_info_t* p_old_pi;
> +            osm_madw_context_t    context;
> +
> +            /* If trap 131, might want to disable peer port if available */
> +            /* but peer port has been observed not to respond to SM requests */
> +
> +            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                     "__osm_trap_rcv_process_request: ERR 3810: "
> +                     " Disabling physical port lid:0x%02X num:%u\n",
> +                     cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
> +                     p_ntci->data_details.ntc_129_131.port_num
> +                     );
> +
> +            p_old_pi = &p_physp->port_info;
> +            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> +
> +            /* Set port to disabled/down */
> +            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> +            ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> +
> +            context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> +            context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> +            context.pi_context.set_method = TRUE;
> +            context.pi_context.update_master_sm_base_lid = FALSE;
> +            context.pi_context.light_sweep = FALSE;
> +            context.pi_context.active_transition = FALSE;
> +
> +            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> +                                   osm_physp_get_dr_path_ptr( p_physp ),
> +                                   payload,
> +                                   sizeof(payload),
> +                                   IB_MAD_ATTR_PORT_INFO,
> +                                   cl_hton32(osm_physp_get_port_num( p_physp )),
> +                                   CL_DISP_MSGID_NONE,
> +                                  &context );
> +
> +            if( status == IB_SUCCESS )
> +            {
> +               goto Exit;
> +            }
> +            else
> +            {
> +               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                        "__osm_trap_rcv_process_request: ERR 3811: "
> +                        "Request to set PortInfo failed\n" );
> +            }
> +          }
> +
>            osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
>                     "__osm_trap_rcv_process_request: "
>                     "Marking unhealthy physical port by lid:0x%02X num:%u\n",
> 
> 
> 
> 
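
For readers skimming the quoted patch, the detection rule (policy option
enabled and at least 250 traps in a row) can be sketched in isolation. This
is a simplified userspace model in C, not OpenSM code: trap_run_model and
trap_run_update are hypothetical names, and the way consecutive traps are
counted here is an assumption, since the quoted hunk does not show how
num_received is maintained. Only the threshold value and the
"option && num_received >= 250" test come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the quoted patch. */
#define BABBLING_TRAP_THRESHOLD 250u

/* Hypothetical bookkeeping for one run of identical traps. */
struct trap_run_model {
    uint16_t     trap_num;      /* e.g. 129, 130 or 131 */
    uint16_t     lid;           /* LID reported in the trap */
    uint8_t      port_num;      /* port number reported in the trap */
    unsigned int num_received;  /* length of the current run */
};

/*
 * Record one received trap and return true when the port should be
 * disabled: the policy is enabled and the run of identical traps has
 * reached the threshold. A trap from a different source restarts the run.
 */
static bool trap_run_update(struct trap_run_model *run, bool policy_enabled,
                            uint16_t trap_num, uint16_t lid, uint8_t port_num)
{
    bool same_source = run->num_received > 0 &&
                       run->trap_num == trap_num &&
                       run->lid == lid &&
                       run->port_num == port_num;

    if (same_source) {
        run->num_received++;
    } else {
        run->trap_num = trap_num;
        run->lid = lid;
        run->port_num = port_num;
        run->num_received = 1;
    }

    return policy_enabled && run->num_received >= BABBLING_TRAP_THRESHOLD;
}
```

With a zero-initialized tracker, the first 249 identical traps return false
and the 250th returns true; with the policy disabled the function never
returns true.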


From Don.Kerr at Sun.COM  Mon Jul  9 07:27:08 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Mon, 09 Jul 2007 10:27:08 -0400
Subject: [ofa-general] uDAPL Question
Message-ID: <469245BC.8040108@Sun.COM>

(not sure if this is the proper alias but giving it a try)

Question: Is it possible to determine if an HCA is down intentionally 
from the uDAPL API?

Situation: My node has two HCAs but only one is "UP". A call to 
dat_registry_list_providers gives me everything in dat.conf, including 
the interface that is down. I then proceed to call dat_ia_open on each 
entry, but I don't know how to determine whether the error I get back is 
because the interface is in error or because it is down on purpose.

Thanks
-DON


From eaburns at iol.unh.edu  Mon Jul  9 07:47:02 2007
From: eaburns at iol.unh.edu (Ethan Burns)
Date: Mon, 9 Jul 2007 10:47:02 -0400
Subject: [ofa-general] iSER header
Message-ID: <20070709144702.GB24125@postal.iol.unh.edu>

Hello,
	I have been looking over the latest Linus git repo and I
stumbled upon what appears to be an inconsistency between the iSER
header used in the kernel and the latest iSER draft
(draft-ietf-ips-iser-06.txt):

struct iser_hdr {
        u8      flags;
        u8      rsvd[3];
        __be32  write_stag; /* write rkey */
        __be64  write_va;		<------------------------------
        __be32  read_stag;  /* read rkey */
        __be64  read_va;		<------------------------------
} __attribute__((packed));


The two fields `write_va' and `read_va' seem to be extra fields that are
not defined by the draft.  Won't these fields present interoperability
issues with conformant iSER implementations?
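
To put numbers on the difference, here is a small userspace sketch in C. It
mirrors the struct above with fixed-width stand-ins (uint8_t, uint32_t and
uint64_t in place of u8, __be32 and __be64); iser_hdr_model is a hypothetical
name used only for this illustration, and the packed attribute assumes GCC
or Clang. With the two VA fields the packed header is 28 bytes; the flags,
reserved bytes and two STags alone would account for 12.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel struct quoted above (assumed layout). */
struct iser_hdr_model {
    uint8_t  flags;
    uint8_t  rsvd[3];
    uint32_t write_stag;  /* write rkey */
    uint64_t write_va;    /* field in question */
    uint32_t read_stag;   /* read rkey */
    uint64_t read_va;     /* field in question */
} __attribute__((packed));

/* Bytes taken by the two VA fields, and by everything else. */
enum {
    ISER_MODEL_VA_BYTES   = 2 * sizeof(uint64_t),
    ISER_MODEL_BASE_BYTES = sizeof(struct iser_hdr_model) - ISER_MODEL_VA_BYTES
};

/* Compile-time checks of the packed layout (C11). */
_Static_assert(sizeof(struct iser_hdr_model) == 28, "packed header is 28 bytes");
_Static_assert(offsetof(struct iser_hdr_model, write_va) == 8, "write_va at 8");
_Static_assert(offsetof(struct iser_hdr_model, read_stag) == 16, "read_stag at 16");
_Static_assert(offsetof(struct iser_hdr_model, read_va) == 20, "read_va at 20");
_Static_assert(ISER_MODEL_BASE_BYTES == 12, "12 bytes without the VA fields");
```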

Any information would be greatly appreciated.

Ethan Burns


From rdreier at cisco.com  Mon Jul  9 07:48:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 07:48:13 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <469245BC.8040108@Sun.COM> (Don Kerr's message of "Mon,
	09 Jul 2007 10:27:08 -0400")
References: <469245BC.8040108@Sun.COM>
Message-ID: <ada644t1vma.fsf@cisco.com>

 > Question: Is it possible to determine if an HCA is down intentionally
 > from the uDAPL API?

What do you mean by an HCA being "up" or "down"?

And what would it mean for it to be down "intentionally"?

 - R.


From Don.Kerr at Sun.COM  Mon Jul  9 08:04:16 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Mon, 09 Jul 2007 11:04:16 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <ada644t1vma.fsf@cisco.com>
References: <469245BC.8040108@Sun.COM> <ada644t1vma.fsf@cisco.com>
Message-ID: <46924E70.2040205@Sun.COM>

Sorry. I was wrongly lumping port and HCA together.

There are 2 HCA cards, each with 2 ports, but only one port on one card is 
operational; by that I mean it can be pinged or is shown as "UP" when you 
run ifconfig. But both are still listed in the dat.conf.

-DON

Roland Dreier wrote:

> > Question: Is it possible to determine if an HCA is down intentionally
> > from the uDAPL API?
>
>What do you mean by an HCA to being "up" or "down"?
>
>And what would it mean for it to be down "intentionally"?
>
> - R.
>_______________________________________________
>general mailing list
>general at lists.openfabrics.org
>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>  
>


From rdreier at cisco.com  Mon Jul  9 08:28:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:28:55 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709091210.GP3182@rhun.haifa.ibm.com> (Muli Ben-Yehuda's
	message of "Mon, 9 Jul 2007 12:12:10 +0300")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709090802.GA3885@ics.muni.cz>
	<20070709091210.GP3182@rhun.haifa.ibm.com>
Message-ID: <ada1wfh1tqg.fsf@cisco.com>

 > > according to Xen-dev alloc_pages does *not* guarantee contiguous
 > > pages. They say that the pci_alloc_consistent should be used
 > > instead. The question is whether non-Xen kernel *usually* allocates
 > > contiguous pages and so far it has been working and whether it
 > > should be fixed in the mainline of the driver.
 > > 
 > > I do some tests (and also try to figure out how to change
 > > alloc_pages to pci_alloc_consistent) to verify contiguous pages.
 > 
 > You missed an important bit of Keir's response---it's perfectly fine
 > to use alloc_pages provided you then use the dma_map_single API, which
 > for Xen dom0 will take care of bounce-buffering to a
 > machine-contiguous buffer if necessary. I am not sure if the same
 > holds for a domU kernel.

I guess there was a mail thread that I wasn't copied on (I don't read
any Xen mailing lists).

Anyway, what mthca does is the following.  It wants to give a bunch of
system memory (megabytes) to the hardware for the hardware to use for
its internal context.  The hardware accesses this memory via PCI DMA
of course.  So what mthca does is:

 - Allocate large chunks of system memory using
   alloc_pages(GFP_HIGHUSER, order) with order > 0
 - Build up an array of struct scatterlist where each entry is one of
   the order > 0 pages allocated as above
 - Map that scatterlist with pci_map_sg(..., PCI_DMA_BIDIRECTIONAL)
 - Pass the DMA addresses returned from that to the hardware

As far as I can see, what mthca is doing is perfectly fine as far as
the DMA mapping API is concerned.  If Xen is returning non-contiguous
memory from alloc_pages() and then allocating bounce buffers in
pci_map_sg() then that should work (although it will be somewhat
inefficient, since the original memory will never actually be used).
However I would confirm that that is what Xen is really trying to do,
and also that the code is working as intended when the scatterlist has
entries with pages of order >0.

As a side note, mthca could use dma_alloc_coherent() to allocate this
hardware memory, but that would be inefficient on 32-bit systems,
because it would use up kernel address space for memory that will only
be touched by the hardware.  So that's why it allocates pages with
GFP_HIGHUSER instead.
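
As a rough illustration of the first two steps in the list above, here is a
userspace model in C. It does not call any kernel API (alloc_pages and
pci_map_sg appear only in comments), and chunk_model, plan_chunks and
MODEL_PAGE_SIZE are hypothetical names; the 4 KB page size is an assumption.
It only shows how a request for N pages breaks down into entries of
2^order pages each, which is the shape of the scatterlist described above.

```c
#include <stddef.h>

/* Assumption: 4 KB pages. */
#define MODEL_PAGE_SIZE 4096UL

/* One scatterlist-like entry: a single chunk of 2^order pages. */
struct chunk_model {
    unsigned int  order;   /* as would be passed to alloc_pages() */
    unsigned long length;  /* chunk length in bytes */
};

/*
 * Split npages into chunks no larger than 2^max_order pages, largest
 * first. Returns the number of entries written, or -1 if out[] is too
 * small. A real allocator can fail at a given order; this model ignores
 * that case.
 */
static int plan_chunks(unsigned long npages, unsigned int max_order,
                       struct chunk_model *out, size_t max_entries)
{
    size_t n = 0;
    unsigned int order = max_order;

    while (npages > 0) {
        /* Shrink the order until the chunk fits in what is left. */
        while ((1UL << order) > npages)
            --order;
        if (n == max_entries)
            return -1;
        out[n].order = order;
        out[n].length = MODEL_PAGE_SIZE << order;
        npages -= 1UL << order;
        ++n;
    }
    return (int)n;
}
```

For example, 100 pages with a maximum order of 6 come out as three entries
of order 6, 5 and 2 (256 KB, 128 KB and 16 KB). In the driver each such
entry would then be handed to pci_map_sg(), and it is the handling of these
order > 0 entries that is in question for Xen.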

 - R.


From rdreier at cisco.com  Mon Jul  9 08:30:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:30:06 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709133743.GK3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Mon, 9 Jul 2007 15:37:43 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709133743.GK3885@ics.muni.cz>
Message-ID: <adawsx9zjb5.fsf@cisco.com>

 > ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000
 > ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000

And what do you get back from pci_map_sg() for this order >0 page?

Unfortunately I guess there's no way to see if the pci_map_sg()
implementation has taken into account the full size of the scatterlist
entry.

 - R.


From xhejtman at ics.muni.cz  Mon Jul  9 08:37:15 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 17:37:15 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adawsx9zjb5.fsf@cisco.com>
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709133743.GK3885@ics.muni.cz> <adawsx9zjb5.fsf@cisco.com>
Message-ID: <20070709153715.GA6496@ics.muni.cz>

On Mon, Jul 09, 2007 at 08:30:06AM -0700, Roland Dreier wrote:
>  > ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000
>  > ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000
> 
> And what do you get back from pci_map_sg() for this order >0 page?
> 
> Unfortunately I guess there's no way to see if the pci_map_sg()
> implementation has taken into account the full size of the scatterlist
> entry.

Well, using swiotlb=force (which turns on the bounce buffers) I do not get
the oops any more. On the other hand, I got some oopses in the memcpy of the
bounce buffers, which I am trying to solve with the Xen developers.

-- 
Lukáš Hejtmánek


From rdreier at cisco.com  Mon Jul  9 08:55:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:55:07 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709153715.GA6496@ics.muni.cz> (Lukas Hejtmanek's message of
	"Mon, 9 Jul 2007 17:37:15 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <adavecy4ysc.fsf@cisco.com>
	<20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709133743.GK3885@ics.muni.cz> <adawsx9zjb5.fsf@cisco.com>
	<20070709153715.GA6496@ics.muni.cz>
Message-ID: <adasl7xzi5g.fsf@cisco.com>

 > Well, using swiotlb=force (which turns on the bounce buffers) I do not get
 > oops any more. On the other hand, I got some oopses in memcpy of the bounce
 > buffers which I try to solve with Xen developpers.

So it seems there is a problem with the normal Xen PCI mapping API
then.  It would be better to avoid bounce buffers for this if
possible, because as I said that would double the memory consumption
and potentially exhaust your swiotlb space (because this hardware
context memory is not used for "in-flight" IOs, it is essentially
given to the hardware permanently).

Also, could you please CC me on any threads with the Xen developers?
It's kind of annoying to only get half of the story about what's going
on with debugging this.

Thanks,
  Roland


From rdreier at cisco.com  Mon Jul  9 09:10:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 09:10:47 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: add device reset to Internal Error
	handling mechanism
In-Reply-To: <200707091012.52418.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 9 Jul 2007 10:12:52 +0300")
References: <200707091012.52418.jackm@dev.mellanox.co.il>
Message-ID: <adaodilzhfc.fsf@cisco.com>

 > This patch also disables the detection of Internal Errors via a device
 > interrupt, because we wish to avoid the complexity of supporting
 > two independent detection mechanisms.

OK, but...

 >  static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr)
 >  {
 > -	mlx4_handle_catas_err(dev_ptr);
 > +	/* disable handling catas errors via interrupt. */
 > +	/* We now handle them via polling.              */
 > +	/* mlx4_handle_catas_err(dev_ptr);              */

Why not just delete all the interrupt stuff completely?


For

 > +		mod_timer(&priv->catas_err.timer,
 > +			  jiffies + MLX4_CATAS_POLL_INTERVAL);

and

 > +	priv->catas_err.timer.expires  = jiffies + MLX4_CATAS_POLL_INTERVAL;

how about round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL) instead?

 - R.


From halr at voltaire.com  Mon Jul  9 09:18:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 12:18:15 -0400
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
In-Reply-To: <1183988571.25217.377395.camel@hal.voltaire.com>
References: <1183640246.4377.436639.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
	<1183988571.25217.377395.camel@hal.voltaire.com>
Message-ID: <1183997893.25217.388186.camel@hal.voltaire.com>

Hi again Amit,

On Mon, 2007-07-09 at 09:42, Hal Rosenstock wrote:
> Hi Amit,
> 
> On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> > Hi Hal,
> > 
> > In such case OpenSM should first check that the OPVL fields of the ports
> > (the one that sends the traps and its peer) are identical,
> > If you have a mismatch in the OPVL field, the link watchdog mechanism
> > will retrain the logical link in high rate
> 
> OpVLs only takes "effect" if set after link active only if the link is
> bounced (not if it stays active).

Not sure about what I wrote above. p.829 states that in certain
PortStates this may cause flow control update errors (and initiate
Link/Phy retraining).

> Also and more significantly, in terms of the specific issue, the peer
> SMA is often non responsive or shortly becomes non responsive so the
> peer OpVLs cannot readily be verified post this being detected.

This, as well as the trap rate, are the issues; perhaps second-level
ones, but issues nonetheless.

-- Hal

> -- Hal
> 
> > Amit
> > 
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > Sent: Thursday, July 05, 2007 3:58 PM
> > To: general at lists.openfabrics.org
> > Cc: Eitan Zahavi; Yevgeny Kliteynik
> > Subject: [PATCH] OpenSM handling of "Babbling" Ports
> > 
> > A "babbling" port is a port which causes traps to be generated
> > frequently.
> > It may directly be "this" port which generates the traps or the peer
> > port detecting the issue and that the SMA on switch port 0 generates the
> > traps.
> > This has only currently been observed for trap 131 but will also apply
> > for traps 129 and 130 as well which are other urgent and similar traps.
> > 
> > Note that there appears to be a bug in Mellanox firmware for both
> > Anafa-2 and Tavor at a minimum which causes the max trap rate not to be
> > adhered to and relief for this does not appear to be in short term
> > sight.
> > 
> > Policy
> > When a bablbing port is detected, OpenSM will disable the port or its
> > peer switch port (depending on which trap) which should terminate the
> > trap storm.
> > 
> > Detection
> > 250 consecutive traps of this type will be used as the (initial)
> > threshold. The reason for this is so as to not prematurely detect this
> > and disable a port.
> > 
> > Recovery
> > Admin would reenable port when OK again. (This usually involves
> > rebooting the node causing the trap to be indicated.)
> > 
> > Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> > 
> > diff --git a/opensm/include/opensm/osm_subnet.h
> > b/opensm/include/opensm/osm_subnet.h
> > index bedd63f..1150703 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
> >    boolean_t                honor_guid2lid_file;
> >    boolean_t                daemon;
> >    boolean_t                sm_inactive;
> > +  boolean_t                babbling_port_policy;
> >    osm_qos_options_t        qos_options;
> >    osm_qos_options_t        qos_ca_options;
> >    osm_qos_options_t        qos_sw0_options;
> > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
> >  *
> >  *	sm_inactive
> >  *		OpenSM will start with SM in not active state.
> > +*
> > +*	babbling_port_policy
> > +*		OpenSM will enforce its "babbling" port policy.
> >  *	
> >  *	perfmgr
> >  *		Enable or disable the performance manager
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index 726b665..87b71e5 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -472,6 +472,7 @@ osm_subn_set_default_opt(
> >    p_opt->honor_guid2lid_file = FALSE;
> >    p_opt->daemon = FALSE;
> >    p_opt->sm_inactive = FALSE;
> > +  p_opt->babbling_port_policy = FALSE;
> >  #ifdef ENABLE_OSM_PERF_MGR
> >    p_opt->perfmgr = FALSE;
> >    p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@
> > -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
> >          "sm_inactive",
> >          p_key, p_val, &p_opts->sm_inactive);
> >  
> > +      __osm_subn_opts_unpack_boolean(
> > +        "babbling_port_policy",
> > +        p_key, p_val, &p_opts->babbling_port_policy);
> > +
> >  #ifdef ENABLE_OSM_PERF_MGR
> >        __osm_subn_opts_unpack_boolean(
> >          "perfmgr",
> > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
> >      "# Daemon mode\n"
> >      "daemon %s\n\n"
> >      "# SM Inactive\n"
> > -    "sm_inactive %s\n\n",
> > +    "sm_inactive %s\n\n"
> > +    "# Babbling Port Policy\n"
> > +    "babbling_port_policy %s\n\n",
> >      p_opts->daemon ? "TRUE" : "FALSE",
> > -    p_opts->sm_inactive ? "TRUE" : "FALSE"
> > +    p_opts->sm_inactive ? "TRUE" : "FALSE",
> > +    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
> >      );
> >  
> >  #ifdef ENABLE_OSM_PERF_MGR
> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> > index 5900c51..fbb6dac 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -1,5 +1,5 @@
> >  /*
> > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
> > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> >   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
> > reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
> >          }
> >          else
> >          {
> > +          /* When babbling port policy option is enabled and
> > +             Threshold for disabling a "babbling" port is exceeded */
> > +          if ( p_rcv->p_subn->opt.babbling_port_policy &&
> > +               num_received >= 250 )
> > +          {
> > +            uint8_t               payload[IB_SMP_DATA_SIZE];
> > +            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> > +            const ib_port_info_t* p_old_pi;
> > +            osm_madw_context_t    context;
> > +
> > +            /* If trap 131, might want to disable peer port if
> > available */
> > +            /* but peer port has been observed not to respond to SM 
> > + requests */
> > +
> > +            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                     "__osm_trap_rcv_process_request: ERR 3810: "
> > +                     " Disabling physical port lid:0x%02X num:%u\n",
> > +                     cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
> > +                     p_ntci->data_details.ntc_129_131.port_num
> > +                     );
> > +
> > +            p_old_pi = &p_physp->port_info;
> > +            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> > +
> > +            /* Set port to disabled/down */
> > +            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> > +            ib_port_info_set_port_phys_state( 
> > + IB_PORT_PHYS_STATE_DISABLED, p_pi );
> > +
> > +            context.pi_context.node_guid = osm_node_get_node_guid(
> > osm_physp_get_node_ptr( p_physp ) );
> > +            context.pi_context.port_guid = osm_physp_get_port_guid(
> > p_physp );
> > +            context.pi_context.set_method = TRUE;
> > +            context.pi_context.update_master_sm_base_lid = FALSE;
> > +            context.pi_context.light_sweep = FALSE;
> > +            context.pi_context.active_transition = FALSE;
> > +
> > +            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> > +                                   osm_physp_get_dr_path_ptr( p_physp ),
> > +                                   payload,
> > +                                   sizeof(payload),
> > +                                   IB_MAD_ATTR_PORT_INFO,
> > +                                   cl_hton32(osm_physp_get_port_num( p_physp )),
> > +                                   CL_DISP_MSGID_NONE,
> > +                                  &context );
> > +
> > +            if( status == IB_SUCCESS )
> > +            {
> > +               goto Exit;
> > +            }
> > +            else
> > +            {
> > +               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                        "__osm_trap_rcv_process_request: ERR 3811: "
> > +                        "Request to set PortInfo failed\n" );
> > +            }
> > +          }
> > +
> >            osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
> >                     "__osm_trap_rcv_process_request: "
> >                     "Marking unhealthy physical port by lid:0x%02X num:%u\n",
> > 
> > 
> > 
> > 
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From amitk at mellanox.co.il  Mon Jul  9 09:40:14 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Mon, 9 Jul 2007 19:40:14 +0300
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
References: <1183640246.4377.436639.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
	<1183988571.25217.377395.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>

Hi Hal

I was only talking about the logical link == Active state.
In this state the watchdog can bring the physical link to the recovery state
while the logical link bounces between Active and ActiveDefer.

Regarding the responsiveness issue, OpenSM in this scenario should move the
logical link on the responsive side to the Init state. That way the watchdog
will stop bringing down the link, and OpenSM can then do the checks.

Amit

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Monday, July 09, 2007 4:43 PM
To: Amit Krig
Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik
Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports

Hi Amit,

On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> Hi Hal,
> 
> In such case OpenSM should first check that the OPVL fields of the 
> ports (the one that sends the traps and its peer) are identical, If 
> you have a mismatch in the OPVL field, the link watchdog mechanism 
> will retrain the logical link in high rate

OpVLs set after the link is active only take "effect" if the link is
bounced (not if it stays active).

Also, and more significantly in terms of the specific issue, the peer
SMA is often nonresponsive or shortly becomes nonresponsive, so the
peer OpVLs cannot readily be verified once this has been detected.

-- Hal

> Amit
> 
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Thursday, July 05, 2007 3:58 PM
> To: general at lists.openfabrics.org
> Cc: Eitan Zahavi; Yevgeny Kliteynik
> Subject: [PATCH] OpenSM handling of "Babbling" Ports
> 
> A "babbling" port is a port which causes traps to be generated 
> frequently.
> It may directly be "this" port which generates the traps or the peer 
> port detecting the issue and that the SMA on switch port 0 generates 
> the traps.
> This has only currently been observed for trap 131 but will also apply
> for traps 129 and 130 as well which are other urgent and similar traps.
> 
> Note that there appears to be a bug in Mellanox firmware for both
> Anafa-2 and Tavor at a minimum which causes the max trap rate not to 
> be adhered to and relief for this does not appear to be in short term 
> sight.
> 
> Policy
> When a babbling port is detected, OpenSM will disable the port or its 
> peer switch port (depending on which trap) which should terminate the 
> trap storm.
> 
> Detection
> 250 consecutive traps of this type will be used as the (initial) 
> threshold. The reason for this is so as to not prematurely detect this
> and disable a port.
> 
> Recovery
> Admin would reenable port when OK again. (This usually involves 
> rebooting the node causing the trap to be indicated.)
> 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index bedd63f..1150703 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
>    boolean_t                honor_guid2lid_file;
>    boolean_t                daemon;
>    boolean_t                sm_inactive;
> +  boolean_t                babbling_port_policy;
>    osm_qos_options_t        qos_options;
>    osm_qos_options_t        qos_ca_options;
>    osm_qos_options_t        qos_sw0_options;
> @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
>  *
>  *	sm_inactive
>  *		OpenSM will start with SM in not active state.
> +*
> +*	babbling_port_policy
> +*		OpenSM will enforce its "babbling" port policy.
>  *	
>  *	perfmgr
>  *		Enable or disable the performance manager
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 726b665..87b71e5 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -472,6 +472,7 @@ osm_subn_set_default_opt(
>    p_opt->honor_guid2lid_file = FALSE;
>    p_opt->daemon = FALSE;
>    p_opt->sm_inactive = FALSE;
> +  p_opt->babbling_port_policy = FALSE;
>  #ifdef ENABLE_OSM_PERF_MGR
>    p_opt->perfmgr = FALSE;
>    p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
> @@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
>          "sm_inactive",
>          p_key, p_val, &p_opts->sm_inactive);
>  
> +      __osm_subn_opts_unpack_boolean(
> +        "babbling_port_policy",
> +        p_key, p_val, &p_opts->babbling_port_policy);
> +
>  #ifdef ENABLE_OSM_PERF_MGR
>        __osm_subn_opts_unpack_boolean(
>          "perfmgr",
> @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
>      "# Daemon mode\n"
>      "daemon %s\n\n"
>      "# SM Inactive\n"
> -    "sm_inactive %s\n\n",
> +    "sm_inactive %s\n\n"
> +    "# Babbling Port Policy\n"
> +    "babbling_port_policy %s\n\n",
>      p_opts->daemon ? "TRUE" : "FALSE",
> -    p_opts->sm_inactive ? "TRUE" : "FALSE"
> +    p_opts->sm_inactive ? "TRUE" : "FALSE",
> +    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
>      );
>  
>  #ifdef ENABLE_OSM_PERF_MGR
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index 5900c51..fbb6dac 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
> @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
>          }
>          else
>          {
> +          /* When babbling port policy option is enabled and
> +             Threshold for disabling a "babbling" port is exceeded */
> +          if ( p_rcv->p_subn->opt.babbling_port_policy &&
> +               num_received >= 250 )
> +          {
> +            uint8_t               payload[IB_SMP_DATA_SIZE];
> +            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> +            const ib_port_info_t* p_old_pi;
> +            osm_madw_context_t    context;
> +
> +            /* If trap 131, might want to disable peer port if available */
> +            /* but peer port has been observed not to respond to SM requests */
> +
> +            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                     "__osm_trap_rcv_process_request: ERR 3810: "
> +                     " Disabling physical port lid:0x%02X num:%u\n",
> +                     cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
> +                     p_ntci->data_details.ntc_129_131.port_num
> +                     );
> +
> +            p_old_pi = &p_physp->port_info;
> +            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> +
> +            /* Set port to disabled/down */
> +            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> +            ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> +
> +            context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> +            context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> +            context.pi_context.set_method = TRUE;
> +            context.pi_context.update_master_sm_base_lid = FALSE;
> +            context.pi_context.light_sweep = FALSE;
> +            context.pi_context.active_transition = FALSE;
> +
> +            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> +                                   osm_physp_get_dr_path_ptr( p_physp ),
> +                                   payload,
> +                                   sizeof(payload),
> +                                   IB_MAD_ATTR_PORT_INFO,
> +                                   cl_hton32(osm_physp_get_port_num( p_physp )),
> +                                   CL_DISP_MSGID_NONE,
> +                                  &context );
> +
> +            if( status == IB_SUCCESS )
> +            {
> +               goto Exit;
> +            }
> +            else
> +            {
> +               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                        "__osm_trap_rcv_process_request: ERR 3811: "
> +                        "Request to set PortInfo failed\n" );
> +            }
> +          }
> +
>            osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
>                     "__osm_trap_rcv_process_request: "
>                     "Marking unhealthy physical port by lid:0x%02X num:%u\n",
> 
> 
> 
> 


From xhejtman at ics.muni.cz  Mon Jul  9 09:53:23 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 18:53:23 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <adasl7xzi5g.fsf@cisco.com>
References: <20070705193136.GQ3885@ics.muni.cz> <adasl8232zv.fsf@cisco.com>
	<20070707085303.GS3885@ics.muni.cz> <ada3azz3ihr.fsf@cisco.com>
	<20070708001531.GT3885@ics.muni.cz> <adad4z2184r.fsf@cisco.com>
	<20070709133743.GK3885@ics.muni.cz> <adawsx9zjb5.fsf@cisco.com>
	<20070709153715.GA6496@ics.muni.cz> <adasl7xzi5g.fsf@cisco.com>
Message-ID: <20070709165323.GN3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 08:55:07AM -0700, Roland Dreier wrote:
> So it seems there is a problem with the normal Xen PCI mapping API
> then.  It would be better to avoid bounce buffers for this if
> possible, because as I said that would double the memory consumption
> and potentially exhaust your swiotlb space (because this hardware
> context memory is not used for "in-flight" IOs, it is essentially
> given to the hardware permanently).
> 
> Also, could you please CC me on any threads with the Xen developers?
> It's kind of annoying to only get half of the story about what's going
> on with debugging this.

Sorry for that. The beginning of the thread is archived here:
http://lists.xensource.com/archives/html/xen-devel/2007-07/msg00209.html
However, the last two posts are missing. If you would like the whole
thread bounced to you, I can do that.

Right now, I have a problem in :ib_mthca:mthca_arbel_write_mtt_seg,
where dma_sync_single is called and swiotlb does not have a corresponding
mapping.

-- 
Lukáš Hejtmánek


From vuhuong at mellanox.com  Mon Jul  9 09:55:04 2007
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 09 Jul 2007 09:55:04 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <1183852853.6008.11.camel@gentoo-linux.localdomain>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>
Message-ID: <46926868.8000704@mellanox.com>

Stanley Sufficool wrote:
>   Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
> 
> Got the latest srpt from the git repository on OpenFabrics and had the 
> following issues.
> 
> ib_srpt.c    Line 1997, missing second argument, should be?   
> sdev->scst_tgt = scst_register(tp, NULL);
> 

Yes. You need the change if you test with the top of the scst svn
trunk (or any version from 0.9.6-pre2 on).
If you test with scst before 0.9.6-pre2 (i.e. version <=
0.9.6-pre1), you don't need the second argument for
scst_register().
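
A minimal sketch of how one call site could cover both cases with a
compile-time switch. Everything prefixed demo_/DEMO_ is hypothetical and
exists only for this illustration: the types are stand-ins rather than the
real SCST headers, and DEMO_SCST_REGISTER_TAKES_NAME is an assumed build
flag, not something SCST or srpt defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SCST types; not the real SCST headers. */
struct demo_tgt_template { const char *name; };
struct demo_tgt {
    const struct demo_tgt_template *tp;
    const char *target_name;
};

/* Assumed build flag: set to 0 when building against scst <= 0.9.6-pre1. */
#ifndef DEMO_SCST_REGISTER_TAKES_NAME
#define DEMO_SCST_REGISTER_TAKES_NAME 1
#endif

static struct demo_tgt demo_target;

#if DEMO_SCST_REGISTER_TAKES_NAME
/* Two-argument form, as needed from 0.9.6-pre2 on. */
static struct demo_tgt *demo_scst_register(struct demo_tgt_template *tp,
                                           const char *target_name)
{
    assert(tp != NULL);
    demo_target.tp = tp;
    demo_target.target_name = target_name;
    return &demo_target;
}
/* Pass NULL as the second argument, as in the ib_srpt.c fix quoted above. */
#define DEMO_REGISTER(tp) demo_scst_register((tp), NULL)
#else
/* One-argument form, as used before 0.9.6-pre2. */
static struct demo_tgt *demo_scst_register(struct demo_tgt_template *tp)
{
    assert(tp != NULL);
    demo_target.tp = tp;
    demo_target.target_name = NULL;
    return &demo_target;
}
#define DEMO_REGISTER(tp) demo_scst_register((tp))
#endif
```

With a wrapper like this the call site reads the same for both scst
versions, and only the build flag changes.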


> SCST was built successfully after fixing an issue in scst_vdisk.c 
> (missing #include <linux/sched.h>)


I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
- you should send the patch to scst devel

> 
> Just thought this would be nice to have documented, took me half a day 
> to track down as a novice in C programming.
> 

There is a *lean and mean* srpt README in srpt_inc.
SCST also has some documentation.
You can add wiki notes about these problems to the OpenFabrics
wiki page https://wiki.openfabrics.org/tiki-index.php

-vu

> 


From halr at voltaire.com  Mon Jul  9 09:57:47 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 12:57:47 -0400
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>
References: <1183640246.4377.436639.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
	<1183988571.25217.377395.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>
Message-ID: <1184000266.25217.390914.camel@hal.voltaire.com>

Hi Amit,

On Mon, 2007-07-09 at 12:40, Amit Krig wrote:
> Hi Hal
> 
> I was only talking on logical link == Active state.
> In this state the watchdog can bring the physical link to recovery state
> while the logical link will bounce between Active and ActiveDefer.

OK; I follow this but I'm not sure what you are saying about "applying"
it to the patch in question.

> Regarding the responsive issue, OpenSM in this scenario should move the
> logical link in the responsive side to Init state

rather than disabling it on some threshold. What about the other similar
traps 129 and 130? How should they be handled?

> that way the watchdog will stop bringing down the link and then do the checks

I think the checks will still fail but this seems like it would stop the
traps from being generated (so fast).

-- Hal

> Amit
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Monday, July 09, 2007 4:43 PM
> To: Amit Krig
> Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik
> Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports
> 
> Hi Amit,
> 
> On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> > Hi Hal,
> > 
> > In such case OpenSM should first check that the OPVL fields of the 
> > ports (the one that sends the traps and its peer) are identical, If 
> > you have a mismatch in the OPVL field, the link watchdog mechanism 
> > will retrain the logical link in high rate
> 
> OpVLs only takes "effect" if set after link active only if the link is
> bounced (not if it stays active).
> 
> Also and more significantly, in terms of the specific issue, the peer
> SMA is often non responsive or shortly becomes non responsive so the
> peer OpVLs cannot readily be verified post this being detected.
> 
> -- Hal
> 
> 


From amitk at mellanox.co.il  Mon Jul  9 10:07:06 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Mon, 9 Jul 2007 20:07:06 +0300
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
References: <1183640246.4377.436639.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
	<1183988571.25217.377395.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>
	<1184000266.25217.390914.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE112A@mtlexch01.mtl.com>

I mean that if you still get the traps at a high rate (after verifying the
OpVLs), then you should consider disabling the link.
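
One way to read the sequence discussed in this thread, as a rough sketch
only. This is not OpenSM code: the enum, the struct and the function name
are hypothetical, and only the value 250 is taken from the patch. The
ordering (Init first, then the OpVL check, then disable) is an assumption
based on the messages above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold taken from the patch under discussion. */
#define DEMO_BABBLING_THRESHOLD 250

/* Hypothetical escalation steps; these names are not OpenSM identifiers. */
enum demo_babbling_action {
    DEMO_ACT_NONE,        /* below threshold, do nothing yet */
    DEMO_ACT_SET_INIT,    /* move the responsive side's logical link to Init */
    DEMO_ACT_CHECK_OPVLS, /* compare OpVLs on the port and its peer */
    DEMO_ACT_DISABLE      /* traps persist after the checks: disable the link */
};

/* Hypothetical per-port bookkeeping. */
struct demo_babbling_state {
    unsigned num_received; /* consecutive traps seen for this port */
    bool moved_to_init;    /* Init step already done */
    bool opvls_verified;   /* OpVL check already done */
};

/* Return the next step for a port, given what has already been tried. */
static enum demo_babbling_action
demo_babbling_next_action(const struct demo_babbling_state *s)
{
    assert(s != NULL);
    if (s->num_received < DEMO_BABBLING_THRESHOLD)
        return DEMO_ACT_NONE;
    if (!s->moved_to_init)
        return DEMO_ACT_SET_INIT;
    if (!s->opvls_verified)
        return DEMO_ACT_CHECK_OPVLS;
    return DEMO_ACT_DISABLE;
}
```

As noted earlier in the thread, the peer SMA may not respond, so a real
implementation would presumably need a way to skip the OpVL step when the
peer cannot be queried.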

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Monday, July 09, 2007 7:58 PM
To: Amit Krig
Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik
Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports

Hi Amit,

On Mon, 2007-07-09 at 12:40, Amit Krig wrote:
> Hi Hal
> 
> I was only talking on logical link == Active state.
> In this state the watchdog can bring the physical link to recovery 
> state while the logical link will bounce between Active and ActiveDefer.

OK; I follow this but I'm not sure what you are saying about "applying"
it to the patch in question.

> Regarding the responsive issue, OpenSM in this scenario should move 
> the logical link in the responsive side to Init state

rather than disabling it on some threshold. What about the other similar
traps 129 and 130 ? How should they be handled ?

> that way the watchdog will stop bringing down the link and then do the
> checks

I think the checks will still fail but this seems like it would stop the
traps from being generated (so fast).

-- Hal

> Amit
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, July 09, 2007 4:43 PM
> To: Amit Krig
> Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik
> Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports
> 
> Hi Amit,
> 
> On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> > Hi Hal,
> > 
> > In such case OpenSM should first check that the OPVL fields of the 
> > ports (the one that sends the traps and its peer) are identical, If 
> > you have a mismatch in the OPVL field, the link watchdog mechanism 
> > will retrain the logical link in high rate
> 
> OpVLs only takes "effect" if set after link active only if the link is
> bounced (not if it stays active).
> 
> Also and more significantly, in terms of the specific issue, the peer 
> SMA is often non responsive or shortly becomes non responsive so the 
> peer OpVLs cannot readily be verified post this being detected.
> 
> -- Hal
> 
> > Amit
> > 
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Thursday, July 05, 2007 3:58 PM
> > To: general at lists.openfabrics.org
> > Cc: Eitan Zahavi; Yevgeny Kliteynik
> > Subject: [PATCH] OpenSM handling of "Babbling" Ports
> > 
> > A "babbling" port is a port which causes traps to be generated 
> > frequently.
> > It may directly be "this" port which generates the traps or the peer

> > port detecting the issue and that the SMA on switch port 0 generates

> > the traps.
> > This has only currently been observed for trap 131 but will also 
> > apply
> 
> > for traps 129 and 130 as well which are other urgent and similar
> traps.
> > 
> > Note that there appears to be a bug in Mellanox firmware for both
> > Anafa-2 and Tavor at a minimum which causes the max trap rate not to

> > be adhered to and relief for this does not appear to be in short 
> > term sight.
> > 
> > Policy
> > When a bablbing port is detected, OpenSM will disable the port or 
> > its peer switch port (depending on which trap) which should 
> > terminate the trap storm.
> > 
> > Detection
> > 250 consecutive traps of this type will be used as the (initial) 
> > threshold. The reason for this is so as to not prematurely detect 
> > this
> 
> > and disable a port.
> > 
> > Recovery
> > Admin would reenable port when OK again. (This usually involves 
> > rebooting the node causing the trap to be indicated.)
> > 
> > Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> > 
> > diff --git a/opensm/include/opensm/osm_subnet.h
> > b/opensm/include/opensm/osm_subnet.h
> > index bedd63f..1150703 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
> >    boolean_t                honor_guid2lid_file;
> >    boolean_t                daemon;
> >    boolean_t                sm_inactive;
> > +  boolean_t                babbling_port_policy;
> >    osm_qos_options_t        qos_options;
> >    osm_qos_options_t        qos_ca_options;
> >    osm_qos_options_t        qos_sw0_options;
> > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
> >  *
> >  *	sm_inactive
> >  *		OpenSM will start with SM in not active state.
> > +*
> > +*	babbling_port_policy
> > +*		OpenSM will enforce its "babbling" port policy.
> >  *	
> >  *	perfmgr
> >  *		Enable or disable the performance manager
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c

> > index 726b665..87b71e5 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -472,6 +472,7 @@ osm_subn_set_default_opt(
> >    p_opt->honor_guid2lid_file = FALSE;
> >    p_opt->daemon = FALSE;
> >    p_opt->sm_inactive = FALSE;
> > +  p_opt->babbling_port_policy = FALSE;
> >  #ifdef ENABLE_OSM_PERF_MGR
> >    p_opt->perfmgr = FALSE;
> >    p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; 
> > @@
> > -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
> >          "sm_inactive",
> >          p_key, p_val, &p_opts->sm_inactive);
> >  
> > +      __osm_subn_opts_unpack_boolean(
> > +        "babbling_port_policy",
> > +        p_key, p_val, &p_opts->babbling_port_policy);
> > +
> >  #ifdef ENABLE_OSM_PERF_MGR
> >        __osm_subn_opts_unpack_boolean(
> >          "perfmgr",
> > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
> >      "# Daemon mode\n"
> >      "daemon %s\n\n"
> >      "# SM Inactive\n"
> > -    "sm_inactive %s\n\n",
> > +    "sm_inactive %s\n\n"
> > +    "# Babbling Port Policy\n"
> > +    "babbling_port_policy %s\n\n",
> >      p_opts->daemon ? "TRUE" : "FALSE",
> > -    p_opts->sm_inactive ? "TRUE" : "FALSE"
> > +    p_opts->sm_inactive ? "TRUE" : "FALSE",
> > +    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
> >      );
> >  
> >  #ifdef ENABLE_OSM_PERF_MGR
> > diff --git a/opensm/opensm/osm_trap_rcv.c 
> > b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -1,5 +1,5 @@
> >  /*
> > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
> > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> >   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights 
> > reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
> >          }
> >          else
> >          {
> > +          /* When babbling port policy option is enabled and
> > +             Threshold for disabling a "babbling" port is exceeded
*/
> > +          if ( p_rcv->p_subn->opt.babbling_port_policy &&
> > +               num_received >= 250 )
> > +          {
> > +            uint8_t               payload[IB_SMP_DATA_SIZE];
> > +            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> > +            const ib_port_info_t* p_old_pi;
> > +            osm_madw_context_t    context;
> > +
> > +            /* If trap 131, might want to disable peer port if
> > available */
> > +            /* but peer port has been observed not to respond to SM

> > + requests */
> > +
> > +            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                     "__osm_trap_rcv_process_request: ERR 3810: "
> > +                     " Disabling physical port lid:0x%02X
num:%u\n",
> > +
cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
> > +                     p_ntci->data_details.ntc_129_131.port_num
> > +                     );
> > +
> > +            p_old_pi = &p_physp->port_info;
> > +            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> > +
> > +            /* Set port to disabled/down */
> > +            ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> > +            ib_port_info_set_port_phys_state( 
> > + IB_PORT_PHYS_STATE_DISABLED, p_pi );
> > +
> > +            context.pi_context.node_guid = osm_node_get_node_guid(
> > osm_physp_get_node_ptr( p_physp ) );
> > +            context.pi_context.port_guid = osm_physp_get_port_guid(
> > p_physp );
> > +            context.pi_context.set_method = TRUE;
> > +            context.pi_context.update_master_sm_base_lid = FALSE;
> > +            context.pi_context.light_sweep = FALSE;
> > +            context.pi_context.active_transition = FALSE;
> > +
> > +            status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> > +                                   osm_physp_get_dr_path_ptr( p_physp ),
> > +                                   payload,
> > +                                   sizeof(payload),
> > +                                   IB_MAD_ATTR_PORT_INFO,
> > +                                   cl_hton32(osm_physp_get_port_num( p_physp )),
> > +                                   CL_DISP_MSGID_NONE,
> > +                                  &context );
> > +
> > +            if( status == IB_SUCCESS )
> > +            {
> > +               goto Exit;
> > +            }
> > +            else
> > +            {
> > +               osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                        "__osm_trap_rcv_process_request: ERR 3811: "
> > +                        "Request to set PortInfo failed\n" );
> > +            }
> > +          }
> > +
> >            osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
> >                     "__osm_trap_rcv_process_request: "
> >                     "Marking unhealthy physical port by lid:0x%02X num:%u\n",
> > 
> > 
> > 
> > 
> 


From ardavis at ichips.intel.com  Mon Jul  9 10:09:00 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 09 Jul 2007 10:09:00 -0700
Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net>
References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net><000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com>
	<349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net>
	<349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net>
Message-ID: <46926BAC.5080208@ichips.intel.com>

Tang, Changqing wrote:

>Sean:
>	I have 6 nodes with two IB cards on each node. If I configure
>the first card on all nodes as one subnet, the second card on all nodes
>as another subnet, Plus set arp_ignore=2, jobs on first subnet, or
>second subnet work fine.
>
>	But when I configure all 12 cards into a single subnet, jobs on
>all first cards work fine, but jobs on all second cards hang.
>
>	
>
Can you give us more information regarding your hang? Are you waiting 
for a connect request or reply? Does the server see a connect request?

-arlin


From vuhuong at mellanox.com  Mon Jul  9 10:21:24 2007
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 09 Jul 2007 10:21:24 -0700
Subject: [ofa-general] Generate ib_srpt.ko Failed!
In-Reply-To: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com>
References: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com>
Message-ID: <46926E94.80907@mellanox.com>

ljf,

> Dear,
>     I used OFED-1.2 to generate the SCSI Target modules, but when I 
> enter the command "./configure --with-srp-target-mod", many faults 
> occur. Most are kernel patch failures. My OS is CentOS 5.0, with kernel 
> version 2.6.18-8.el5. Can anyone give me some suggestions? Great 
> appreciation for any help!

Did you follow the instructions in this page - 
http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt 
- before the ./configure step?

thanks,
-vu

>     Thank you!
>
> yours,
> ljf


From caitlinb at broadcom.com  Mon Jul  9 10:26:16 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Mon, 9 Jul 2007 10:26:16 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <46924E70.2040205@Sun.COM>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>

 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Don Kerr
> Sent: Monday, July 09, 2007 8:04 AM
> To: Roland Dreier
> Cc: general
> Subject: Re: [ofa-general] uDAPL Question
> 
> Sorry. I was wrongly lumping port and HCA together.
> 
> 2 HCA cards each with 2 ports but only one port on one card 
> is operational and by that I mean can be pinged or seen as 
> "UP" when you run ifconfig. But both are still listed in the dat.conf.
> 
> -DON
> 
The DAT Registry allows for a provider to deregister itself, but
there are no guidelines as to when it should do so for indefinite
but non-permanent unavailability. I have always presumed that Host
OS standards for temporarily unavailable devices should be applied.


From Don.Kerr at Sun.COM  Mon Jul  9 10:55:54 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Mon, 09 Jul 2007 13:55:54 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <469276AA.3070606@Sun.COM>

OK, so no good way to determine this from uDAPL alone; it's expected that 
the provider will register/deregister with the file as needed.

Next question: is there a way to get the entire dat.conf entry from the 
uDAPL API?

Example: Typical dat.conf entry might look something like:
    OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so 
dapl.1.2 "ib0 0" ""

I can find the first field, in this example "OpenIB-cma", from the ia 
attribute name but what if I wanted to correlate say the 6th field, "ib0 
0", with the first field?

Thanks
-DON

Caitlin Bestler wrote:

> 
>
>  
>
>>-----Original Message-----
>>From: general-bounces at lists.openfabrics.org 
>>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Don Kerr
>>Sent: Monday, July 09, 2007 8:04 AM
>>To: Roland Dreier
>>Cc: general
>>Subject: Re: [ofa-general] uDAPL Question
>>
>>Sorry. I was wrongly lumping port and HCA together.
>>
>>2 HCA cards each with 2 ports but only one port on one card 
>>is operational and by that I mean can be pinged or seen as 
>>"UP" when you run ifconfig. But both are still listed in the dat.conf.
>>
>>-DON
>>
>>    
>>
>The DAT Registry allows for a provider to deregister itself, but
>there are no guidelines as to when it should do so for indefinite
>but non-permanent unavailability. I have always presumed that Host
>OS standards for temporarily unavailable devices should be applied.
>
>  
>


From ardavis at ichips.intel.com  Mon Jul  9 11:37:17 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 09 Jul 2007 11:37:17 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <46924E70.2040205@Sun.COM>
References: <469245BC.8040108@Sun.COM> <ada644t1vma.fsf@cisco.com>
	<46924E70.2040205@Sun.COM>
Message-ID: <4692805D.2000001@ichips.intel.com>

Don Kerr wrote:

> Sorry. I was wrongly lumping port and HCA together.
>
> 2 HCA cards each with 2 ports but only one port on one card is 
> operational and by that I mean can be pinged or seen as "UP" when you 
> run ifconfig. But both are still listed in the dat.conf.
>
dat.conf is simply a means of static device registration for providers. 
The default dat.conf provided with OFED includes examples for up to 4 
ports as well as a bonding example. It is up to the administrator to 
modify accordingly. The device is valid and configured properly if the 
open returns DAT_SUCCESS.

-arlin


From ardavis at ichips.intel.com  Mon Jul  9 12:36:09 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 09 Jul 2007 12:36:09 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <469276AA.3070606@Sun.COM>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
	<469276AA.3070606@Sun.COM>
Message-ID: <46928E29.8020903@ichips.intel.com>

Don Kerr wrote:

> OK, so no good way to determine this from uDAPL alone; it's expected 
> that the provider will register/deregister with the file as needed.
>
> Next question: is there a way to get the entire dat.conf entry from 
> the uDAPL API?
>
> Example: Typical dat.conf entry might look something like:
>    OpenIB-cma u1.2 nonthreadsafe default 
> /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
>
> I can find the first field, in this example "OpenIB-cma", from the ia 
> attribute name but what if I wanted to correlate say the 6th field, 
> "ib0 0", with the first field?
>
What are you trying to determine from this parsing? Do you need to 
actually know the netdev name or can you get by with the address of the 
device? If you are using dat_registry_list_providers(), just walk the 
list, use the device name for the dat_ia_open and if it returns 
DAT_SUCCESS the device is active and configured. You can then call 
dat_ia_query to get the IP address.

-arlin


From Don.Kerr at Sun.COM  Mon Jul  9 12:47:17 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Mon, 09 Jul 2007 15:47:17 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <46928E29.8020903@ichips.intel.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
	<469276AA.3070606@Sun.COM> <46928E29.8020903@ichips.intel.com>
Message-ID: <469290C5.6010709@Sun.COM>

I am working on a uDAPL layer for Open MPI.  The situation is if I have 
more than one port/HCA my users may want to be selective in what is used 
and to do this they would need to provide some information regarding 
which port/HCA to use. So my thought is that the users are more familiar 
with the output from "ifconfig", for example ib0, ib1, etc, and I was 
trying to find a way to correlate that to what is available from the 
uDAPL API. Maybe I need to reprogram them to look at dat.conf.
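
For illustration, a minimal sketch of reading dat.conf directly to correlate the
provider name with the netdev name. It assumes the quoted "ib0 0" string is the
seventh whitespace-separated field (counting from one) and that the netdev is its
first word, as in the example entry above. The function name and the second sample
entry (OpenIB-cma-1 on ib1) are made up for this sketch.

```python
import shlex


def parse_dat_conf(text):
    """Map each provider name (field 1) to the netdev named in field 7."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # shlex keeps a quoted string such as "ib0 0" together as one field.
        fields = shlex.split(line)
        if len(fields) < 7:
            continue  # not a complete entry; ignore it in this sketch
        provider_name, params = fields[0], fields[6]
        words = params.split()
        mapping[provider_name] = words[0] if words else None
    return mapping


# Sample text: the first entry is the one quoted above, the second is hypothetical.
SAMPLE = """
# sample dat.conf
OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so dapl.1.2 "ib1 0" ""
"""

if __name__ == "__main__":
    for name, netdev in parse_dat_conf(SAMPLE).items():
        print(name, "->", netdev)
```

With a table like this, a user could name ib0 or ib1 and the layer could look up
which provider name to hand to the open call.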

-DON

Arlin Davis wrote:

> Don Kerr wrote:
>
>> OK, so no good way to determine this from uDAPL alone; it's expected 
>> that the provider will register/deregister with the file as needed.
>>
>> Next question: is there a way to get the entire dat.conf entry from 
>> the uDAPL API?
>>
>> Example: Typical dat.conf entry might look something like:
>>    OpenIB-cma u1.2 nonthreadsafe default 
>> /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
>>
>> I can find the first field, in this example "OpenIB-cma", from the ia 
>> attribute name but what if I wanted to correlate say the 6th field, 
>> "ib0 0", with the first field?
>>
> What are you trying to determine from this parsing? Do you need to 
> actually know the netdev name or can you get by with the address of 
> the device? If you are using dat_registry_list_providers(), just walk 
> the list, use the device name for the dat_ia_open and if it returns 
> DAT_SUCCESS the device is active and configured. You can then call 
> dat_ia_query to get the IP address.
>
> -arlin


From ralph.campbell at qlogic.com  Mon Jul  9 13:29:29 2007
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Mon, 09 Jul 2007 13:29:29 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps
	wrong address
In-Reply-To: <adaejjq92p7.fsf@cisco.com>
References: <1183142276.18911.337.camel@brick.pathscale.com>
	<adaejjq92p7.fsf@cisco.com>
Message-ID: <1184012969.20509.0.camel@brick.pathscale.com>

I was on vacation last week, just going through emails today.

On Mon, 2007-07-02 at 09:43 -0700, Roland Dreier wrote:
> ralph -- how did you find this bug?  Hit it in practice or just code review?
> 
> I'm trying to decide whether to get this into 2.6.22, or whether it
> can wait for 2.6.23.
> 
>  - R.

I found it via code inspection.


From rdreier at cisco.com  Mon Jul  9 13:34:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 13:34:12 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up
	unmaps wrong address
In-Reply-To: <1184012969.20509.0.camel@brick.pathscale.com> (Ralph Campbell's
	message of "Mon, 09 Jul 2007 13:29:29 -0700")
References: <1183142276.18911.337.camel@brick.pathscale.com>
	<adaejjq92p7.fsf@cisco.com>
	<1184012969.20509.0.camel@brick.pathscale.com>
Message-ID: <adatzsdxqnv.fsf@cisco.com>

OK, thanks... I stuck it in 2.6.22 anyway since mst thought he saw a
related crash.


From rdreier at cisco.com  Mon Jul  9 14:16:19 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:16:19 -0700
Subject: [ofa-general] mthca use of dma_sync_single is bogus
Message-ID: <adalkdpxopo.fsf@cisco.com>

It seems the problems running mthca in a Xen domU have uncovered a bug
in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg()
and mthca_arbel_map_phys_fmr() to sync the MTTs that get written.
However, Documentation/DMA-API.txt says:

    void
    dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
    		enum dma_data_direction direction)

    synchronise a single contiguous or scatter/gather mapping.  All the
    parameters must be the same as those passed into the single mapping
    API.

and mthca is *not* following this clear rule: it is trying to sync
only a subrange of the mapping.  Later on in the document, there is:

    void
    dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
    		      unsigned long offset, size_t size,
    		      enum dma_data_direction direction)
    
    does a partial sync.  starting at offset and continuing for size.  You
    must be careful to observe the cache alignment and width when doing
    anything like this.  You must also be extra careful about accessing
    memory you intend to sync partially.

but that is in a section dealing with non-consistent memory so it's
not entirely clear to me whether it's kosher to use this as mthca
wants.
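
To make the difference between the two rules concrete, here is a toy userspace
model. It is only a sketch: the class and its methods are hypothetical and merely
mimic the documented parameter rules, they are not the kernel API.

```python
class ToyDmaMappings:
    """Toy bookkeeping for the documented sync rules (not kernel code)."""

    def __init__(self):
        self._sizes = {}  # dma_handle -> size given at map time

    def map_single(self, handle, size):
        self._sizes[handle] = size

    def sync_single(self, handle, size):
        # Rule: all parameters must match the ones used for the mapping.
        if self._sizes.get(handle) != size:
            raise ValueError("sync_single parameters differ from the mapping")

    def sync_single_range(self, handle, offset, size):
        # Rule: a partial sync names the mapping plus an offset and a length.
        total = self._sizes.get(handle)
        if total is None or offset < 0 or size <= 0 or offset + size > total:
            raise ValueError("range is not inside a known mapping")


if __name__ == "__main__":
    m = ToyDmaMappings()
    m.map_single(0x1000, 4096)          # whole table mapped once
    m.sync_single_range(0x1000, 64, 8)  # partial sync: accepted
    try:
        m.sync_single(0x1000 + 64, 8)   # subrange passed as if it were a mapping
    except ValueError as err:
        print("rejected:", err)
```

The last call is the pattern described above: a subrange of the mapping passed to
the whole-mapping sync.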

The other alternative is to put the MTT table in coherent memory just
like the MPT table.  That might be the best solution I suppose...

Michael, anyone else, thoughts on this?

 - R.


From keir at xensource.com  Mon Jul  9 14:31:49 2007
From: keir at xensource.com (Keir Fraser)
Date: Mon, 09 Jul 2007 22:31:49 +0100
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adalkdpxopo.fsf@cisco.com>
Message-ID: <C2B867D5.A917%keir@xensource.com>

One thought is that if you *do* move to dma_sync_single_range() then
lib/swiotlb.c still needs fixing. It's buggy in that
swiotlb_sync_single_range(dma_addr, offset) calls
swiotlb_sync_single(dma_addr+offset), and this will fail if the offset is
large enough that it ends up dereferencing a different slot index in
io_tlb_orig_addr.
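
A toy model of that failure, for illustration only. It assumes 2 KB slots
(IO_TLB_SHIFT = 11) and that the original address is recorded only for the first
slot of a mapping; the class and method names are made up for this sketch and
are not the kernel's code.

```python
IO_TLB_SHIFT = 11              # assumed slot size: 2 KB
SLOT_SIZE = 1 << IO_TLB_SHIFT


class ToyBounceBuffer:
    """Toy stand-in for the bounce-buffer slot table."""

    def __init__(self, start, nslots):
        self.start = start
        self.orig_addr = [None] * nslots  # plays the role of io_tlb_orig_addr

    def map_single(self, first_slot, nslots_used, orig, per_slot=False):
        if per_slot:
            # Record an original address for every slot of the mapping.
            for i in range(nslots_used):
                self.orig_addr[first_slot + i] = orig + i * SLOT_SIZE
        else:
            # Assumed behaviour: only the first slot is recorded.
            self.orig_addr[first_slot] = orig
        return self.start + first_slot * SLOT_SIZE

    def lookup(self, dma_addr):
        """Find the original address for dma_addr, or None if the slot is empty."""
        index = (dma_addr - self.start) >> IO_TLB_SHIFT
        base = self.orig_addr[index]
        if base is None:
            return None
        return base + ((dma_addr - self.start) & (SLOT_SIZE - 1))


if __name__ == "__main__":
    offset = 3 * SLOT_SIZE + 4  # large enough to land in another slot
    first_only = ToyBounceBuffer(start=0x100000, nslots=8)
    handle = first_only.map_single(2, 4, orig=0x5000)
    print(first_only.lookup(handle + offset))   # empty slot: lookup fails

    every_slot = ToyBounceBuffer(start=0x100000, nslots=8)
    handle = every_slot.map_single(2, 4, orig=0x5000, per_slot=True)
    print(hex(every_slot.lookup(handle + offset)))
```

Recording an original address for every slot of the mapping (per_slot=True here)
makes the lookup work at any offset inside the mapping.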

So, I should be able to get my swiotlb workaround fixes accepted upstream as
a genuine bug fix. :-)

dma_sync_single_range() looks to me to be the right thing for you to be
using. But I'm not a DMA-API expert.

 -- Keir

On 9/7/07 22:16, "Roland Dreier" <rdreier at cisco.com> wrote:

> It seems the problems running mthca in a Xen domU have uncovered a bug
> in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg()
> and mthca_arbel_map_phys_fmr() to sync the MTTs that get written.
> However, Documentation/DMA-API.txt says:
> 
>     void
>     dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
> enum dma_data_direction direction)
> 
>     synchronise a single contiguous or scatter/gather mapping.  All the
>     parameters must be the same as those passed into the single mapping
>     API.
> 
> and mthca is *not* following this clear rule: it is trying to sync
> only a subrange of the mapping.  Later on in the document, there is:
> 
>     void
>     dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
>      unsigned long offset, size_t size,
>      enum dma_data_direction direction)
>     
>     does a partial sync.  starting at offset and continuing for size.  You
>     must be careful to observe the cache alignment and width when doing
>     anything like this.  You must also be extra careful about accessing
>     memory you intend to sync partially.
> 
> but that is in a section dealing with non-consistent memory so it's
> not entirely clear to me whether it's kosher to use this as mthca
> wants.
> 
> The other alternative is to put the MTT table in coherent memory just
> like the MPT table.  That might be the best solution I suppose...
> 
> Michael, anyone else, thoughts on this?
> 
>  - R.


From rdreier at cisco.com  Mon Jul  9 14:29:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:29:40 -0700
Subject: [ofa-general] mthca use of dma_sync_single is bogus
In-Reply-To: <adalkdpxopo.fsf@cisco.com> (Roland Dreier's message of "Mon,
	09 Jul 2007 14:16:19 -0700")
References: <adalkdpxopo.fsf@cisco.com>
Message-ID: <ada8x9pxo3f.fsf@cisco.com>

 >     void
 >     dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
 >     		      unsigned long offset, size_t size,
 >     		      enum dma_data_direction direction)

It seems the document has bitrotted a little, since
dma_sync_single_range() doesn't actually exist for most architectures;
what is really implemented is dma_sync_single_range_for_cpu() and
dma_sync_single_range_for_device().  But assuming those are usable in
our situation, they seem to be exactly what we want.  I'll try to get
clarification from the DMA API experts (and also fix the documentation
in the kernel).

Unfortunately it seems like the kernel's swiotlb does not implement
the full DMA API so this won't actually fix Xen :(.

 - R.


From rdreier at cisco.com  Mon Jul  9 14:31:32 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:31:32 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <C2B867D5.A917%keir@xensource.com> (Keir Fraser's message of "Mon,
	09 Jul 2007 22:31:49 +0100")
References: <C2B867D5.A917%keir@xensource.com>
Message-ID: <ada4pkdxo0b.fsf@cisco.com>

 > One thought is that if you *do* move to dma_sync_single_range() then
 > lib/swiotlb.c still needs fixing. It's buggy in that
 > swiotlb_sync_single_range(dma_addr, offset) calls
 > swiotlb_sync_single(dma_addr+offset), and this will fail if the offset is
 > large enough that it ends up dereferencing a different slot index in
 > io_tlb_orig_addr.

Yes, I realized the same thing (our emails crossed).

 > So, I should be able to get my swiotlb workaround fixes accepted upstream as
 > a genuine bug fix. :-)

Yeah, seems so.

 > dma_sync_single_range() looks to me to be the right thing for you to be
 > using. But I'm not a DMA-API expert.

yes, I'll try to get confirmation from James Bottomley and/or Dave Miller
that it is the right thing to do (and also fix the documentation to
match what the kernel actually implements).

 - R.


From keir at xensource.com  Mon Jul  9 14:36:42 2007
From: keir at xensource.com (Keir Fraser)
Date: Mon, 09 Jul 2007 22:36:42 +0100
Subject: [ofa-general] mthca use of dma_sync_single is bogus
In-Reply-To: <ada8x9pxo3f.fsf@cisco.com>
Message-ID: <C2B868FA.A920%keir@xensource.com>

On 9/7/07 22:29, "Roland Dreier" <rdreier at cisco.com> wrote:

> Unfortunately it seems like the kernel's swiotlb does not implement
> the full DMA API so this won't actually fix Xen :(.

It implements the sync_single_range_for_{cpu,device} functions.

But we use our own swiotlb implementation anyway. arch/i386/kernel/swiotlb.c
in a Xen-patched tree is used by both i386/xen and x64/xen. We haven't yet
merged with main lib/swiotlb.c.

 -- Keir


From rdreier at cisco.com  Mon Jul  9 14:35:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:35:31 -0700
Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all
	non-LL UD QPs on eHCA2
In-Reply-To: <200707091527.14272.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Mon, 9 Jul 2007 15:27:13 +0200")
References: <200707091502.22407.fenkes@de.ibm.com>
	<200707091527.14272.fenkes@de.ibm.com>
Message-ID: <adazm25w998.fsf@cisco.com>

Out of curiosity, does this mean that a GRH will be sent on all UD
messages (for non-LL QPs)?

What decides if a QP is LL or not?

 - R.


From rdreier at cisco.com  Mon Jul  9 14:38:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:38:03 -0700
Subject: [ofa-general] Re: [PATCH 08/13] IB/ehca: Lock renaming,
	static initializers
In-Reply-To: <200707091529.04073.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Mon, 9 Jul 2007 15:29:03 +0200")
References: <200707091502.22407.fenkes@de.ibm.com>
	<200707091529.04073.fenkes@de.ibm.com>
Message-ID: <adavectw950.fsf@cisco.com>

 > +DEFINE_SPINLOCK(hcall_lock);

This can be static.  (I fixed it up when I applied the patch)


From mst at dev.mellanox.co.il  Mon Jul  9 14:39:13 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 00:39:13 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adalkdpxopo.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com>
Message-ID: <20070709213913.GB20052@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: mthca use of dma_sync_single is bogus
> 
> It seems the problems running mthca in a Xen domU have uncovered a bug
> in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg()
> and mthca_arbel_map_phys_fmr() to sync the MTTs that get written.
> However, Documentation/DMA-API.txt says:
> 
>     void
>     dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
>     		enum dma_data_direction direction)
> 
>     synchronise a single contiguous or scatter/gather mapping.  All the
>     parameters must be the same as those passed into the single mapping
>     API.
> 
> and mthca is *not* following this clear rule: it is trying to sync
> only a subrange of the mapping.

Yes, this looks like a bug.

> Later on in the document, there is:
> 
>     void
>     dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
>     		      unsigned long offset, size_t size,
>     		      enum dma_data_direction direction)
>     
>     does a partial sync.  starting at offset and continuing for size.  You
>     must be careful to observe the cache alignment and width when doing
>     anything like this.  You must also be extra careful about accessing
>     memory you intend to sync partially.
> 
> but that is in a section dealing with non-consistent memory so it's
> not entirely clear to me whether it's kosher to use this as mthca
> wants.

This is under Part II - Advanced dma_ usage - I don't think it's dealing with
non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this
looks like a good fit.  Most functions here work for both consistent and
non-consistent memory...  What makes you suspicious?

> The other alternative is to put the MTT table in coherent memory just
> like the MPT table.  That might be the best solution I suppose...
> 
> Michael, anyone else, thoughts on this?

Certainly easy ...

I'm concerned that MTTs need a fair amount of memory,
while the amount of coherent memory might be limited.
Not that non-coherent memory systems are widespread ...

-- 
MST


From rdreier at cisco.com  Mon Jul  9 15:11:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 15:11:42 -0700
Subject: [ofa-general] Re: [PATCH 00/13] IB/ehca: eHCA2 enablement & some
	fixes
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Mon, 9 Jul 2007 15:02:21 +0200")
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <adak5t9w7kx.fsf@cisco.com>

thanks, I applied these for 2.6.23 and fixed a bunch of minor things
that scripts/checkpatch.pl complained about (since I was in a mood to
do mindless things).  In the future please run that yourself and clean
up the obvious things.  I generally don't worry about the 80 column
stuff, but it will catch most whitespace problems and tell you that
foo(x,y) should be foo(x, y) etc.  So you don't have to completely
silence the script but at least take a look at the output.


From rick.jones2 at hp.com  Mon Jul  9 15:36:12 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 09 Jul 2007 15:36:12 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
Message-ID: <4692B85C.6020209@hp.com>

I've gotten around to loading-up the GA OFED 1.2 bits on a pair of RHEL5 
systems and was going to reproduce the tests I ran with the OFED 1.1 (?) 
bits which shipped with RHEL5.

However I've run into a little snag.

I've no idea which CPU ib_mthca will interrupt next.  ISTR (but could be 
wrong) that as I repeated a test with the 1.1 bits that the same CPU 
would be interrupted, but with 1.2 it seems that the 
card/firmware/whatever is deciding to migrate interrupts around.

I don't mind especially, I just want to know when/how it is going to do 
it, because I want to take measurements from when netperf/netserver is 
running on the CPU taking interrupts and when it is not.  That 
presupposes I know which CPU will take the interrupts.  I suppose I 
could just hit smp_affinity with a single CPU assignment, but I would 
like to avoid that if I can.

Bits of clue or pointers to fine manuals would be most appreciated,

rick jones

[root at hpcpc107 ~]# cat /proc/interrupts | grep ib
  77:    1331747     803705         80     732093       PCI-MSI-X 
ib_mthca (comp)
  78:       1506        172         42        123       PCI-MSI-X 
ib_mthca (async)


From rdreier at cisco.com  Mon Jul  9 15:40:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 15:40:53 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <4692B85C.6020209@hp.com> (Rick Jones's message of "Mon,
	09 Jul 2007 15:36:12 -0700")
References: <4692B85C.6020209@hp.com>
Message-ID: <adabqelw68a.fsf@cisco.com>

 > I've no idea which CPU ib_mthca will interrupt next.  ISTR (but could
 > be wrong) that as I repeated a test with the 1.1 bits that the same
 > CPU would be interrupted, but with 1.2 it seems that the
 > card/firmware/whatever is deciding to migrate interrupts around.

I don't think this is an OFED change but rather a kernel change.

Anyway, first make sure you don't have a userspace irq balancer
running.  (irqbalanced or something like that).

Then you can set IRQ affinity through 

    /proc/irq/77/smp_affinity

The file takes a bitmap of allowed CPUs.
(where 77 is your real IRQ number of course).
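
As a sketch of how to read or build such a bitmap (bit N set means CPU N is
allowed), assuming the usual hex text format; the helper names are made up for
this example:

```python
def cpus_in_mask(mask_text):
    """Return the CPU numbers whose bits are set in a hex affinity mask."""
    # Large systems may print the mask as comma-separated 32-bit hex groups.
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def mask_for_cpus(cpus):
    """Build the hex text for a set of allowed CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    print(cpus_in_mask("8"))    # a mask of 8 allows only CPU 3
    print(cpus_in_mask("f"))    # a mask of f allows CPUs 0 through 3
    print(mask_for_cpus([0]))   # text to write to pin the IRQ to CPU 0
```

The string from mask_for_cpus() is what would be written (as root) to the
smp_affinity file.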

 - R.


From rick.jones2 at hp.com  Mon Jul  9 15:41:31 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 09 Jul 2007 15:41:31 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <4692B85C.6020209@hp.com>
References: <4692B85C.6020209@hp.com>
Message-ID: <4692B99B.9050001@hp.com>


> [root at hpcpc107 ~]# cat /proc/interrupts | grep ib
>  77:    1331747     803705         80     732093       PCI-MSI-X 
> ib_mthca (comp)
>  78:       1506        172         42        123       PCI-MSI-X 
> ib_mthca (async)

and it seems all the more strange when I was looking at the 
smp_affinity, and it said the mask was "8" - for all four cores taking 
interrupts (well mostly) I would have expected a mask of "f"

rick jones


From rdreier at cisco.com  Mon Jul  9 15:42:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 15:42:49 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <4692B99B.9050001@hp.com> (Rick Jones's message of "Mon,
	09 Jul 2007 15:41:31 -0700")
References: <4692B85C.6020209@hp.com> <4692B99B.9050001@hp.com>
Message-ID: <ada7ip9w652.fsf@cisco.com>

 > and it seems all the more strange when I was looking at the
 > smp_affinity, and it said the mask was "8" - for all four cores taking
 > interrupts (well mostly) I would have expected a mask of "f"

Is your distro running irqbalanced or whatever the userspace irq
balancer is called?


From rick.jones2 at hp.com  Mon Jul  9 15:46:06 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 09 Jul 2007 15:46:06 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <adabqelw68a.fsf@cisco.com>
References: <4692B85C.6020209@hp.com> <adabqelw68a.fsf@cisco.com>
Message-ID: <4692BAAE.3080601@hp.com>

Roland Dreier wrote:
>  > I've no idea which CPU ib_mthca will interrupt next.  ISTR (but could
>  > be wrong) that as I repeated a test with the 1.1 bits that the same
>  > CPU would be interrupted, but with 1.2 it seems that the
>  > card/firmware/whatever is deciding to migrate interrupts around.
> 
> I don't think this is an OFED change but rather a kernel change.
> 
> Anyway, first make sure you don't have a userspace irq balancer
> running.  (irqbalanced or something like that).

Grrr - indeed that is what was happening, the blessed irqbalancer was 
running.  I run into that from time to time, then go run to/in an 
environment blissfully free from it and forget about its evil ways :(

It seems to have been entirely too aggressive here - changing the 
interrupt assignments between successive netperf runs.  I have decided 
to terminate it with extreme prejudice.

> 
> Then you can set IRQ affinity through 
> 
>     /proc/irq/77/smp_affinity
> 
> The file takes a bitmap of allowed CPUs.
> (where 77 is your real IRQ number of course).

Yep - once the wicked-irq-witch is dead does a:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex 
(Tavor compatibility mode) (rev 20)

naturally want to interrupt more than one CPU at a time?

thanks,

rick jones


From rdreier at cisco.com  Mon Jul  9 16:16:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 16:16:54 -0700
Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <4692BAAE.3080601@hp.com> (Rick Jones's message of "Mon,
	09 Jul 2007 15:46:06 -0700")
References: <4692B85C.6020209@hp.com> <adabqelw68a.fsf@cisco.com>
	<4692BAAE.3080601@hp.com>
Message-ID: <adazm25upzt.fsf@cisco.com>

 > 03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
 > (Tavor compatibility mode) (rev 20)
 > 
 > naturally want to interrupt more than one CPU at a time?

Not at the moment -- it only allocates one data-path MSI-X interrupt
for now, although in the future we may use more than one interrupt for
different queues etc.

 - R.


From rick.jones2 at hp.com  Mon Jul  9 17:02:02 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Mon, 09 Jul 2007 17:02:02 -0700
Subject: [ofa-general] minor usability nit with 1.2GA?
Message-ID: <4692CC7A.2050704@hp.com>

So I was blithely running my netperf tests after resolving the problem 
with the existence of irqbalance.  I finished my TCP tests and was about 
to run the SDP tests.  I'd not modprobe'd the ib_sdp module, so my 
netperf tests died.  I then did the modprobe and it complained about 
symbol versions.

Turns-out - or at least it seems that way - that my selection of just 
"basic" software didn't include SDP.  That's fine I suppose, but what 
happened then was I was left with a system with a hybrid of the previous 
OFED whatever bits (probably an RC for 1.2) and OFED GA bits.

Perhaps this is simply "caveat emptor" but shouldn't there be some sort 
of warning/check that in only doing the partial install there would be 
some incompatible modules left laying around?  Or should I just do the 
"give me everything" option, shut-up and benchmark?-)

rick jones


From stanleysufficool at roadrunner.com  Mon Jul  9 21:37:32 2007
From: stanleysufficool at roadrunner.com (Stanley Sufficool)
Date: Mon, 09 Jul 2007 21:37:32 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <46926868.8000704@mellanox.com>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>
	<46926868.8000704@mellanox.com>
Message-ID: <1184042252.15067.8.camel@gentoo-linux.localdomain>

I added a new wiki page based on Vu Pham's README and the issues I hit
with recent kernels. I hope to keep it current as I get our targets up
and running.

http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation

My setup is WinIB initiators --> Gentoo Linux SRP target.

Is there anything wrong with the above approach? I would be interested
in best practices if there are any. I saw a post about a CentOS target;
is that more stable or better performing?

Thanks.

On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote:

> Stanley Sufficool wrote:
> >   Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
> > 
> > Got the latest srpt from the git repository on OpenFabrics and had the 
> > following issues.
> > 
> > ib_srpt.c    Line 1997, missing second argument, should be?   
> > sdev->scst_tgt = scst_register(tp, NULL);
> > 
> 
> Yes. You need the change if you test with top of scst svn 
> trunk (or from version 0.9.6-pre2)
> If you test with scst before 0.9.6-pre2 (ie. version <= 
> 0.9.6-pre1) you don't need the second argument for 
> scst_register()
> 
> 
> > SCST was built successfully after fixing an issue in scst_vdisk.c 
> > (missing #include <linux/sched.h>)
> 
> 
> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
> - you should send the patch to scst devel
> 
> > 
> > Just thought this would be nice to have documented, took me half a day 
> > to track down as a novice in C programming.
> > 
> 
> there is *lean and mean* srpt's README in srpt_inc
> SCST also has some document
> You can add some wiki/notes for the problems in openfabrics 
> wiki page https://wiki.openfabrics.org/tiki-index.php
> 
> -vu
> 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > 
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 

From rdreier at cisco.com  Mon Jul  9 23:48:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 23:48:06 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070709213913.GB20052@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 10 Jul 2007 00:39:13 +0300")
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
Message-ID: <adamyy4vjo9.fsf@cisco.com>

 > >     void
 > >     dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
 > >     		      unsigned long offset, size_t size,
 > >     		      enum dma_data_direction direction)

 > This is under Part II - Advanced dma_ usage - I don't think it's dealing with
 > non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this
 > looks like a good fit.  Most functions here work for both consistent and
 > non-consistent memory...  What makes you suspicious?

I was suspicious because it is described between the main noncoherent
API stuff and dma_cache_sync().  But I think it is probably OK.

Unfortunately it is not that good a fit for our current code, since we
use pci_map_sg() to do the DMA mapping on the MTT memory instead of
dma_map_single().

 > I'm concerned that MTTs need a fair amount of memory,
 > while the amount of coherent memory might be limited.
 > Not that non-coherent memory systems are widespread ...

Yes, for example on ppc 4xx the amount of coherent memory is quite
small by default (address space for non-cached mappings is actually
what is limited, but it amounts to the same thing).

Maybe the least bad solution is to change to using dma_map_single()
instead of pci_map_sg() in mthca_memfree.c.

 - R.


From erezz at voltaire.com  Tue Jul 10 00:11:44 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 10 Jul 2007 10:11:44 +0300
Subject: [ofa-general] iSER header
In-Reply-To: <20070709144702.GB24125@postal.iol.unh.edu>
References: <20070709144702.GB24125@postal.iol.unh.edu>
Message-ID: <46933130.6040100@voltaire.com>

Ethan Burns wrote:

> Hello,
> 	I have been looking over the latest Linus git repo and I
> stumbled upon, what appears to be, an inconsistency between the iSER
> header used in the kernel and the latest iSER draft
> (draft-ietf-ips-iser-06.txt):
>
> struct iser_hdr {
>         u8      flags;
>         u8      rsvd[3];
>         __be32  write_stag; /* write rkey */
>         __be64  write_va;		<------------------------------
>         __be32  read_stag;  /* read rkey */
>         __be64  read_va;		<------------------------------
> } __attribute__((packed));
>
>
> The two fields `write_va' and `read_va' seem to be extra fields that are
> not defined by the draft.  Won't these fields present interoperability
> issues with conformant iSER implementations?
>
> Any information would be greatly appreciated.
>
> Ethan Burns
>   
The iSER header issue was discussed in the open-iscsi list:

http://groups.google.com/group/open-iscsi/browse_thread/thread/23ee18054e8412e6/fd4182f0b141c2da?lnk=gst&q=iSER%2FiWARP+Support+in+version+2.6.20&rnum=1#fd4182f0b141c2da

For some reason, another answer given by Mike Ko does not appear in this
thread. Here it is:

For Infiniband, if both the initiator and the target support Zero-Based
Virtual Address, then the iSER header as defined in the IETF draft will
be used. (Zero-based Virtual Address is used in iWARP but optional to
implement in Infiniband.) However, if either the initiator or the target
in an Infiniband environment does not support Zero-Based Virtual
Address, then the expanded iSER header as defined in the Infiniband
annex is used. This expanded iSER header is only used in Infiniband.
There is no intention to provide a link in the IETF draft since this is
purely an Infiniband issue.
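
To make the two layouts concrete, here is a rough sketch.  The expanded
struct copies the kernel definition quoted above; the draft layout is an
assumption derived from it by dropping the two virtual-address fields,
and the names iser_hdr_zbva, iser_hdr_expanded and iser_hdr_len are
hypothetical.  The packed attribute is GCC/Clang syntax:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header per the IETF draft, used when both sides support Zero-Based
 * Virtual Address (assumed layout: the kernel struct minus the VAs). */
struct iser_hdr_zbva {
	uint8_t  flags;
	uint8_t  rsvd[3];
	uint32_t write_stag;	/* write rkey, big-endian on the wire */
	uint32_t read_stag;	/* read rkey, big-endian on the wire */
} __attribute__((packed));

/* Expanded header from the InfiniBand annex, matching the kernel's
 * struct iser_hdr quoted in this thread. */
struct iser_hdr_expanded {
	uint8_t  flags;
	uint8_t  rsvd[3];
	uint32_t write_stag;
	uint64_t write_va;
	uint32_t read_stag;
	uint64_t read_va;
} __attribute__((packed));

/* Pick the header length: the short form is only usable when both the
 * initiator and the target support ZBVA. */
static size_t iser_hdr_len(int initiator_zbva, int target_zbva)
{
	assert(sizeof(struct iser_hdr_zbva) == 12);
	assert(sizeof(struct iser_hdr_expanded) == 28);

	if (initiator_zbva && target_zbva)
		return sizeof(struct iser_hdr_zbva);
	return sizeof(struct iser_hdr_expanded);
}
```

The 16-byte difference is exactly the two 64-bit VA fields, which is why
the two formats cannot be mixed on one connection.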

I hope this helps. BTW - do you plan to use the current iSER initiator
code for iWARP?

-- 

Erez Zilber | 972-9-971-7689
Software Engineer, Storage Team
Voltaire - The Grid Backbone
www.voltaire.com <http://www.voltaire.com/>


From mst at dev.mellanox.co.il  Tue Jul 10 00:15:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 10:15:47 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adamyy4vjo9.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com>
Message-ID: <20070710071547.GA3814@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: mthca use of dma_sync_single is bogus
> 
>  > >     void
>  > >     dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
>  > >     		      unsigned long offset, size_t size,
>  > >     		      enum dma_data_direction direction)
> 
>  > This is under Part II - Advanced dma_ usage - I don't think it's dealing with
>  > non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this
>  > looks like a good fit.  Most functions here work for both consistent and
>  > non-consistent memory...  What makes you suspicious?
> 
> I was suspicious because it is described between the main noncoherent
> API stuff and dma_cache_sync().  But I think it is probably OK.
> 
> Unfortunately it is not that good a fit for our current code, since we
> use pci_map_sg() to do the DMA mapping on the MTT memory instead of
> dma_map_single().
> 
>  > I'm concerned that MTTs need a fair amount of memory,
>  > while the amount of coherent memory might be limited.
>  > Not that non-coherent memory systems are widespread ...
> 
> Yes, for example on ppc 4xx the amount of coherent memory is quite
> small by default (address space for non-cached mappings is actually
> what is limited, but it amounts to the same thing).
> 
> Maybe the least bad solution is to change to using dma_map_single()
> instead of pci_map_sg() in mthca_memfree.c.

Hmm.
What makes you think dma_sync_single_range can't be used on memory mapped
by pci_map_sg/dma_map_sg?

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 10 00:19:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 10:19:12 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adaabue8qk1.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
	<20070702195927.GB31169@mellanox.co.il> <adaabue8qk1.fsf@cisco.com>
Message-ID: <20070710071912.GB3814@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC] sharing userspace IB objects
> 
>  > Could you please clarify how do you envision this done?
>  > Do we just create our own filesystem?
>  > 
>  > Reason I ask, we'll need something like this for SRC domain too ...
> 
> I don't have a really clear idea.  "Look at spufs" is about as far as
> I got.

That one is actually not very different from sysfs:
there just seems to be a set of pre-defined files.

The special nature of your suggested filesystem would be
that we actually let users create files there,
but then files need to disappear when the last user
closes the file.

Any more hints?

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 10 00:32:09 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 10:32:09 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <adaabue8qk1.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
	<20070702195927.GB31169@mellanox.co.il> <adaabue8qk1.fsf@cisco.com>
Message-ID: <20070710073209.GC3814@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH RFC] sharing userspace IB objects
> 
>  > Could you please clarify how do you envision this done?
>  > Do we just create our own filesystem?
>  > 
>  > Reason I ask, we'll need something like this for SRC domain too ...
> 
> I don't have a really clear idea.  "Look at spufs" is about as far as
> I got.

OK, here's a very simple idea; I'll demonstrate it with the SRC domain object:

- make it possible to map an SRC domain to an open fd
- verify that all processes map a specific SRC domain to the same inode

This way we don't need our own filesystem: any file can be used
to share SRC domains. Applications just need to pass some kind of
unique domain handle around; one way to do this would be for the app
to use a real file and actually write the handle value into this file.
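
The "same inode" check can be sketched in userspace with plain POSIX
calls.  This only illustrates the comparison, not the proposed kernel
interface; fds_share_inode and path_shares_inode_with_fd are
hypothetical names:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both descriptors refer to the same file (same device
 * and inode number), 0 if they do not, -1 if fstat() fails. */
static int fds_share_inode(int fd_a, int fd_b)
{
	struct stat st_a, st_b;

	if (fstat(fd_a, &st_a) < 0 || fstat(fd_b, &st_b) < 0)
		return -1;

	return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}

/* Convenience wrapper: open path read-only and compare it with fd.
 * Returns the same values as fds_share_inode(), -1 if open() fails. */
static int path_shares_inode_with_fd(const char *path, int fd)
{
	int other, ret;

	assert(path != NULL);

	other = open(path, O_RDONLY);
	if (other < 0)
		return -1;

	ret = fds_share_inode(fd, other);
	close(other);
	return ret;
}
```

Two processes that each open the same real file get different fd
numbers but the same (st_dev, st_ino) pair, which is the property the
idea relies on.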

How does this sound?

-- 
MST



From vlad at lists.openfabrics.org  Tue Jul 10 02:46:20 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 10 Jul 2007 02:46:20 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070710-0200 daily build status
Message-ID: <20070710094620.28C9AE60830@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From FENKES at de.ibm.com  Tue Jul 10 04:26:10 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 10 Jul 2007 13:26:10 +0200
Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all
	non-LL UD QPs on eHCA2
In-Reply-To: <adazm25w998.fsf@cisco.com>
Message-ID: <OF42327D02.A1FE96AF-ONC1257314.003E9788-C1257314.003EE7E8@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 09.07.2007 23:35:31:

> Out of curiosity, does this mean that a GRH will be sent on all UD
> messages (for non-LL QPs)?

No - the bit instructs the hardware to fetch the GRH parts of the QP
context.  The GRH will only be used if the WQE says so.

Joachim


From halr at voltaire.com  Tue Jul 10 04:27:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 07:27:32 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
Message-ID: <1184066851.25217.468533.camel@hal.voltaire.com>

OpenSM/osm_trap_rcv.c: Better trap 131 handling

When trap 131 occurs, check operational VLs and set port state to INIT
if needed.

I think this is what Amit was saying should be done in his emails
yesterday on the list.
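
For readers who want the decision logic without the diff context, here
is a standalone sketch.  It is not OpenSM code: the enum, struct and
function names are hypothetical, and only the branching mirrors the
patch below.  The port state values follow the PortInfo encoding, where
0 means "no change" on a Set:

```c
#include <assert.h>
#include <stdint.h>

/* Port state values as encoded in PortInfo. */
enum port_state {
	PS_NO_CHANGE = 0,
	PS_DOWN      = 1,
	PS_INIT      = 2,
	PS_ARMED     = 3,
	PS_ACTIVE    = 4
};

/* What the SM would put in the PortInfo Set after a trap 131. */
struct trap131_action {
	int             send_set;	/* nonzero: issue a PortInfo Set */
	enum port_state new_state;	/* state to request in the Set */
	uint8_t         op_vls;		/* OperationalVLs to request */
};

static struct trap131_action trap131_decide(enum port_state cur_state,
					    uint8_t cur_op_vls,
					    uint8_t calc_op_vls)
{
	struct trap131_action act = { 0, PS_NO_CHANGE, cur_op_vls };

	/* A reported port state is never "no change". */
	assert(cur_state >= PS_DOWN && cur_state <= PS_ACTIVE);

	/* Nothing to do for a port that is already down. */
	if (cur_state == PS_DOWN)
		return act;

	/* First, validate OperationalVLs against the calculated value. */
	act.op_vls = calc_op_vls;

	/* Then move the port to INIT unless it is already there. */
	act.new_state = (cur_state != PS_INIT) ? PS_INIT : PS_NO_CHANGE;
	act.send_set = 1;
	return act;
}
```

So an ACTIVE port with mismatched OperationalVLs gets a Set that both
corrects the VLs and requests INIT, while a port already in INIT only
has its VLs rewritten.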

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index f912dcd..f79c62f 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -550,16 +550,76 @@ __osm_trap_rcv_process_request(
         }
         else
         {
-          /* When babbling port policy option is enabled and
-             Threshold for disabling a "babbling" port is exceeded */
+          uint8_t               payload[IB_SMP_DATA_SIZE];
+          ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
+          const ib_port_info_t* p_old_pi;
+          osm_madw_context_t    context;
+
+          p_old_pi = &p_physp->port_info;
+          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
+
+          if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131))
+          {
+            uint8_t port_state, cur_opvls, opvls;
+
+            port_state = ib_port_info_get_port_state(p_old_pi);
+            if (port_state != IB_LINK_DOWN)
+            {
+              /* First, validate OperationalVLs */
+              cur_opvls = ib_port_info_get_op_vls(p_old_pi);
+              opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp);
+              if (opvls != cur_opvls)
+              {
+                osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+                        "__osm_trap_rcv_process_request: ERR 3809: "
+                        "Current OP_VLs %d New OP_VLs %d\n",
+                        cur_opvls, opvls);
+                ib_port_info_set_op_vls(p_pi, opvls);
+              }
+
+              /* Now, set port to INIT if not already in INIT */
+              if (port_state != IB_LINK_INIT)
+              {
+                ib_port_info_set_port_state( p_pi, IB_LINK_INIT );
+                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
+              }
+              else
+              {
+                ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE );
+                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
+              }
+
+              /* Now, issue set of PortInfo */
+              context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
+              context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
+              context.pi_context.set_method = TRUE;
+              context.pi_context.update_master_sm_base_lid = FALSE;
+              context.pi_context.light_sweep = FALSE;
+              context.pi_context.active_transition = FALSE;
+
+              status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
+                                     osm_physp_get_dr_path_ptr( p_physp ),
+                                     payload,
+                                     sizeof(payload),
+                                     IB_MAD_ATTR_PORT_INFO,
+                                     cl_hton32(osm_physp_get_port_num( p_physp )),
+                                     CL_DISP_MSGID_NONE,
+                                    &context );
+
+              if( status != IB_SUCCESS )
+              {
+                 osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                          "__osm_trap_rcv_process_request: ERR 3812: "
+                          "Request to set PortInfo failed\n" );
+              }
+            }
+         }
+ 
+         /* When babbling port policy option is enabled and
+            Threshold for disabling a "babbling" port is exceeded */
           if ( p_rcv->p_subn->opt.babbling_port_policy &&
                num_received >= 250 )
           {
-            uint8_t               payload[IB_SMP_DATA_SIZE];
-            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
-            const ib_port_info_t* p_old_pi;
-            osm_madw_context_t    context;
-
             /* If trap 131, might want to disable peer port if available */
             /* but peer port has been observed not to respond to SM requests */
 
@@ -570,9 +630,6 @@ __osm_trap_rcv_process_request(
                      p_ntci->data_details.ntc_129_131.port_num
                      );
 
-            p_old_pi = &p_physp->port_info;
-            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
-
             /* Set port to disabled/down */
             ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
             ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );


From vlad at dev.mellanox.co.il  Tue Jul 10 04:51:40 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 10 Jul 2007 14:51:40 +0300
Subject: [ofa-general] minor usability nit with 1.2GA?
In-Reply-To: <4692CC7A.2050704@hp.com>
References: <4692CC7A.2050704@hp.com>
Message-ID: <469372CC.5060207@dev.mellanox.co.il>

Rick Jones wrote:
> So I was blithely running my netperf tests after resolving the problem 
> with the existence of irqbalance.  I finished my TCP tests and was about 
> to run the SDP tests.  I'd not modprobe'd the ib_sdp module, so my 
> netperf tests died.  I then did the modprobe and it complained about 
> symbol versions.
> 
> It turns out - or at least it seems that way - that my selection of just 
> "basic" software didn't include SDP.  That's fine I suppose, but what 
> happened then was that I was left with a system with a hybrid of the 
> previous OFED bits (probably an RC for 1.2) and OFED GA bits.
> 
> Perhaps this is simply "caveat emptor", but shouldn't there be some sort 
> of warning/check that, in doing only the partial install, there would be 
> some incompatible modules left lying around?  Or should I just do the 
> "give me everything" option, shut up and benchmark?-)
> 

Hi,
OFED removes the previous software before installing the new one.
So, there shouldn't be a mix of different OFED versions on the same machine.

Can you send me the output of the following commands:
# modinfo ib_sdp
# rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the 
previous command)
# rpm -q kernel-ib
# ofed_info


Thanks,
Vladimir


From FENKES at de.ibm.com  Tue Jul 10 06:20:08 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 10 Jul 2007 15:20:08 +0200
Subject: [ofa-general] Re: [PATCH 00/13] IB/ehca: eHCA2 enablement & some
	fixes
In-Reply-To: <adak5t9w7kx.fsf@cisco.com>
Message-ID: <OF1EC84A1D.3C3D1851-ONC1257314.003EEB89-C1257314.00495729@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 10.07.2007 00:11:42:

> thanks, I applied these for 2.6.23 and fixed a bunch of minor things
> that scripts/checkpatch.pl complained about (since I was in a mood to
> do mindless things).

Thanks! Both for the quick merge and for the fixes!

> In the future please run that yourself and clean
> up the obvious things.  I generally don't worry about the 80 column
> stuff, but it will catch most whitespace problems and tell you that
> foo(x,y) should be foo(x, y) etc.  So you don't have to completely
> silence the script but at least take a look at the output.

Didn't know about that script before, so thanks for the pointer!
I'll be sure to pass the next set of patches through it.

Joachim


From RAISCH at de.ibm.com  Tue Jul 10 09:35:49 2007
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Tue, 10 Jul 2007 18:35:49 +0200
Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all
	non-LL UD QPs on eHCA2
In-Reply-To: <adazm25w998.fsf@cisco.com>
Message-ID: <OF5A4240D2.61426860-ONC1257314.0047B786-C1257314.004BE620@de.ibm.com>


> What decides if a QP is LL or not?
>
>  - R.
Currently we use a high bit in the QP type, which is not how we want to
keep it permanently.  What would you suggest: adding two additional LL QP
types, or changing something more fundamental in libibverbs and the
kernel IB core?

We think we can get along quite well with the existing parameters in the
current create QP.  The current user-kernel interface is OK for these new
QPs for post_send + post_recv, but unfortunately the libibverbs userspace
calls don't exactly match how the LL queues are to be used.  We would
need something like the LL QP interface from libehca in libibverbs to
keep that interface generic.

We haven't seen a use for LL QPs in the kernel yet, so maybe we should
continue that discussion on general at openfabrics only.  We could
provide example code in libehca/samples if needed.


Gruss / Regards
Christoph + Nam


From dotanb at dev.mellanox.co.il  Tue Jul 10 06:55:57 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 10 Jul 2007 16:55:57 +0300
Subject: [ofa-general] [PATCH] IB/core: Fix the used pointer when calling to
	kmalloc
Message-ID: <200707101655.58041.dotanb@dev.mellanox.co.il>

Use the correct pointer in the sizeof expression passed to kmalloc.

It is true that today in_mad and out_mad have the same type,
but this patch gives us cleaner code.
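
The idiom in question, shown as a userspace sketch with malloc/calloc
standing in for kmalloc/kzalloc.  The struct and function names are
hypothetical; the point is only that each sizeof names the pointer
being assigned, so the sizes stay right even if the two types diverge:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the in/out MAD buffers; the sizes are
 * deliberately different to show why the idiom matters. */
struct request  { char data[256]; };
struct response { char data[512]; };

/* Allocate both buffers; returns 0 on success, -1 on failure (in which
 * case nothing is leaked and both outputs are set to NULL). */
static int alloc_mads(struct request **in, struct response **out)
{
	struct request  *req;
	struct response *rsp;

	assert(in != NULL && out != NULL);

	req = calloc(1, sizeof *req);	/* zeroed, like kzalloc */
	rsp = malloc(sizeof *rsp);	/* sized from rsp, not from req */
	if (!req || !rsp) {
		free(req);
		free(rsp);
		*in = NULL;
		*out = NULL;
		return -1;
	}

	*in = req;
	*out = rsp;
	return 0;
}
```

With "sizeof *req" on the second allocation, the response buffer here
would silently be half the size it needs.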

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 08c299e..6265a3f 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -311,7 +311,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 		return sprintf(buf, "N/A (no PMA)\n");
 
 	in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
 	if (!in_mad || !out_mad) {
 		ret = -ENOMEM;
 		goto out;


From xhejtman at ics.muni.cz  Tue Jul 10 07:14:09 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Tue, 10 Jul 2007 16:14:09 +0200
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adamyy4vjo9.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com>
Message-ID: <20070710141409.GH3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 11:48:06PM -0700, Roland Dreier wrote:
> Yes, for example on ppc 4xx the amount of coherent memory is quite
> small by default (address space for non-cached mappings is actually
> what is limited, but it amounts to the same thing).
> 
> Maybe the least bad solution is to change to using dma_map_single()
> instead of pci_map_sg() in mthca_memfree.c.

And what about the attached patch to mthca_memfree? It replaces alloc_pages
with pci_alloc_consistent. With it, I can enable FMR and the driver runs fine.

Admittedly, it does not solve the problem with dma_sync_single() per se; on
the other hand, with pci_alloc_consistent() swiotlb is not needed, so
dma_sync_single() does nothing. But I agree it is not a conceptual fix.

-- 
Lukáš Hejtmánek
-------------- next part --------------
--- mthca_memfree.c.orig	2007-07-07 01:19:35.988558442 +0200
+++ mthca_memfree.c	2007-07-10 16:00:10.200488265 +0200
@@ -70,36 +70,27 @@
 		return;
 
 	list_for_each_entry_safe(chunk, tmp, &icm->chunk_list, list) {
-		if (coherent)
-			for (i = 0; i < chunk->npages; ++i) {
-				buf = lowmem_page_address(chunk->mem[i].page);
+		for (i = 0; i < chunk->npages; ++i) {
+			buf = lowmem_page_address(chunk->mem[i].page);
+			if(coherent)
 				dma_free_coherent(&dev->pdev->dev, chunk->mem[i].length,
 						  buf, sg_dma_address(&chunk->mem[i]));
-			}
-		else {
-			if (chunk->nsg > 0)
-				pci_unmap_sg(dev->pdev, chunk->mem, chunk->npages,
-					     PCI_DMA_BIDIRECTIONAL);
-
-			for (i = 0; i < chunk->npages; ++i)
-				__free_pages(chunk->mem[i].page,
-					     get_order(chunk->mem[i].length));
+			else
+				pci_free_consistent(dev->pdev, chunk->mem[i].length, buf, sg_dma_address(&chunk->mem[i]));
 		}
-
 		kfree(chunk);
 	}
 
 	kfree(icm);
 }
 
-static int mthca_alloc_icm_pages(struct scatterlist *mem, int order, gfp_t gfp_mask)
+static int mthca_alloc_icm_pages(struct pci_dev *pdev, struct scatterlist *mem, int order, gfp_t gfp_mask)
 {
-	mem->page = alloc_pages(gfp_mask, order);
-	if (!mem->page)
+	void *buf = pci_alloc_consistent(pdev, PAGE_SIZE << order, &sg_dma_address(mem));
+	if (!buf)
 		return -ENOMEM;
-
-	mem->length = PAGE_SIZE << order;
-	mem->offset = 0;
+	sg_set_buf(mem, buf, PAGE_SIZE << order);
+	sg_dma_len(mem) = PAGE_SIZE << order;
 	return 0;
 }
 
@@ -157,21 +148,13 @@
 						       &chunk->mem[chunk->npages],
 						       cur_order, gfp_mask);
 		else
-		       	ret = mthca_alloc_icm_pages(&chunk->mem[chunk->npages],
+		       	ret = mthca_alloc_icm_pages(dev->pdev, 
+						    &chunk->mem[chunk->npages],
 						    cur_order, gfp_mask);
 
 		if (!ret) {
 			++chunk->npages;
-
-			if (!coherent && chunk->npages == MTHCA_ICM_CHUNK_LEN) {
-				chunk->nsg = pci_map_sg(dev->pdev, chunk->mem,
-							chunk->npages,
-							PCI_DMA_BIDIRECTIONAL);
-
-				if (chunk->nsg <= 0)
-					goto fail;
-			}
-
+			++chunk->nsg;
 			if (chunk->npages == MTHCA_ICM_CHUNK_LEN)
 				chunk = NULL;
 
@@ -183,15 +166,6 @@
 		}
 	}
 
-	if (!coherent && chunk) {
-		chunk->nsg = pci_map_sg(dev->pdev, chunk->mem,
-					chunk->npages,
-					PCI_DMA_BIDIRECTIONAL);
-
-		if (chunk->nsg <= 0)
-			goto fail;
-	}
-
 	return icm;
 
 fail:

From suri at baymicrosystems.com  Tue Jul 10 07:24:11 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Tue, 10 Jul 2007 10:24:11 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap
	131Handling
In-Reply-To: <1184066851.25217.468533.camel@hal.voltaire.com>
References: <1184066851.25217.468533.camel@hal.voltaire.com>
Message-ID: <05b901c7c2fe$0006dd00$1914a8c0@surioffice>

Hal:

Shouldn't the port be set to "down"? I did not think you could set the port state to "init".

Thanks,
Suri

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf
> Of Hal Rosenstock
> Sent: Tuesday, July 10, 2007 7:28 AM
> To: general at lists.openfabrics.org
> Cc: Yevgeny Kliteynik
> Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131Handling
> 
> OpenSM/osm_trap_rcv.c: Better trap 131 handling
> 
> When trap 131 occurs, check operational VLs and set port state to INIT
> if needed.
> 
> I think this is what Amit was saying should be done in his emails
> yesterday on the list.
> 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index f912dcd..f79c62f 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request(
>          }
>          else
>          {
> -          /* When babbling port policy option is enabled and
> -             Threshold for disabling a "babbling" port is exceeded */
> +          uint8_t               payload[IB_SMP_DATA_SIZE];
> +          ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> +          const ib_port_info_t* p_old_pi;
> +          osm_madw_context_t    context;
> +
> +          p_old_pi = &p_physp->port_info;
> +          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> +
> +          if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131))
> +          {
> +            uint8_t port_state, cur_opvls, opvls;
> +
> +            port_state = ib_port_info_get_port_state(p_old_pi);
> +            if (port_state != IB_LINK_DOWN)
> +            {
> +              /* First, validate OperationalVLs */
> +              cur_opvls = ib_port_info_get_op_vls(p_old_pi);
> +              opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp);
> +              if (opvls != cur_opvls)
> +              {
> +                osm_log(p_rcv->p_log, OSM_LOG_ERROR,
> +                        "__osm_trap_rcv_process_request: ERR 3809: "
> +                        "Current OP_VLs %d New OP_VLs %d\n",
> +                        cur_opvls, opvls);
> +                ib_port_info_set_op_vls(p_pi, opvls);
> +              }
> +
> +              /* Now, set port to INIT if not already in INIT */
> +              if (port_state != IB_LINK_INIT)
> +              {
> +                ib_port_info_set_port_state( p_pi, IB_LINK_INIT );
> +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> +              }
> +              else
> +              {
> +                ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE );
> +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> +              }
> +
> +              /* Now, issue set of PortInfo */
> +              context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> +              context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> +              context.pi_context.set_method = TRUE;
> +              context.pi_context.update_master_sm_base_lid = FALSE;
> +              context.pi_context.light_sweep = FALSE;
> +              context.pi_context.active_transition = FALSE;
> +
> +              status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> +                                     osm_physp_get_dr_path_ptr( p_physp ),
> +                                     payload,
> +                                     sizeof(payload),
> +                                     IB_MAD_ATTR_PORT_INFO,
> +                                     cl_hton32(osm_physp_get_port_num( p_physp )),
> +                                     CL_DISP_MSGID_NONE,
> +                                    &context );
> +
> +              if( status != IB_SUCCESS )
> +              {
> +                 osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                          "__osm_trap_rcv_process_request: ERR 3812: "
> +                          "Request to set PortInfo failed\n" );
> +              }
> +            }
> +         }
> +
> +         /* When babbling port policy option is enabled and
> +            Threshold for disabling a "babbling" port is exceeded */
>            if ( p_rcv->p_subn->opt.babbling_port_policy &&
>                 num_received >= 250 )
>            {
> -            uint8_t               payload[IB_SMP_DATA_SIZE];
> -            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> -            const ib_port_info_t* p_old_pi;
> -            osm_madw_context_t    context;
> -
>              /* If trap 131, might want to disable peer port if available */
>              /* but peer port has been observed not to respond to SM requests */
> 
> @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request(
>                       p_ntci->data_details.ntc_129_131.port_num
>                       );
> 
> -            p_old_pi = &p_physp->port_info;
> -            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> -
>              /* Set port to disabled/down */
>              ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
>              ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> 
> 
> 


From halr at voltaire.com  Tue Jul 10 07:31:15 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 10:31:15 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap
	131Handling
In-Reply-To: <05b901c7c2fe$0006dd00$1914a8c0@surioffice>
References: <1184066851.25217.468533.camel@hal.voltaire.com>
	<05b901c7c2fe$0006dd00$1914a8c0@surioffice>
Message-ID: <1184077871.25217.481040.camel@hal.voltaire.com>

Suri,

On Tue, 2007-07-10 at 10:24, Suresh Shelvapille wrote:
> Hal:
> 
> Shouldn't the port be set to "down", I did not think you could set the portstate to "init".

Gak.. You are right; I forgot about the valid link state transitions. 

I will reissue the patch.

-- Hal
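
The point about valid link state transitions can be sketched in C. This is a minimal illustration, not OpenSM code: the enum and the helper `sm_settable_port_state()` are hypothetical names, and the numeric values are assumed to follow the usual PortInfo PortState encoding. It only checks whether a value may be written in a Set at all; it does not model whether Armed or Active is a legal transition from the port's current state.

```c
#include <stdbool.h>

/* PortState values as carried in PortInfo (assumed encoding). */
enum link_state {
    LINK_NO_CHANGE = 0,
    LINK_DOWN      = 1,
    LINK_INIT      = 2,
    LINK_ARMED     = 3,
    LINK_ACTIVE    = 4
};

/*
 * Hypothetical helper: is `requested` a PortState value that an SM may
 * write in a PortInfo Set?  INIT is reached by the port itself once the
 * link comes up, so it is not a value the SM can request.
 */
static bool sm_settable_port_state(enum link_state requested)
{
    switch (requested) {
    case LINK_NO_CHANGE:
    case LINK_DOWN:
    case LINK_ARMED:
    case LINK_ACTIVE:
        return true;
    default:
        return false; /* includes LINK_INIT */
    }
}
```

Under these assumptions, requesting DOWN is accepted and the port then returns to INIT on its own, which is the effect the first version of the patch tried to get by writing INIT directly.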

> Thanks,
> Suri


From halr at voltaire.com  Tue Jul 10 07:39:13 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 10:39:13 -0400
Subject: [ofa-general] [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
Message-ID: <1184078350.25217.481568.camel@hal.voltaire.com>

OpenSM/osm_trap_rcv.c: Better trap 131 handling

When trap 131 occurs, check operational VLs and set port state to DOWN
if needed.

I think this is what Amit was saying should be done in his emails
yesterday on the list (modified by Suri's comment).

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index f912dcd..3f60f3d 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -550,16 +550,76 @@ __osm_trap_rcv_process_request(
         }
         else
         {
-          /* When babbling port policy option is enabled and
-             Threshold for disabling a "babbling" port is exceeded */
+          uint8_t               payload[IB_SMP_DATA_SIZE];
+          ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
+          const ib_port_info_t* p_old_pi;
+          osm_madw_context_t    context;
+
+          p_old_pi = &p_physp->port_info;
+          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
+
+          if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131))
+          {
+            uint8_t port_state, cur_opvls, opvls;
+
+            port_state = ib_port_info_get_port_state(p_old_pi);
+            if (port_state != IB_LINK_DOWN)
+            {
+              /* First, validate OperationalVLs */
+              cur_opvls = ib_port_info_get_op_vls(p_old_pi);
+              opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp);
+              if (opvls != cur_opvls)
+              {
+                osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+                        "__osm_trap_rcv_process_request: ERR 3809: "
+                        "Current OP_VLs %d New OP_VLs %d\n",
+                        cur_opvls, opvls);
+                ib_port_info_set_op_vls(p_pi, opvls);
+              }
+
+              /* Now, set port to DOWN if not already in INIT */
+              if (port_state != IB_LINK_INIT)
+              {
+                ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
+                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
+              }
+              else
+              {
+                ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE );
+                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
+              }
+
+              /* Now, issue set of PortInfo */
+              context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
+              context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
+              context.pi_context.set_method = TRUE;
+              context.pi_context.update_master_sm_base_lid = FALSE;
+              context.pi_context.light_sweep = FALSE;
+              context.pi_context.active_transition = FALSE;
+
+              status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
+                                     osm_physp_get_dr_path_ptr( p_physp ),
+                                     payload,
+                                     sizeof(payload),
+                                     IB_MAD_ATTR_PORT_INFO,
+                                     cl_hton32(osm_physp_get_port_num( p_physp )),
+                                     CL_DISP_MSGID_NONE,
+                                    &context );
+
+              if( status != IB_SUCCESS )
+              {
+                 osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                          "__osm_trap_rcv_process_request: ERR 3812: "
+                          "Request to set PortInfo failed\n" );
+              }
+            }
+         }
+ 
+         /* When babbling port policy option is enabled and
+            Threshold for disabling a "babbling" port is exceeded */
           if ( p_rcv->p_subn->opt.babbling_port_policy &&
                num_received >= 250 )
           {
-            uint8_t               payload[IB_SMP_DATA_SIZE];
-            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
-            const ib_port_info_t* p_old_pi;
-            osm_madw_context_t    context;
-
             /* If trap 131, might want to disable peer port if available */
             /* but peer port has been observed not to respond to SM requests */
 
@@ -570,9 +630,6 @@ __osm_trap_rcv_process_request(
                      p_ntci->data_details.ntc_129_131.port_num
                      );
 
-            p_old_pi = &p_physp->port_info;
-            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
-
             /* Set port to disabled/down */
             ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
             ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );


From amitk at mellanox.co.il  Tue Jul 10 08:30:22 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Tue, 10 Jul 2007 18:30:22 +0300
Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
References: <1184078350.25217.481568.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>

Hi Hal,

One comment,
If one of the ports is not responsive for some reason, we need to move its
peer port to DOWN and then check the OPVLs.

Amit


From rdreier at cisco.com  Tue Jul 10 08:33:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 08:33:01 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710071547.GA3814@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 10 Jul 2007 10:15:47 +0300")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710071547.GA3814@mellanox.co.il>
Message-ID: <adabqekuvde.fsf@cisco.com>

 > What makes you think dma_sync_single_range can't be used on memory mapped
 > by pci_map_sg/dma_map_sg?

The fact that it's dma_sync_*SINGLE*_range, and that there's a
separate dma_sync_sg() function defined in DMA-API.txt.

 - R.


From swise at opengridcomputing.com  Tue Jul 10 08:50:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Jul 2007 10:50:53 -0500
Subject: [ofa-general] 2.6.22 nightly build failure
Message-ID: <4693AADD.5090506@opengridcomputing.com>

Vlad,

Do you know what's failing in the nightly build for 2.6.22?


Steve.


From vlad at mellanox.co.il  Tue Jul 10 08:51:49 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 10 Jul 2007 18:51:49 +0300
Subject: [ofa-general] RE: 2.6.22 nightly build failure
References: <4693AADD.5090506@opengridcomputing.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com>

SDP and RDS


Regards,
Vladimir


> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Tuesday, July 10, 2007 6:51 PM
> To: Vladimir Sokolovsky
> Cc: OpenFabrics General
> Subject: 2.6.22 nightly build failure
> 
> Vlad,
> 
> Do you know what's failing in the nightly build for 2.6.22?
> 
> 
> Steve.


From halr at voltaire.com  Tue Jul 10 09:23:51 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 12:23:51 -0400
Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
References: <1184078350.25217.481568.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
Message-ID: <1184084630.17622.3622.camel@hal.voltaire.com>

Hi Amit,

On Tue, 2007-07-10 at 11:30, Amit Krig wrote:
> Hi Hal,
> 
> One comment,
> If one of the port is not responsive for some reason, need to move its
> peer port to DOWN and then check the OPVL,

Guess I'm still not following you exactly yet. 

The code here is not determining the port responsiveness. It merely
triggers off trap 131, recalculates and resets OperationalVLs if needed,
and takes the port down at the link level, which should start it back
toward active, hopefully now with the proper OperationalVLs. If the SM
is still flooded with trap 131s, it disables the port.

-- Hal
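
The flow described above can be sketched as a small, self-contained decision function. This is a simplified illustration under stated assumptions, not the OpenSM implementation: `plan_trap_131()`, `struct port_request`, and the enums are hypothetical names, and only the threshold of 250 and the DOWN / NO_CHANGE choice are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* PortState values as carried in PortInfo (assumed encoding). */
enum link_state {
    LINK_NO_CHANGE = 0,
    LINK_DOWN      = 1,
    LINK_INIT      = 2,
    LINK_ARMED     = 3,
    LINK_ACTIVE    = 4
};

/* Matches the "num_received >= 250" check in the patch. */
#define BABBLING_THRESHOLD 250

/* Hypothetical summary of what the SM would do for one trap 131. */
struct port_request {
    bool            send_set;     /* issue the PortInfo Set (link-level reset) */
    uint8_t         op_vls;       /* OperationalVLs value to write */
    enum link_state link;         /* PortState to request in that Set */
    bool            disable_port; /* babbling policy tripped: disable the port */
};

static struct port_request plan_trap_131(enum link_state cur_state,
                                         uint8_t cur_op_vls,
                                         uint8_t calc_op_vls,
                                         bool babbling_policy,
                                         unsigned num_received)
{
    struct port_request req = { false, cur_op_vls, LINK_NO_CHANGE, false };

    /* Only touch ports that are not already down. */
    if (cur_state != LINK_DOWN) {
        req.send_set = true;
        /* Write the recalculated OperationalVLs (a no-op if unchanged). */
        req.op_vls = calc_op_vls;
        /* Bounce the link unless the port is already back in INIT. */
        req.link = (cur_state != LINK_INIT) ? LINK_DOWN : LINK_NO_CHANGE;
    }

    /* Separate step: too many traps and the policy is on, so disable. */
    if (babbling_policy && num_received >= BABBLING_THRESHOLD)
        req.disable_port = true;

    return req;
}
```

Keeping the link-level reset and the babbling-port disable as two independent decisions mirrors the structure of the patch, where the disable path runs regardless of whether the trap 131 branch sent a Set.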



From tziporet at dev.mellanox.co.il  Tue Jul 10 09:37:50 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 10 Jul 2007 19:37:50 +0300
Subject: [ofa-general] RE: 2.6.22 nightly build failure
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com>
References: <4693AADD.5090506@opengridcomputing.com>
	<6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com>
Message-ID: <4693B5DE.9050500@mellanox.co.il>

Vladimir Sokolovsky wrote:
> SDP and RDS are failing on the 2.6.22 kernel
>
>
> Regards,
> Vladimir
>
>   
>   
Vlad - please fix RDS
Jim - please fix SDP

Thanks,
Tziporet


From amitk at mellanox.co.il  Tue Jul 10 09:39:56 2007
From: amitk at mellanox.co.il (Amit Krig)
Date: Tue, 10 Jul 2007 19:39:56 +0300
Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
References: <1184078350.25217.481568.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
	<1184084630.17622.3622.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com>

Hi Hal,

The watchdog mechanism may make it hard to communicate with the end
node. That is why I suggest bringing down its peer port, which stops
the physical link from retraining all the time.

Amit 

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Tuesday, July 10, 2007 7:24 PM
To: Amit Krig
Cc: general at lists.openfabrics.org; Suresh Shelvapille; Yevgeny
Kliteynik; Eitan Zahavi
Subject: RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling

Hi Amit,

On Tue, 2007-07-10 at 11:30, Amit Krig wrote:
> Hi Hal,
> 
> One comment,
> If one of the ports is not responsive for some reason, we need to move its
> peer port to DOWN and then check the OPVL.
Guess I'm still not following you exactly yet. 

The code here is not determining the port responsiveness. It is merely
triggering off the trap 131, recalculating and resetting OperationalVLs
if needed, and taking the port down at the link level which should start
it back to active, hopefully now with the proper OperationalVLs. If it
is still flooded with trap 131s, it disables the port.
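
As a rough illustration of the flow described above, the decision logic
can be sketched as standalone C. This is a simplified model and not the
OpenSM code: the names link_state, trap131_plan, plan_trap131 and
should_disable_port are hypothetical, and the threshold of 250 is taken
from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the link states the patch tests against. */
enum link_state {
    LINK_NO_CHANGE = 0,
    LINK_DOWN      = 1,
    LINK_INIT      = 2,
    LINK_ARMED     = 3,
    LINK_ACTIVE    = 4
};

/* What would go into the Set(PortInfo) issued after a trap 131. */
struct trap131_plan {
    bool            send_set;      /* issue a Set(PortInfo) at all? */
    bool            update_op_vls; /* OperationalVLs need correcting? */
    uint8_t         op_vls;        /* value to write if update_op_vls */
    enum link_state new_state;     /* requested link state */
};

/* Threshold used by the babbling port check in the quoted patch. */
#define BABBLING_THRESHOLD 250u

/* Decide how to react to a trap 131 for a port in cur_state. */
struct trap131_plan plan_trap131(enum link_state cur_state,
                                 uint8_t cur_op_vls,
                                 uint8_t calc_op_vls)
{
    struct trap131_plan plan = { false, false, cur_op_vls, LINK_NO_CHANGE };

    /* A port that is already DOWN is left alone. */
    if (cur_state == LINK_DOWN)
        return plan;

    plan.send_set = true;

    /* First, validate OperationalVLs against the recalculated value. */
    if (calc_op_vls != cur_op_vls) {
        plan.update_op_vls = true;
        plan.op_vls = calc_op_vls;
    }

    /* Then take the link down unless it is already in INIT. */
    plan.new_state = (cur_state != LINK_INIT) ? LINK_DOWN : LINK_NO_CHANGE;
    return plan;
}

/* Separate check: disable the port once it is considered babbling. */
bool should_disable_port(bool babbling_port_policy, unsigned num_received)
{
    return babbling_port_policy && num_received >= BABBLING_THRESHOLD;
}
```

For example, an ACTIVE port whose OperationalVLs differ from the
recalculated value gets a Set(PortInfo) with the new value and link state
DOWN, while a port already in INIT only has its OperationalVLs corrected.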

-- Hal

> 
> Amit
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, July 10, 2007 5:39 PM
> To: general at lists.openfabrics.org
> Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi
> Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling
> 
> OpenSM/osm_trap_rcv.c: Better trap 131 handling
> 
> When trap 131 occurs, check operational VLs and set port state to DOWN
> if needed.
> 
> I think this is what Amit was saying should be done in his emails 
> yesterday on the list (modified by Suri's comment).
> 
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> 
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index f912dcd..3f60f3d 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request(
>          }
>          else
>          {
> -          /* When babbling port policy option is enabled and
> -             Threshold for disabling a "babbling" port is exceeded */
> +          uint8_t               payload[IB_SMP_DATA_SIZE];
> +          ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> +          const ib_port_info_t* p_old_pi;
> +          osm_madw_context_t    context;
> +
> +          p_old_pi = &p_physp->port_info;
> +          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> +
> +          if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131))
> +          {
> +            uint8_t port_state, cur_opvls, opvls;
> +
> +            port_state = ib_port_info_get_port_state(p_old_pi);
> +            if (port_state != IB_LINK_DOWN)
> +            {
> +              /* First, validate OperationalVLs */
> +              cur_opvls = ib_port_info_get_op_vls(p_old_pi);
> +              opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp);
> +              if (opvls != cur_opvls)
> +              {
> +                osm_log(p_rcv->p_log, OSM_LOG_ERROR,
> +                        "__osm_trap_rcv_process_request: ERR 3809: "
> +                        "Current OP_VLs %d New OP_VLs %d\n",
> +                        cur_opvls, opvls);
> +                ib_port_info_set_op_vls(p_pi, opvls);
> +              }
> +
> +              /* Now, set port to DOWN if not already in INIT */
> +              if (port_state != IB_LINK_INIT)
> +              {
> +                ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> +              }
> +              else
> +              {
> +                ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE );
> +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> +              }
> +
> +              /* Now, issue set of PortInfo */
> +              context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> +              context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> +              context.pi_context.set_method = TRUE;
> +              context.pi_context.update_master_sm_base_lid = FALSE;
> +              context.pi_context.light_sweep = FALSE;
> +              context.pi_context.active_transition = FALSE;
> +
> +              status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> +                                     osm_physp_get_dr_path_ptr( p_physp ),
> +                                     payload,
> +                                     sizeof(payload),
> +                                     IB_MAD_ATTR_PORT_INFO,
> +                                     cl_hton32(osm_physp_get_port_num( p_physp )),
> +                                     CL_DISP_MSGID_NONE,
> +                                    &context );
> +
> +              if( status != IB_SUCCESS )
> +              {
> +                 osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +                          "__osm_trap_rcv_process_request: ERR 3812: "
> +                          "Request to set PortInfo failed\n" );
> +              }
> +            }
> +         }
> + 
> +         /* When babbling port policy option is enabled and
> +            Threshold for disabling a "babbling" port is exceeded */
>            if ( p_rcv->p_subn->opt.babbling_port_policy &&
>                 num_received >= 250 )
>            {
> -            uint8_t               payload[IB_SMP_DATA_SIZE];
> -            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> -            const ib_port_info_t* p_old_pi;
> -            osm_madw_context_t    context;
> -
>              /* If trap 131, might want to disable peer port if available */
>              /* but peer port has been observed not to respond to SM requests */
>  
> @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request(
>                       p_ntci->data_details.ntc_129_131.port_num
>                       );
>  
> -            p_old_pi = &p_physp->port_info;
> -            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> -
>              /* Set port to disabled/down */
>              ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
>              ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> 
> 
> 


From weiny2 at llnl.gov  Tue Jul 10 09:46:59 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 10 Jul 2007 09:46:59 -0700
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com>
	<46826FB8.10904@hp.com> <46827BA0.6070008@hp.com>
	<1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com>
	<1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
Message-ID: <20070710094659.50df9b39.weiny2@llnl.gov>

On Thu, 28 Jun 2007 10:24:59 +0300
"Eitan Zahavi" <eitan at mellanox.co.il> wrote:

> > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > In the last months it is the second time I hear people 
> > complaining the 
> > > current monitoring solution in OFA is  integrated with OpenSM.
> > 
> > I must have missed this both times (didn't see this in Mark's 
> > post) and the statement itself is somewhat inaccurate as well.
> Private talks - I hope they will speak up for themselves now...
> > 
> > > These people do not use OpenSM but do use OFED.
> > 
> > I'm not sure I'm following what you mean here.
> > 
> > If you mean that some people want to run PerfMgr without the 
> > SM/SA aspects (so that they can run a vendor based SM), that 
> > is the next thing we are adding to the implementation.
> Exactly. OK when is that coming?

There is very little which ties the current PerfMgr to OpenSM.  Basically it
just gets the current fabric topology.  As Hal has said changes are coming.

>
> > 
> > >  Another drawback is that
> > > no naming is provided and the reporting uses GUIDs.
> > 
> > Naming is provided via NodeDescription.
> This might be good for hosts but is not covering  switches ...

It does include switches.  However, since most systems have the same name for
multiple switches this becomes ineffective.  I have queried Voltaire for a way
to change the NodeDescription for switches, but at the time I asked, there was
no way to do it.  Perhaps there is now?  What about other vendors?  This is why
ibnetdiscover and other diags have "switch map" support.  (A GUID->name mapping
to override the default NodeDescription.) Nothing would please me more than to
be able to remove that for a more "automatic" solution.
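
To make the "switch map" idea concrete, here is a minimal sketch of a
GUID->name override with a fallback to NodeDescription. It is an
illustration only, not the ibnetdiscover implementation: the struct
switch_map_entry and the function resolve_node_name are hypothetical
names, and the on-disk format of the map file is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* One hypothetical switch map entry: a node GUID and the name to report. */
struct switch_map_entry {
    uint64_t    guid;
    const char *name;
};

/*
 * Return the mapped name for guid if the map has an entry for it,
 * otherwise fall back to the NodeDescription string reported by the node.
 */
const char *resolve_node_name(const struct switch_map_entry *map,
                              size_t n_entries,
                              uint64_t guid,
                              const char *node_desc)
{
    for (size_t i = 0; i < n_entries; i++) {
        if (map[i].guid == guid)
            return map[i].name;
    }
    return node_desc;
}
```

With such a lookup, two switches that report the same NodeDescription can
still be told apart in reports, provided each GUID has its own map entry.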

> > 
> > > I also can't hold myself from saying again I think you are going to 
> > > hit the wall with the concept of doing the PMA from a single node.
> > 
> > If you are referring to the fact that the PerfMgr is currently not 
> > distributed, that will be done as has been stated before.
> Good. When is it expected? Will it be OFED 1.3?

When Hal first sent out the PerfMgr design I thought we should jump right to
the distributed model as well.  But now I am glad we have gone the way we did.
First off, we have something which "works" and from which we can expand.
Second, I have run some tests querying the fabric of our large clusters here
(~500 nodes) and the results were promising for a single node implementation.
I don't recall the numbers as this was a while ago but it was on the order of
<2 sec and I think <1 but I don't want to be misquoted.

For sure, a distributed model offers many advantages and we will get there.  But
for many the current single node approach should work just fine.

Thanks,
Ira

> 
> Thanks
> > 
> > -- Hal
> > 
> > > Eitan Zahavi
> > > Senior Engineering Director, Software Architect Mellanox 
> > Technologies 
> > > LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > 
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: general-bounces at lists.openfabrics.org
> > > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal 
> > > > Rosenstock
> > > > Sent: Wednesday, June 27, 2007 8:12 PM
> > > > To: Mark Seger
> > > > Cc: Finn, Ed; general at lists.openfabrics.org
> > > > Subject: Re: [ofa-general] IB performance stats (revisited)
> > > > 
> > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> > > > > >The performance managers deal with the counter stickiness (by 
> > > > > >resetting them when they think they need to). They
> > > > typically export
> > > > > >their data although this is not specified by IBA so it is
> > > > in a vendor
> > > > > >proprietary manner.
> > > > > >  
> > > > > >
> > > > > so I guess these guys are poor citizens as well...
> > > > 
> > > > Not sure what you mean.
> > > > 
> > > > > the real issue as I see it then means nobody can trust 
> > the data if 
> > > > > random tools randomly reset the counters.  a real shame...
> > > > 
> > > > I consider this to be a real rather than random app for this. 
> > > > Guess it depends on what one considers random.
> > > > 
> > > > -- Hal
> > > > 
> > > > > -mark
> > > > > 
> > > > > 
> > > > 
> > > > _______________________________________________
> > > > general mailing list
> > > > general at lists.openfabrics.org
> > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > 
> > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > > 
> > 
> > 


From vuhuong at mellanox.com  Tue Jul 10 09:55:00 2007
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 10 Jul 2007 09:55:00 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <1184042252.15067.8.camel@gentoo-linux.localdomain>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>	
	<46926868.8000704@mellanox.com>
	<1184042252.15067.8.camel@gentoo-linux.localdomain>
Message-ID: <4693B9E4.1070001@mellanox.com>


> Added a new wiki page based on Vu Pham's readme and issues with recent 
> kernels. I hope to keep it current as I get our targets up and running.
> 

Thanks for doing this.
Please use the latest readme from this link - 
http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt


> http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation 
> <https://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation>
> 
> WinIB initiators --> Gentoo Linux SRP Target.
> 

I mainly test Linux initiators with the gen2 SRP target. I have 
not tested the Windows SRP initiator with the target.

> Anything wrong with the above approach, I would be interested in a best 
> practices if there is one. I saw a CentOS target post, is this more 
> stable or better performing?

There is no difference when you run the same SRP target / 
SCST code on CentOS or RH/SuSE Linux distributions. The 
storage back-end determines the performance.

-vu

> 
> Thanks.
> 
> On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote:
>> Stanley Sufficool wrote:
>> >   Compiling on kernel 2.6.21-rc6 from kernel.org Torvalds' branch
>> > 
>> > Got the latest srpt from the git repository on OpenFabrics and had the 
>> > following issues.
>> > 
>> > ib_srpt.c    Line 1997, missing second argument, should be?   
>> > sdev->scst_tgt = scst_register(tp, NULL);
>> > 
>>
>> Yes. You need the change if you test with top of scst svn 
>> trunk (or from version 0.9.6-pre2)
>> If you test with scst before 0.9.6-pre2 (ie. version <= 
>> 0.9.6-pre1) you don't need the second argument for 
>> scst_register()
>>
>>
>> > SCST was built successfully after fixing an issue in scst_vdisk.c 
>> > (missing #include <linux/sched.h>)
>>
>>
>> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
>> - you should send the patch to scst devel
>>
>> > 
>> > Just thought this would be nice to have documented, took me half a day 
>> > to track down as a novice in C programming.
>> > 
>>
>> There is a *lean and mean* README for srpt in srpt_inc.
>> SCST also has some documentation.
>> You can add some wiki/notes for the problems in openfabrics 
>> wiki page https://wiki.openfabrics.org/tiki-index.php
>>
>> -vu
>>
>> > 
>>


From halr at voltaire.com  Tue Jul 10 10:04:57 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 13:04:57 -0400
Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131
	Handling
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com>
References: <1184078350.25217.481568.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
	<1184084630.17622.3622.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com>
Message-ID: <1184087090.17622.6458.camel@hal.voltaire.com>

Hi Amit,

On Tue, 2007-07-10 at 12:39, Amit Krig wrote:
> Hi Hal,
> 
> The watchdog mechanism may make it hard to communicate with the end
> node. That is why I suggest bringing down its peer port, which stops
> the physical link from retraining all the time.

The patch uses the port indicated in the trap. Are you saying that this
port will sometimes not respond to SMA requests (and that in those cases
the peer should be used, or at least tried)?

-- Hal

> Amit 
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, July 10, 2007 7:24 PM
> To: Amit Krig
> Cc: general at lists.openfabrics.org; Suresh Shelvapille; Yevgeny
> Kliteynik; Eitan Zahavi
> Subject: RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling
> 
> Hi Amit,
> 
> On Tue, 2007-07-10 at 11:30, Amit Krig wrote:
> > Hi Hal,
> > 
> > One comment,
> > If one of the ports is not responsive for some reason, we need to move its
> > peer port to DOWN and then check the OPVL.
> 
> Guess I'm still not following you exactly yet. 
> 
> The code here is not determining the port responsiveness. It is merely
> triggering off the trap 131, recalculating and resetting OperationalVLs
> if needed, and taking the port down at the link level which should start
> it back to active, hopefully now with the proper OperationalVLs. If it
> is still flooded with trap 131s, it disables the port.
> 
> -- Hal
> 
> > 
> > Amit
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Tuesday, July 10, 2007 5:39 PM
> > To: general at lists.openfabrics.org
> > Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi
> > Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling
> > 
> > OpenSM/osm_trap_rcv.c: Better trap 131 handling
> > 
> > When trap 131 occurs, check operational VLs and set port state to DOWN
> > if needed.
> > 
> > I think this is what Amit was saying should be done in his emails 
> > yesterday on the list (modified by Suri's comment).
> > 
> > Signed-off-by: Hal Rosenstock <halr at voltaire.com>
> > 
> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> > index f912dcd..3f60f3d 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request(
> >          }
> >          else
> >          {
> > -          /* When babbling port policy option is enabled and
> > -             Threshold for disabling a "babbling" port is exceeded */
> > +          uint8_t               payload[IB_SMP_DATA_SIZE];
> > +          ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> > +          const ib_port_info_t* p_old_pi;
> > +          osm_madw_context_t    context;
> > +
> > +          p_old_pi = &p_physp->port_info;
> > +          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> > +
> > +          if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131))
> > +          {
> > +            uint8_t port_state, cur_opvls, opvls;
> > +
> > +            port_state = ib_port_info_get_port_state(p_old_pi);
> > +            if (port_state != IB_LINK_DOWN)
> > +            {
> > +              /* First, validate OperationalVLs */
> > +              cur_opvls = ib_port_info_get_op_vls(p_old_pi);
> > +              opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp);
> > +              if (opvls != cur_opvls)
> > +              {
> > +                osm_log(p_rcv->p_log, OSM_LOG_ERROR,
> > +                        "__osm_trap_rcv_process_request: ERR 3809: "
> > +                        "Current OP_VLs %d New OP_VLs %d\n",
> > +                        cur_opvls, opvls);
> > +                ib_port_info_set_op_vls(p_pi, opvls);
> > +              }
> > +
> > +              /* Now, set port to DOWN if not already in INIT */
> > +              if (port_state != IB_LINK_INIT)
> > +              {
> > +                ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> > +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> > +              }
> > +              else
> > +              {
> > +                ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE );
> > +                ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi );
> > +              }
> > +
> > +              /* Now, issue set of PortInfo */
> > +              context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> > +              context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> > +              context.pi_context.set_method = TRUE;
> > +              context.pi_context.update_master_sm_base_lid = FALSE;
> > +              context.pi_context.light_sweep = FALSE;
> > +              context.pi_context.active_transition = FALSE;
> > +
> > +              status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> > +                                     osm_physp_get_dr_path_ptr( p_physp ),
> > +                                     payload,
> > +                                     sizeof(payload),
> > +                                     IB_MAD_ATTR_PORT_INFO,
> > +                                     cl_hton32(osm_physp_get_port_num( p_physp )),
> > +                                     CL_DISP_MSGID_NONE,
> > +                                    &context );
> > +
> > +              if( status != IB_SUCCESS )
> > +              {
> > +                 osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                          "__osm_trap_rcv_process_request: ERR 3812: "
> > +                          "Request to set PortInfo failed\n" );
> > +              }
> > +            }
> > +         }
> > + 
> > +         /* When babbling port policy option is enabled and
> > +            Threshold for disabling a "babbling" port is exceeded */
> >            if ( p_rcv->p_subn->opt.babbling_port_policy &&
> >                 num_received >= 250 )
> >            {
> > -            uint8_t               payload[IB_SMP_DATA_SIZE];
> > -            ib_port_info_t*       p_pi = (ib_port_info_t*)payload;
> > -            const ib_port_info_t* p_old_pi;
> > -            osm_madw_context_t    context;
> > -
> >              /* If trap 131, might want to disable peer port if available */
> >              /* but peer port has been observed not to respond to SM requests */
> >  
> > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request(
> >                       p_ntci->data_details.ntc_129_131.port_num
> >                       );
> >  
> > -            p_old_pi = &p_physp->port_info;
> > -            memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> > -
> >              /* Set port to disabled/down */
> >              ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> >              ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> > 
> > 
> > 
> 


From mst at dev.mellanox.co.il  Tue Jul 10 10:11:42 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 20:11:42 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adabqekuvde.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710071547.GA3814@mellanox.co.il>
	<adabqekuvde.fsf@cisco.com>
Message-ID: <20070710171142.GC11320@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: mthca use of dma_sync_single is bogus
> 
>  > What makes you think dma_sync_single_range can't be used on memory mapped
>  > by pci_map_sg/dma_map_sg?
> 
> The fact that it's dma_sync_*SINGLE*_range, and that there's a
> separate dma_sync_sg() function defined in DMA-API.txt.

Aha. I looked at the code a bit.
Basically it seems that some architectures use the dma handle
and some the virtual address to flush the cache, that's
where the requirement that same parameters are used for
sync single as for map single comes from.

So it seems that this requirement does not apply to s/g, and that we can just
build a scatterlist structure and do dma_sync_sg?

-- 
MST


From rick.jones2 at hp.com  Tue Jul 10 10:13:39 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Tue, 10 Jul 2007 10:13:39 -0700
Subject: [ofa-general] minor usability nit with 1.2GA?
In-Reply-To: <469372CC.5060207@dev.mellanox.co.il>
References: <4692CC7A.2050704@hp.com> <469372CC.5060207@dev.mellanox.co.il>
Message-ID: <4693BE43.8070905@hp.com>

Vladimir Sokolovsky wrote:
> Rick Jones wrote:
> 
>> So I was blithely running my netperf tests after resolving the problem 
>> with the existence of irqbalance.  I finished my TCP tests and was 
>> about to run the SDP tests.  I'd not modprobe'd the ib_sdp module, so 
>> my netperf tests died.  I then did the modprobe and it complained 
>> about symbol versions.
>>
>> Turns-out - or at least it seems that way - that my selection of just 
>> "basic" software didn't include SDP.  That's fine I suppose, but what 
>> happened then was I was left with a system with a hybrid of the 
>> previous OFED whatever bits (probably an RC for 1.2) and OFED GA bits.
>>
>> Perhaps this is simply "caveat emptor" but shouldn't there be some 
>> sort of warning/check that in only doing the partial install there 
>> would be some incompatible modules left laying around?  Or should I 
>> just do the "give me everything" option, shut-up and benchmark?-)
>>
> 
> Hi,
> OFED removes the previous software before installing the new one.
> So, there shouldn't be a mix of different OFED versions on the same 
> machine.
> 
> Can you send me the output of the following commands:
> # modinfo ib_sdp
> # rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the 
> previous command)
> # rpm -q kernel-ib
> # ofed_info

I can, but at this point I'm not sure what it would show since I went 
back and did a "build me one with everything" install on both my 
systems.  If you still want to see it I can do that though.

rick

> 
> 
> Thanks,
> Vladimir


From tziporet at dev.mellanox.co.il  Tue Jul 10 10:17:59 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 10 Jul 2007 20:17:59 +0300
Subject: [ofa-general] OFED 1.3 timeline
Message-ID: <4693BF47.8070700@mellanox.co.il>

Hi All,
Based on the requests to release OFED 1.3 this year, the release 
schedule is as follows:

    * Feature freeze - Sep 4
    * Alpha release - Sep 10
    * Beta release - Sep 25
    * RC1 - Oct 16
    * RC2 - Oct 30
    * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
    * RC4 - Nov 22
    * GA release - Nov 30 (or first week of Dec)


To meet this schedule we must implement all major changes to the 
package during July, so that we have a stable package by the middle of August.
We must also keep new features under control and not insert unnecessary 
changes that are not in the features list.

The full features list will be published in a separate mail.

Tziporet.



From rdreier at cisco.com  Tue Jul 10 11:04:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:04:40 -0700
Subject: [ofa-general] Re: [PATCH] IB/core: Fix the used pointer when calling
	to kmalloc
In-Reply-To: <200707101655.58041.dotanb@dev.mellanox.co.il> (Dotan Barak's
	message of "Tue, 10 Jul 2007 16:55:57 +0300")
References: <200707101655.58041.dotanb@dev.mellanox.co.il>
Message-ID: <adabqekb0ef.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Tue Jul 10 11:06:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:06:29 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710141409.GH3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Tue, 10 Jul 2007 16:14:09 +0200")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710141409.GH3885@ics.muni.cz>
Message-ID: <ada7ip8b0be.fsf@cisco.com>

 > And what about the attached patch to mthca_memfree? It changes alloc_pages for
 > pci_alloc_consistent. Using it, I can enable FMR and the driver runs fine.

As Michael said, this uses a lot of consistent memory.  Probably too
much on some systems.

 - R.


From rdreier at cisco.com  Tue Jul 10 11:09:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:09:01 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710171142.GC11320@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 10 Jul 2007 20:11:42 +0300")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710071547.GA3814@mellanox.co.il> <adabqekuvde.fsf@cisco.com>
	<20070710171142.GC11320@mellanox.co.il>
Message-ID: <ada3azwb076.fsf@cisco.com>

 > Aha. I looked at the code a bit.
 > Basically it seems that some architectures use the dma handle
 > and some the virtual address to flush the cache, that's
 > where the requirement that same parameters are used for
 > sync single as for map single comes from.
 > 
 > So it seems that this requirement does not apply to s/g, and that we can just
 > build a scatterlist structure and do dma_sync_sg?

The statement

    synchronise a single contiguous or scatter/gather mapping.  All the
    parameters must be the same as those passed into the single mapping
    API.

in DMA-API.txt also is clearly attached to dma_sync_sg().  So I don't
think it's a good idea to rely on being able to sync a different
scatterlist than the one that was originally mapped.

It actually doesn't look too bad to replace our use of pci_map_sg()
with dma_map_single(), at least at first glance.  I'll try to write a
patch later.

 - R.


From halr at voltaire.com  Tue Jul 10 11:23:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 14:23:53 -0400
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
Message-ID: <1184091830.17622.12007.camel@hal.voltaire.com>

On Tue, 2007-06-26 at 10:27, Tziporet Koren wrote:
> Hi All,
> 
> On next Monday we will have the first meeting to close OFED 1.3
> features and schedule.
> As a preparation I send here the list of features we already reviewed
> in Sonoma, and other features I see in progress on the general list
> discussions.
> 
> I know this is a long mail :-( but I ask each of the
> maintainers/customers to review this list and send comments and other
> requests.
> 
> There are some ULPs that I placed "?" and the owner should review and
> reply with the plans.
> 
> Thanks,
> Tziporet
> 
> 
> Main New Features
> ==============
> Base kernel: 2.6.23 (we will start with 2.6.22 but will move to
> 2.6.23)
> Install: 
>       * Minimize integration effort into OS distribution
>       * Break the packages RPMs (work with Novell and Redhat)
>         
> 
> 
> Package: 
>       * Sources arrangement for the end user (for the labs)
>       * Reduce compilation warnings
>         
> 
> 
> QoS:
>       * OSM
>       * CM & CMA
>       * ULPs: SDP, SRP, IPoIB, RDS?
>         
> 
> 
> Core: 
>       * Updated SA cache
>       * User space events registration
>       * Preparations for IB routers
>         
> 
> 
> libibverbs:
>       * New verbs: 
>               * Scalable Reliable Connected Transport (with Mellanox
>                 ConnectX)
>               * Shared Send Queue
>               * Reliable Multicast ?
>                 
>         
> 
> 
> Management:
>       * Multiple partitions
>       * OpenSM
>               * More routing performance improvements
>               * Even more speedups
>               * Better packaging/installation
>               * “Native” daemon mode
>               * Performance management
>               * Quality of Service manager: Based on IBTA annex

enhancements for fat tree routing (non pure tree support)
more console commands and telnet access to console

>       * More diagnostics - Hal please update

ibsim - IB management simulator which can be used without OpenSM
recompilation and supports the diag tools

ibidsverify.pl: validate LIDs and GUIDs in subnet
Updated ibnetdiscover format with link width and speed, and GUIDs
ibnetdiscover grouping support for new Voltaire chassis
diag updates for IB router support
iblinkinfo.pl: Support peer port link width and speed validation
ibdatacounters: Add script and man page for subnet wide data counters
saquery enhancements

> ULPs:
>       * IPoIB: NAPI; CM in GA; Bonding in GA
>       * NFS over RDMA integration
>       * RDS: RDMA API (using FMRs); GA quality with Oracle 11
>       * SDP: Keepalive; Asynch IO (Zero Copy)
>       * SRP: HA in GA
>       * VNIC: ? Qlogic - please update
>       * iSER: ? Voltaire - please update
>       * uDAPL - ? Arlin please update
>         
> 
> 
> iWARP: (Steve please update if needed)
>       * iwarp-specific verbs
>       * iwarp-specific async events
>       * API for MPA options (CRC/Markers)
>       * API for streaming mode IO (needed for compliant iSER)
>       * Possibly other ULPs (RDS, SDP, iSER)
>         
> 
> 
> MPIs:
> Integrate the new MPI releases that are on time for OFED 1.3
>       * Jeff - please update about Open MPI
>       * DK: Please update regarding MVAPICH and MVAPICH2
>         
> 
> 
> OFED 1.3 System Matrix
>       * CPU Arch: X86, x86_64, PPC64, ia64
>       * kernel.org: kernel 2.6.23
>       * Novell: SLES 10; SLES 10 SP1
>       * Redhat: RHEL 4 (up4 and up5); RHEL 5 (can we drop RHEL4up4,
>         since up6 will probably be out by the time this release is out?)
>       * Free distros (Fedora, SuSE Pro, Ubuntu) - basic testing only
>         
>         
> 
> 
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
> 
> 
> 
> ______________________________________________________________________
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From rdreier at cisco.com  Tue Jul 10 11:29:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:29:38 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070710071912.GB3814@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 10 Jul 2007 10:19:12 +0300")
References: <20070625130604.GH15343@mellanox.co.il>
	<aday7i7wye1.fsf@cisco.com> <20070626070641.GM15343@mellanox.co.il>
	<adahcouv2mi.fsf@cisco.com> <20070630222419.GE7554@mellanox.co.il>
	<adar6nq92zd.fsf@cisco.com> <20070702195927.GB31169@mellanox.co.il>
	<adaabue8qk1.fsf@cisco.com> <20070710071912.GB3814@mellanox.co.il>
Message-ID: <aday7ho9kod.fsf@cisco.com>

 > That one is actually not very different from sysfs:
 > there just seems to be a set of pre-defined files.

I thought there was a special system call to create stuff or
something.  Anyway I haven't looked in a long time.

 > The special nature of your suggested filesystem would be
 > that we actually let users create files there,
 > but then files need to disappear when the last user
 > closes the file.

Yes, that's true.  Phrased that way it does seem tricky.


From mst at dev.mellanox.co.il  Tue Jul 10 11:30:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 21:30:06 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <ada3azwb076.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710071547.GA3814@mellanox.co.il>
	<adabqekuvde.fsf@cisco.com> <20070710171142.GC11320@mellanox.co.il>
	<ada3azwb076.fsf@cisco.com>
Message-ID: <20070710183006.GE11320@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: mthca use of dma_sync_single is bogus
> 
>  > Aha. I looked at the code a bit.
>  > Basically it seems that some architectures use the dma handle
>  > and some the virtual address to flush the cache, that's
>  > where the requirement that same parameters are used for
>  > sync single as for map single comes from.
>  > 
>  > So it seems that this requirement does not apply to s/g, and that we can just
>  > build a scatterlist structure and do dma_sync_sg?
> 
> The statement
> 
>     synchronise a single contiguous or scatter/gather mapping.  All the
>     parameters must be the same as those passed into the single mapping
>     API.
> 
> in DMA-API.txt also is clearly attached to dma_sync_sg().  So I don't
> think it's a good idea to rely on being able to sync a different
> scatterlist than the one that was originally mapped.

Hmm. This means there's no way to sync a range within
a mapping created with map_sg?

> It actually doesn't look too bad to replace our use of pci_map_sg()
> with dma_map_single(), at least at first glance.  I'll try to write a
> patch later.

Well, the reason map_sg is there is presumably because on some
architectures it's worth it to try and make the region contiguous in DMA space.
But I agree this seems the lesser evil at this point ...

-- 
MST


From rdreier at cisco.com  Tue Jul 10 11:31:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:31:36 -0700
Subject: [ofa-general] Re: consumer data buffer ownership for inline sends
In-Reply-To: <Pine.LNX.4.64.0707031144130.15147@zuben> (Or Gerlitz's message
	of "Tue, 3 Jul 2007 11:50:52 +0300 (IDT)")
References: <Pine.LNX.4.64.0707031144130.15147@zuben>
Message-ID: <adar6ng9kl3.fsf@cisco.com>

 > Does this mean that for inline sends, when ibv_post_send returns,
 > the consumer owns the data buffer associated with this send again?
 > 
 > Can this be stated as the official policy of libibverbs?

This does seem fine.  Can you send a documentation patch stating this?


From mst at dev.mellanox.co.il  Tue Jul 10 11:37:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 21:37:05 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <aday7ho9kod.fsf@cisco.com>
References: <20070625130604.GH15343@mellanox.co.il> <aday7i7wye1.fsf@cisco.com>
	<20070626070641.GM15343@mellanox.co.il> <adahcouv2mi.fsf@cisco.com>
	<20070630222419.GE7554@mellanox.co.il> <adar6nq92zd.fsf@cisco.com>
	<20070702195927.GB31169@mellanox.co.il> <adaabue8qk1.fsf@cisco.com>
	<20070710071912.GB3814@mellanox.co.il> <aday7ho9kod.fsf@cisco.com>
Message-ID: <20070710183705.GF11320@mellanox.co.il>

>  > The special nature of your suggested filesystem would be
>  > that we actually let users create files there,
>  > but then files need to disappear when the last user
>  > closes the file.
> 
> Yes, that's true.  Phrased that way it does seem tricky.

OK, so how about the idea to just pass in *any* fd, and just
create a mapping between an inode and an SRC domain (or other shared object),
by means of a radix tree or something like this.

We can then use the mapping to check permissions when new
processes want to attach to an existing object.

Hmm?
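
A rough user-space sketch of the mapping idea, for illustration only: it
keys a shared object on the (device, inode) pair behind an arbitrary fd and
uses a small linear table where the proposal suggests a radix tree. The
names shared_table, shared_register() and shared_lookup() are hypothetical
and are not part of any existing verbs or kernel API. Object lifetime
(removing the entry when the last user goes away) is not handled here.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers of this sketch */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>     /* close(), for callers of this sketch */

#define MAX_SHARED 16   /* toy capacity, chosen arbitrarily */

/* One registered object, identified by the inode behind the fd. */
struct shared_entry {
    dev_t dev;
    ino_t ino;
    int   object_id;
};

struct shared_table {
    struct shared_entry e[MAX_SHARED];
    int count;
};

/* Associate object_id with whatever inode fd refers to. */
static int shared_register(struct shared_table *t, int fd, int object_id)
{
    struct stat st;

    if (t->count >= MAX_SHARED || fstat(fd, &st) != 0)
        return -1;
    t->e[t->count].dev = st.st_dev;
    t->e[t->count].ino = st.st_ino;
    t->e[t->count].object_id = object_id;
    t->count++;
    return 0;
}

/*
 * Attach check: a process that can open the same file (and so passed the
 * normal file permission checks) finds the object; any other fd does not.
 * Returns the object id, or -1 if the inode is not registered.
 */
static int shared_lookup(const struct shared_table *t, int fd)
{
    struct stat st;
    int i;

    if (fstat(fd, &st) != 0)
        return -1;
    for (i = 0; i < t->count; i++)
        if (t->e[i].dev == st.st_dev && t->e[i].ino == st.st_ino)
            return t->e[i].object_id;
    return -1;
}
```

Two independent open() calls on the same path yield different fds but the
same inode, so both resolve to the same object, while an fd for a different
file resolves to nothing.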

-- 
MST


From sashak at voltaire.com  Tue Jul 10 11:52:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 10 Jul 2007 21:52:36 +0300
Subject: [ofa-general] [PATCH] opensm/updn: up/down root switches detector
	fix
Message-ID: <20070710185236.GW25653@sashak.voltaire.com>


This problem was triggered by the min hops generator optimizations, where
min hop matrices are created for switches only. The up/down root switch
auto-detector code that uses those tables is outdated: it still tries to
count hops to CAs directly and now gets only the value 0xff (no path), so
all fabric switches are considered to be roots. This patch updates the
root auto-detector code to match the recent min hops optimizations and
fixes the issue.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_updn.c |   70 ++++++++++++++++++++--------------------
 1 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
index db8e60a..11f6eb5 100644
--- a/opensm/opensm/osm_ucast_updn.c
+++ b/opensm/opensm/osm_ucast_updn.c
@@ -718,8 +718,8 @@ __osm_updn_find_root_nodes_by_min_hop(
   uint8_t       maxHops = 0; /* contain the max histogram index */
   uint64_t     *p_guid;
   cl_list_t    *p_root_nodes_list = p_updn->p_root_nodes;
-  cl_map_t      ca_by_lid_map; /* map holding all CA lids  */
-  uint16_t self_lid_ho;
+  unsigned *cas_per_sw;
+  uint16_t sw_lid_ho;
 
   OSM_LOG_ENTER( &p_osm->log, osm_updn_find_root_nodes_by_min_hop );
 
@@ -729,8 +729,15 @@ __osm_updn_find_root_nodes_by_min_hop(
            cl_qmap_count(&p_osm->subn.port_guid_tbl) );
   /* Init the required vars */
   cl_qmap_init( &min_hop_hist );
-  cl_map_construct( &ca_by_lid_map );
-  cl_map_init( &ca_by_lid_map, 10 );
+
+  cas_per_sw = malloc((IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw));
+  if (!cas_per_sw) {
+    osm_log( &p_osm->log, OSM_LOG_ERROR,
+             "__osm_updn_find_root_nodes_by_min_hop: "
+             "cannot alloc mem for CAs per switch counter array.\n");
+    goto _exit;
+  }
+  memset(cas_per_sw, 0, (IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw));
 
   /* EZ:
      p_ca_list = (cl_list_t*)malloc(sizeof(cl_list_t)); 
@@ -752,21 +759,19 @@ __osm_updn_find_root_nodes_by_min_hop(
   while( p_next_port != (osm_port_t*)cl_qmap_end( &p_osm->subn.port_guid_tbl ) ) {
     p_port = p_next_port;
     p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item );
-    if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH )
+    if ( !p_port->p_node->sw )
     {
-      p_physp = p_port->p_physp;
-      self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) );
-      numCas++;
-      /* EZ:
-         self = malloc(sizeof(uint16_t));
-         *self = self_lid_ho;
-         cl_list_insert_tail(p_ca_list, self);
-      */
-      cl_map_insert( &ca_by_lid_map, self_lid_ho, (void *)0x1);
+      p_physp = p_port->p_physp->p_remote_physp;
+      if (!p_physp || !p_physp->p_node->sw)
+        continue;
+      sw_lid_ho = osm_node_get_base_lid(p_physp->p_node, 0);
+      sw_lid_ho = cl_ntoh16(sw_lid_ho);
       osm_log( &p_osm->log, OSM_LOG_DEBUG,
                "__osm_updn_find_root_nodes_by_min_hop: "
-               "Inserting GUID 0x%" PRIx64 ", Lid: 0x%X into array\n",
-               cl_ntoh64(osm_port_get_guid(p_port)), self_lid_ho );
+               "Inserting GUID 0x%" PRIx64 ", sw lid: 0x%X into array\n",
+               cl_ntoh64(osm_port_get_guid(p_port)), sw_lid_ho );
+      cas_per_sw[sw_lid_ho]++;
+      numCas++;
     }
   }
   osm_log( &p_osm->log, OSM_LOG_DEBUG,
@@ -792,10 +797,10 @@ __osm_updn_find_root_nodes_by_min_hop(
        rebuild its FWD tables, post setting Min Hop Tables */
     max_lid_ho = p_sw->max_lid_ho;
     /* Get base lid of switch by retrieving port 0 lid of node pointer */
-    self_lid_ho = cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) );
+    sw_lid_ho = cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) );
     osm_log( &p_osm->log, OSM_LOG_DEBUG,
              "__osm_updn_find_root_nodes_by_min_hop: "
-             "Passing through switch lid 0x%X\n", self_lid_ho );
+             "Passing through switch lid 0x%X\n", sw_lid_ho );
     for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++)
     {
       /* Skip lids which are not CAs or RTRs - 
@@ -816,7 +821,7 @@ __osm_updn_find_root_nodes_by_min_hop(
          }
          if ( LidFound )
       */
-      if (cl_map_get( &ca_by_lid_map, lid_ho ))
+      if (cas_per_sw[lid_ho])
       {
         hop_val = osm_switch_get_least_hops( p_sw, lid_ho );
         if (hop_val > maxHops)
@@ -828,22 +833,19 @@ __osm_updn_find_root_nodes_by_min_hop(
           /* New entry in the histogram, first create it */
           p_updn_hist = (updn_hist_t*) malloc(sizeof(updn_hist_t));
           CL_ASSERT(p_updn_hist);
-          p_updn_hist->bar_value = 1;
+          p_updn_hist->bar_value = 0;
           cl_qmap_insert(&min_hop_hist, (uint64_t)hop_val, &p_updn_hist->map_item);
           osm_log( &p_osm->log, OSM_LOG_DEBUG,
                    "__osm_updn_find_root_nodes_by_min_hop: "
-                   "Creating new entry in histogram %u with bar value 1\n",
+                   "Creating new entry in histogram %u\n",
                    hop_val );
         }
-        else
-        {
-          /* Entry exists in the table, just increment the value */
-          p_updn_hist->bar_value++;
-          osm_log( &p_osm->log, OSM_LOG_DEBUG,
-                   "__osm_updn_find_root_nodes_by_min_hop: "
-                   "Updating entry in histogram %u with bar value %d\n",
-                   hop_val, p_updn_hist->bar_value );
-        }
+        /* Entry exists in the table, just increment the value */
+        p_updn_hist->bar_value += cas_per_sw[lid_ho];
+        osm_log( &p_osm->log, OSM_LOG_DEBUG,
+                 "__osm_updn_find_root_nodes_by_min_hop: "
+                 "Updating entry in histogram %u with bar value %d\n",
+                 hop_val, p_updn_hist->bar_value );
       }
     }
 
@@ -908,13 +910,11 @@ __osm_updn_find_root_nodes_by_min_hop(
     }
   }
 
-  /* destroy the map of CA and RTR lids */
-  cl_map_remove_all( &ca_by_lid_map );
-  cl_map_destroy( &ca_by_lid_map );
-
+  free(cas_per_sw);
   /* Now convert the cl_list to array */
   __osm_updn_convert_list2array(p_updn);
- 
+
+ _exit:
   OSM_LOG_EXIT( &p_osm->log );
   return;
 }
-- 
1.5.3.rc0.93.ga0f53
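
For readers skimming the diff, the counting scheme it introduces can be
sketched in isolation roughly as follows. This is a simplified stand-alone
illustration, not OpenSM code: MAX_LID, MAX_HOPS and both function names are
made up for the example, plain arrays stand in for the cl_qmap histogram,
and the per-switch min hop table is passed in as a byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LID   15    /* toy bound; the patch sizes by IB_LID_UCAST_END_HO + 1 */
#define MAX_HOPS  7     /* toy bound for the histogram */
#define NO_PATH   0xff  /* "no path" hop value mentioned in the description */

/* For each CA, bump the counter of the switch (by LID) it is attached to. */
static void count_cas_per_switch(const unsigned *ca_sw_lid, size_t num_cas,
                                 unsigned cas_per_sw[MAX_LID + 1])
{
    size_t i;

    memset(cas_per_sw, 0, (MAX_LID + 1) * sizeof(cas_per_sw[0]));
    for (i = 0; i < num_cas; i++)
        if (ca_sw_lid[i] >= 1 && ca_sw_lid[i] <= MAX_LID)
            cas_per_sw[ca_sw_lid[i]]++;
}

/*
 * Histogram for one switch: hist[h] is the number of CAs whose attached
 * switch is h hops away.  Only switch LIDs are looked up in min_hops,
 * since the min hop tables now cover switches only.
 */
static void build_hop_histogram(const unsigned char min_hops[MAX_LID + 1],
                                const unsigned cas_per_sw[MAX_LID + 1],
                                unsigned hist[MAX_HOPS + 1])
{
    unsigned lid;

    memset(hist, 0, (MAX_HOPS + 1) * sizeof(hist[0]));
    for (lid = 1; lid <= MAX_LID; lid++) {
        if (!cas_per_sw[lid])
            continue;   /* no CAs behind this LID */
        if (min_hops[lid] == NO_PATH || min_hops[lid] > MAX_HOPS)
            continue;   /* unreachable or out of the toy range */
        hist[min_hops[lid]] += cas_per_sw[lid];
    }
}
```

Each histogram bar is weighted by the number of CAs behind the destination
switch, which mirrors the "bar_value += cas_per_sw[lid_ho]" line in the diff.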


From xhejtman at ics.muni.cz  Tue Jul 10 12:00:19 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Tue, 10 Jul 2007 21:00:19 +0200
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <ada7ip8b0be.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710141409.GH3885@ics.muni.cz>
	<ada7ip8b0be.fsf@cisco.com>
Message-ID: <20070710190018.GK3885@ics.muni.cz>

On Tue, Jul 10, 2007 at 11:06:29AM -0700, Roland Dreier wrote:
> > And what about the attached patch to mthca_memfree? It changes alloc_pages 
> > for  pci_alloc_consistent. Using it, I can enable FMR and the driver 
> > runs fine.
> 
> As Michael said, this uses a lot of consistent memory.  Probably too
> much on some systems.

I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and
coherent are the same but on some architectures they are not and I think that
using consistent (in particular pci_alloc_consistent) is exactly what should
be used. Keir also recommended to use this one.

And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right,
Keir?

-- 
Lukáš Hejtmánek


From rdreier at cisco.com  Tue Jul 10 12:08:43 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 12:08:43 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710190018.GK3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Tue, 10 Jul 2007 21:00:19 +0200")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710141409.GH3885@ics.muni.cz> <ada7ip8b0be.fsf@cisco.com>
	<20070710190018.GK3885@ics.muni.cz>
Message-ID: <adak5t89iv8.fsf@cisco.com>

 > I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and
 > coherent are the same but on some architectures they are not and I think that
 > using consistent (in particular pci_alloc_consistent) is exactly what should
 > be used. Keir also recommended to use this one.

coherent and consistent are synonyms.  It's confusing because there is
pci_alloc_consistent(), which is in general just a wrapper for
dma_alloc_coherent().

 > And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right,
 > Keir?

Yes, but I'm not really willing to make things worse for standard i386
just to make Xen work a little better.

 - R.


From halr at voltaire.com  Tue Jul 10 12:07:28 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 15:07:28 -0400
Subject: [ofa-general] Re: [PATCH] opensm/updn: up/down root switches
	detector fix
In-Reply-To: <20070710185236.GW25653@sashak.voltaire.com>
References: <20070710185236.GW25653@sashak.voltaire.com>
Message-ID: <1184094445.17622.15024.camel@hal.voltaire.com>

On Tue, 2007-07-10 at 14:52, Sasha Khapyorsky wrote:
> This problem was triggered by the min hops generator optimizations, where
> min hop matrices are created for switches only. The up/down root switch
> auto-detector code that uses those tables is outdated: it still tries to
> count hops to CAs directly and now gets only the value 0xff (no path), so
> all fabric switches are considered to be roots. This patch updates the
> root auto-detector code to match the recent min hops optimizations and
> fixes the issue.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks! Applied.

-- Hal


From cap at nsc.liu.se  Tue Jul 10 12:11:18 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Tue, 10 Jul 2007 21:11:18 +0200
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <1184091830.17622.12007.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
	<1184091830.17622.12007.camel@hal.voltaire.com>
Message-ID: <200707102111.28374.cap@nsc.liu.se>

On Tuesday 10 July 2007, Hal Rosenstock wrote:
...
> > Management:
> >       * Multiple partitions
> >       * OpenSM
> >               * More routing performance improvements
> >               * Even more speedups
> >               * Better packaging/installation
> >               * “Native” daemon mode
> >               * Performance management
> >               * Quality of Service manager: Based on IBTA annex
>
> enhancements for fat tree routing (non pure tree support)
> more console commands and telnet access to console

Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and 
in which way OFED-1.2 opensm performs badly for these?

Or maybe there are some nice docs for me to sink my teeth into...

/Peter

From halr at voltaire.com  Tue Jul 10 12:12:41 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 15:12:41 -0400
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <200707102111.28374.cap@nsc.liu.se>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
	<1184091830.17622.12007.camel@hal.voltaire.com>
	<200707102111.28374.cap@nsc.liu.se>
Message-ID: <1184094759.17622.15371.camel@hal.voltaire.com>

On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote:
> On Tuesday 10 July 2007, Hal Rosenstock wrote:
> ...
> > > Management:
> > >       * Multiple partitions
> > >       * OpenSM
> > >               * More routing performance improvements
> > >               * Even more speedups
> > >               * Better packaging/installation
> > >               * “Native” daemon mode
> > >               * Performance management
> > >               * Quality of Service manager: Based on IBTA annex
> >
> > enhancements for fat tree routing (non pure tree support)
> > more console commands and telnet access to console
> 
> Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and 
> in which way OFED-1.2 opensm performs badly for these?

Yevgeny,

Could you elaborate on this ? Thanks.

-- Hal

> Or maybe there are some nice docs for me to sink my teeth into...
> 
> /Peter
> 


From xhejtman at ics.muni.cz  Tue Jul 10 12:16:39 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Tue, 10 Jul 2007 21:16:39 +0200
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adak5t89iv8.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710141409.GH3885@ics.muni.cz>
	<ada7ip8b0be.fsf@cisco.com> <20070710190018.GK3885@ics.muni.cz>
	<adak5t89iv8.fsf@cisco.com>
Message-ID: <20070710191639.GL3885@ics.muni.cz>

On Tue, Jul 10, 2007 at 12:08:43PM -0700, Roland Dreier wrote:
>  > I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and
>  > coherent are the same but on some architectures they are not and I think that
>  > using consistent (in particular pci_alloc_consistent) is exactly what should
>  > be used. Keir also recommended to use this one.
> 
> coherent and consistent are synonyms.  It's confusing because there is
> pci_alloc_consistent(), which is in general just a wrapper for
> dma_alloc_coherent().

According to DMA-mapping.txt they are not. Alpha, M68000 without MMU, PPC,
Sparc, Sparc64, V850 have their own implementations of pci_alloc_consistent().

Yes, on i386, pci_alloc_consistent() is just a wrapper for
dma_alloc_coherent().

>  > And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right,
>  > Keir?
> 
> Yes, but I'm not really willing to make things worse for standard i386
> just to make Xen work a little better.

So, what about some #ifdefs ? E.g., allow config option - Xen optimizations?

-- 
Lukáš Hejtmánek


From caitlinb at broadcom.com  Tue Jul 10 12:21:45 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 10 Jul 2007 12:21:45 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070626070641.GM15343@mellanox.co.il>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475CEE4@NT-IRVA-0750.brcm.ad.broadcom.com>

general-bounces at lists.openfabrics.org wrote:
>> Quoting Roland Dreier <rdreier at cisco.com>:
>> Subject: Re: [PATCH RFC] sharing userspace IB objects
>> 
>> Some initial reaction, in no particular order:
>> 
>>  - Having to allocate everything in memory that the library mmap()s
>>    adds a lot of yucky stuff -- basically we need to implement our
>>    own allocator for the shared memory offsets.
> 
> Right.
> 
>>    I guess we could wrap this
>>    in libibverbs and only implement it once but still we're basically
>>    reimplementing malloc().
> 
> Right.
> 
>>    Is there really a strong use case for making every type of object
>>    shareable?  Can we handle the SRC stuff without going to this
>>    extreme of complexity?
> 
> This is not directly related to SRC: this is an effort to
> make it possible to share QPs, CQ etc across processes in the
> same way as they can be currently shared across threads.
> So assuming that we want multiple processes to post to the
> same QP, how can we support this?
> 

Sharing QPs and CQs ultimately means sharing Protection Domains
and Memory Regions across processes. So obviously this would never
be a default. Basically you would be enabling a group of processes
to share Memory Regions, QPs, etc all created with a single PD.
The easiest way to support this in the hardware is to simply not
be aware that it is happening, that is, to treat all the processes
as though they were just threads.

I suspect that makes the prospective pool of users quite small. But
there are lesser sharings that could be of value:

	Passing Connection Requests to other processes, and allowing
	them to accept the connection.

	Passing an empty QP to another process, which could then
	re-attach it to its Protection Domain and supply new memory
	for the SQ and RQ.


From rdreier at cisco.com  Tue Jul 10 12:24:02 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 12:24:02 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710191639.GL3885@ics.muni.cz> (Lukas Hejtmanek's message of
	"Tue, 10 Jul 2007 21:16:39 +0200")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710141409.GH3885@ics.muni.cz> <ada7ip8b0be.fsf@cisco.com>
	<20070710190018.GK3885@ics.muni.cz> <adak5t89iv8.fsf@cisco.com>
	<20070710191639.GL3885@ics.muni.cz>
Message-ID: <adafy3w9i5p.fsf@cisco.com>

 > > coherent and consistent are synonyms.  It's confusing because there is
 > > pci_alloc_consistent(), which is in general just a wrapper for
 > > dma_alloc_coherent().
 > 
 > According to DMA-mapping.txt they are not. Alpha, M68000 without MMU, PPC,
 > Sparc, Sparc64, V850 have their own implementations of pci_alloc_consistent().
 > 
 > Yes, on i386, pci_alloc_consistent() is just a wrapper for
 > dma_alloc_coherent().

Sorry, I was a little confusing.  The implementations may be different
but in general there is no real difference between consistent and
coherent memory.  Using either pci_alloc_consistent() or
dma_alloc_coherent() will exhaust the same small pool of address space
on powerpc 4xx for example.

 > So, what about some #ifdefs ? E.g., allow config option - Xen optimizations?

Seems pretty ugly, especially given that Xen is not upstream.  I think
the Xen tree should just carry such patches, at least until Xen is
merged.  Even then I'm quite dubious about having two code paths for this.

 - R.


From rdreier at cisco.com  Tue Jul 10 12:25:59 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 12:25:59 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710183006.GE11320@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 10 Jul 2007 21:30:06 +0300")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710071547.GA3814@mellanox.co.il> <adabqekuvde.fsf@cisco.com>
	<20070710171142.GC11320@mellanox.co.il> <ada3azwb076.fsf@cisco.com>
	<20070710183006.GE11320@mellanox.co.il>
Message-ID: <adabqek9i2g.fsf@cisco.com>

 > Hmm. This means there's no way to sync a range within
 > a mapping created with map_sg?

It doesn't seem that there is one right now at least.

 > > It actually doesn't look too bad to replace our use of pci_map_sg()
 > > with dma_map_single(), at least at first glance.  I'll try to write a
 > > patch later.
 > 
 > Well, the reason map_sg is there is presumably because on some
 > architectures it's worth it to try and make the region contiguous in DMA space.
 > But I agree this seems the lesser evil at this point ...

Given that we're already trying to allocate big chunks of physically
contiguous memory, I think that any virtual merging we get is likely
to be of very small benefit.

It is kind of a shame to give this up though.

 - R.


From rdreier at cisco.com  Tue Jul 10 12:28:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 12:28:44 -0700
Subject: [ofa-general] Re: [KJ PATCH] Replacing memset(<addr>, 0,
	PAGE_SIZE) with clear_page(<addr>) in
	drivers/infiniband/hw/mthca/mthca_eq.c
In-Reply-To: <1182136980.9020.13.camel@shani-win> (Shani Moideen's message of
	"Mon, 18 Jun 2007 08:53:00 +0530")
References: <1182136980.9020.13.camel@shani-win>
Message-ID: <ada7ip89hxv.fsf@cisco.com>

thanks, I applied both mthca patches as one commit.


From xhejtman at ics.muni.cz  Tue Jul 10 12:36:26 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Tue, 10 Jul 2007 21:36:26 +0200
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adafy3w9i5p.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710141409.GH3885@ics.muni.cz>
	<ada7ip8b0be.fsf@cisco.com> <20070710190018.GK3885@ics.muni.cz>
	<adak5t89iv8.fsf@cisco.com> <20070710191639.GL3885@ics.muni.cz>
	<adafy3w9i5p.fsf@cisco.com>
Message-ID: <20070710193626.GM3885@ics.muni.cz>

On Tue, Jul 10, 2007 at 12:24:02PM -0700, Roland Dreier wrote:
> Sorry, I was a little confusing.  The implementations may be different
> but in general there is no real difference between consistent and
> coherent memory.  Using either pci_alloc_consistent() or
> dma_alloc_coherent() will exhaust the same small pool of address space
> on powerpc 4xx for example.

I thought that consistent only refers to a physically contiguous area, whereas
coherent refers to memory where no barriers need to be used. But I may be
wrong.

Anyway, with my patch, I can turn off swiotlb and I'm still able to load
ib_mthca cleanly in DomU. On the other hand, Xen complains about a DMA bug in
ib_ipoib, so there may be another problem with dma_sync_single().

> Seems pretty ugly, especially given that Xen is not upstream.  I think
> the Xen tree should just carry such patches, at least until Xen is
> merged.  Even then I'm quite dubious about having two code paths for this.

OK, I will keep it for myself.

-- 
Lukáš Hejtmánek


From jim.houston at ccur.com  Tue Jul 10 13:05:31 2007
From: jim.houston at ccur.com (Jim Houston)
Date: Tue, 10 Jul 2007 16:05:31 -0400
Subject: [ofa-general] [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <200707041611.30056.hnguyen@linux.vnet.ibm.com>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
	<1183422700.3130.27.camel@localhost.localdomain>
	<200707041611.30056.hnguyen@linux.vnet.ibm.com>
Message-ID: <1184097931.3020.73.camel@localhost.localdomain>


Hi Everyone,

Hoang-Nam Nguyen reported a bug in idr_get_new_above() 
which occurred with a starting id value like 0x3ffffffc.
His test module easily reproduced the problem.  Thanks.

The test revealed the following bugs:

1. Relying on shift operations which have undefined results,
   e.g. 1 << n where n >= the word size.  On i386 an integer shift
   only uses the low 5 bits of the shift count.

2. An off by one error which prevented the top most layer
   of the radix tree from being allocated.  This meant that
   sub_alloc() would allocate an entry in the existing portion
   of the radix tree which aliased the requested address.  When
   it tried to allocate id 0x40000000, it might use the slot 
   belonging to id 0.

3. There was also a failure in the code which walked back up
   the tree if an allocation failed.  The normal case is to
   descend the tree checking the starting id value against the
   bitmap at each level.  If the bit is set, we know that the
   entire sub-tree is full and we can short cut the search.
   We may still descend to the lowest level and find that the
   portion of the id space we want is full.  In this case we
   need to walk back up the tree and continue the search.
   The existing code just returned to the previous level and
   continued.  This resulted in an attempt to allocate an id
   above 0x3ffffffc using the slot for id 0x3ffffc00 instead of
   0x40000000 which it then claimed to have allocated.  The same
   problem occurs with 0x3ff as the requested id value if it
   is already in use.
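
To illustrate points 1 and 3 outside the kernel, here is a small stand-alone
sketch of the "skip to the start of the next sub-tree" computation with the
shift count checked before it is used. It is not the code in lib/idr.c:
next_subtree_start() and MAX_ID_BITS are hypothetical names for this example,
and IDR_BITS is fixed at 5 as in the 32-bit configuration.

```c
#include <assert.h>
#include <limits.h>

#define IDR_BITS     5   /* bits consumed per layer (32-bit configuration) */
#define MAX_ID_BITS  31  /* ids are limited to 0...0x7fffffff */

/*
 * Round id up to the first id of the next sub-tree at the given layer,
 * i.e. (id | (2^(IDR_BITS * layer) - 1)) + 1.
 * Returns 0 and stores the result in *next, or -1 if the shift count
 * would reach the id width or the result does not fit in a
 * non-negative int.
 */
static int next_subtree_start(int id, int layer, int *next)
{
    unsigned shift;
    unsigned long long span, n;

    if (id < 0 || layer < 0)
        return -1;
    shift = IDR_BITS * (unsigned)layer;
    if (shift >= MAX_ID_BITS)
        return -1;              /* shifting this far is never valid here */
    span = 1ULL << shift;       /* well defined: shift < 31 */
    n = ((unsigned long long)id | (span - 1)) + 1;
    if (n > (unsigned long long)INT_MAX)
        return -1;              /* ran off the end of the id space */
    *next = (int)n;
    return 0;
}
```

With a starting id of 0x3ffffffc and layer 1, this yields 0x40000000, the
value the search is expected to continue from in the report above.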

With this patch, idr.c should work as advertised allocating id
values in the range 0...0x7fffffff.  Andrew had speculated that
it should allow the full range 0...0xffffffff to be used.  I was
tempted to make changes to allow this, but it would require changes
to the API, e.g. making the starting id value and the return value
unsigned.

Signed-off-by: Jim Houston <jim.houston at ccur.com>

--

Index: linux-2.6.22-rc7/include/linux/idr.h
===================================================================
--- linux-2.6.22-rc7.orig/include/linux/idr.h	2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.22-rc7/include/linux/idr.h	2007-07-06 16:46:31.000000000 -0400
@@ -18,17 +18,9 @@
 #if BITS_PER_LONG == 32
 # define IDR_BITS 5
 # define IDR_FULL 0xfffffffful
-/* We can only use two of the bits in the top level because there is
-   only one possible bit in the top level (5 bits * 7 levels = 35
-   bits, but you only use 31 bits in the id). */
-# define TOP_LEVEL_FULL (IDR_FULL >> 30)
 #elif BITS_PER_LONG == 64
 # define IDR_BITS 6
 # define IDR_FULL 0xfffffffffffffffful
-/* We can only use two of the bits in the top level because there is
-   only one possible bit in the top level (6 bits * 6 levels = 36
-   bits, but you only use 31 bits in the id). */
-# define TOP_LEVEL_FULL (IDR_FULL >> 62)
 #else
 # error "BITS_PER_LONG is not 32 or 64"
 #endif
Index: linux-2.6.22-rc7/lib/idr.c
===================================================================
--- linux-2.6.22-rc7.orig/lib/idr.c	2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.22-rc7/lib/idr.c	2007-07-10 11:05:19.000000000 -0400
@@ -105,8 +105,8 @@
 
 	id = *starting_id;
 	p = idp->top;
-	l = idp->layers;
-	pa[l--] = NULL;
+	l = idp->layers - 1;
+	pa[l] = NULL;
 	while (1) {
 		/*
 		 * We run around this while until we reach the leaf node...
@@ -117,8 +117,14 @@
 		if (m == IDR_SIZE) {
 			/* no space available go back to previous layer. */
 			l++;
-			id = (id | ((1 << (IDR_BITS * l)) - 1)) + 1;
-			if (!(p = pa[l])) {
+			id = (id | ((1 << (IDR_BITS * l)) - 1));
+			while (((id >> (IDR_BITS * l)) & IDR_MASK) == IDR_MASK)
+				l++;
+			id++;
+			p = pa[l-1];
+			if ((id >= MAX_ID_BIT) || (id < 0))
+				return -3;
+			if (!p) {
 				*starting_id = id;
 				return -2;
 			}
@@ -141,7 +147,7 @@
 			p->ary[m] = new;
 			p->count++;
 		}
-		pa[l--] = p;
+		pa[--l] = p;
 		p = p->ary[m];
 	}
 	/*
@@ -159,7 +165,7 @@
 	 */
 	n = id;
 	while (p->bitmap == IDR_FULL) {
-		if (!(p = pa[++l]))
+		if (!(p = pa[l++]))
 			break;
 		n = n >> IDR_BITS;
 		__set_bit((n & IDR_MASK), &p->bitmap);
@@ -186,7 +192,7 @@
 	 * Add a new layer to the top of the tree if the requested
 	 * id is larger than the currently allocated space.
 	 */
-	while ((layers < (MAX_LEVEL - 1)) && (id >= (1 << (layers*IDR_BITS)))) {
+	while ((layers < MAX_LEVEL) && (id & ((~0) << (layers*IDR_BITS)))) {
 		layers++;
 		if (!p->count)
 			continue;
@@ -299,7 +305,7 @@
 static void sub_remove(struct idr *idp, int shift, int id)
 {
 	struct idr_layer *p = idp->top;
-	struct idr_layer **pa[MAX_LEVEL];
+	struct idr_layer **pa[MAX_LEVEL+1];
 	struct idr_layer ***paa = &pa[0];
 	int n;
 
@@ -392,7 +398,7 @@
 	/* Mask off upper bits we don't use for the search. */
 	id &= MAX_ID_MASK;
 
-	if (id >= (1 << n))
+	if ((n <= MAX_ID_SHIFT) && (id & ((~0) << n)))
 		return NULL;
 
 	while (n > 0 && p) {
@@ -425,7 +431,7 @@
 
 	id &= MAX_ID_MASK;
 
-	if (id >= (1 << n))
+	if ((n <= MAX_ID_SHIFT) && (id & ((~0) << n)))
 		return ERR_PTR(-EINVAL);
 
 	n -= IDR_BITS;


From rdreier at cisco.com  Tue Jul 10 13:49:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 13:49:09 -0700
Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ
	work requests
In-Reply-To: <200706211201.58440.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 12:01:58 +0300")
References: <200706211201.58440.jackm@dev.mellanox.co.il>
Message-ID: <aday7ho7zne.fsf@cisco.com>

thanks, applied at last


From rick.jones2 at hp.com  Tue Jul 10 15:12:29 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Tue, 10 Jul 2007 15:12:29 -0700
Subject: [ofa-general] should it be possible to run SDP over a T320?
Message-ID: <4694044D.8010208@hp.com>

Hi -

I was talking to someone about the numbers I'd gathered for IPoIB with 
OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520 
did some non-trivial things to bulk transfer performance.  This person 
suggested it should be possible to run SDP over a Chelsio T320, which I 
happen to have in my systems at present.  However, my initial simplistic 
attempt was unsuccessful:

[root at hpcpc106 OFED-1.2-20070626-0917]# netperf -t SDP_STREAM -c -C -H 
192.168.2.107 -l 30
SDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.107 
(192.168.2.107) port 0 AF_INET
netperf: send_sdp_stream: data socket connect failed: Network is unreachable

This is with:

[root at hpcpc106 OFED-1.2-20070626-0917]# ethtool -i eth2
driver: cxgb3
version: 1.0.094
firmware-version: T 4.1.0
bus-info: 0000:08:00.0

and the "native" SDP netperf tests rather than any LD_PRELOADed library.

Am I on a wild goose chase, or should it be possible to do SDP over the 
T320 with OFED 1.2 bits on the system?

thanks,

rick jones


From stanleysufficool at roadrunner.com  Tue Jul 10 18:28:44 2007
From: stanleysufficool at roadrunner.com (Stanley Sufficool)
Date: Tue, 10 Jul 2007 18:28:44 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <4693B9E4.1070001@mellanox.com>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>
	<46926868.8000704@mellanox.com>
	<1184042252.15067.8.camel@gentoo-linux.localdomain>
	<4693B9E4.1070001@mellanox.com>
Message-ID: <1184117324.22408.0.camel@gentoo-linux.localdomain>

Is this the same as the README in the srpt_inc branch? That is the
document I based the Wiki on (with a few embellishments).

On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote:

> > Added a new wiki page based on Vu Pham's readme and issues with recent 
> > kernels. I hope to keep it current as I get our targets up and running.
> > 
> 
> Thanks for doing this.
> Please use the latest readme from this link - 
> http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt
> 
> 
> > http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation 
> > <https://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation>
> > 
> > WinIB initiators --> Gentoo Linux SRP Target.
> > 
> 
> I mainly test linux initiators with gen2 srp-target. I have 
> not tested win srp initiator with the target.
> 
> > Anything wrong with the above approach, I would be interested in a best 
> > practices if there is one. I saw a CentOS target post, is this more 
> > stable or better performing?
> 
> There is no difference when you run the same srp target / 
> scst codes in CentOS or RH/SuSe linux distributions. The 
> storage back-end will determine the performance
> 
> -vu
> 
> > 
> > Thanks.
> > 
> > On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote:
> >> Stanley Sufficool wrote:
> >> >   Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
> >> > 
> >> > Got the latest srpt from the git repository on OpenFabrics and had the 
> >> > following issues.
> >> > 
> >> > ib_srpt.c    Line 1997, missing second argument, should be?   
> >> > sdev->scst_tgt = scst_register(tp, NULL);
> >> > 
> >>
> >> Yes. You need the change if you test with top of scst svn 
> >> trunk (or from version 0.9.6-pre2)
> >> If you test with scst before 0.9.6-pre2 (ie. version <= 
> >> 0.9.6-pre1) you don't need the second argument for 
> >> scst_register()
> >>
> >>
> >> > SCST was built successfully after fixing an issue in scst_vdisk.c 
> >> > (missing #include <linux/sched.h>)
> >>
> >>
> >> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
> >> - you should send the patch to scst devel
> >>
> >> > 
> >> > Just thought this would be nice to have documented, took me half a day 
> >> > to track down as a novice in C programming.
> >> > 
> >>
> >> there is *lean and mean* srpt's README in srpt_inc
> >> SCST also has some document
> >> You can add some wiki/notes for the problems in openfabrics 
> >> wiki page https://wiki.openfabrics.org/tiki-index.php
> >>
> >> -vu
> >>
> >> > 
> >> > ------------------------------------------------------------------------
> >> > 
> >> > _______________________________________________
> >> > general mailing list
> >> > general at lists.openfabrics.org <mailto:general at lists.openfabrics.org>
> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> > 
> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >>
> 

From stanleysufficool at roadrunner.com  Tue Jul 10 22:02:21 2007
From: stanleysufficool at roadrunner.com (Stanley Sufficool)
Date: Tue, 10 Jul 2007 22:02:21 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <4693B9E4.1070001@mellanox.com>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>
	<46926868.8000704@mellanox.com>
	<1184042252.15067.8.camel@gentoo-linux.localdomain>
	<4693B9E4.1070001@mellanox.com>
Message-ID: <1184130141.22408.7.camel@gentoo-linux.localdomain>

Do you have any concerns that the WinIB (Mellanox) SRP initiators
may not work with SRPT?

If there is any doubt, I need to know so that I can fall back to iSCSI
over IPoIB (iSIPIB??? ;) )  . This has lots more overhead, but it's a
sure bet until this can be worked out.

On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote:

> > Added a new wiki page based on Vu Pham's readme and issues with recent 
> > kernels. I hope to keep it current as I get our targets up and running.
> > 
> 
> Thanks for doing this.
> Please use the latest readme from this link - 
> http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt
> 
> 
> > http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation 
> > <https://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation>
> > 
> > WinIB initiators --> Gentoo Linux SRP Target.
> > 
> 
> I mainly test linux initiators with gen2 srp-target. I have 
> not tested win srp initiator with the target.
> 
> > Anything wrong with the above approach, I would be interested in a best 
> > practices if there is one. I saw a CentOS target post, is this more 
> > stable or better performing?
> 
> There is no difference when you run the same srp target / 
> scst codes in CentOS or RH/SuSe linux distributions. The 
> storage back-end will determine the performance
> 
> -vu
> 
> > 
> > Thanks.
> > 
> > On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote:
> >> Stanley Sufficool wrote:
> >> >   Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
> >> > 
> >> > Got the latest srpt from the git repository on OpenFabrics and had the 
> >> > following issues.
> >> > 
> >> > ib_srpt.c    Line 1997, missing second argument, should be?   
> >> > sdev->scst_tgt = scst_register(tp, NULL);
> >> > 
> >>
> >> Yes. You need the change if you test with top of scst svn 
> >> trunk (or from version 0.9.6-pre2)
> >> If you test with scst before 0.9.6-pre2 (ie. version <= 
> >> 0.9.6-pre1) you don't need the second argument for 
> >> scst_register()
> >>
> >>
> >> > SCST was built successfully after fixing an issue in scst_vdisk.c 
> >> > (missing #include <linux/sched.h>)
> >>
> >>
> >> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
> >> - you should send the patch to scst devel
> >>
> >> > 
> >> > Just thought this would be nice to have documented, took me half a day 
> >> > to track down as a novice in C programming.
> >> > 
> >>
> >> there is *lean and mean* srpt's README in srpt_inc
> >> SCST also has some document
> >> You can add some wiki/notes for the problems in openfabrics 
> >> wiki page https://wiki.openfabrics.org/tiki-index.php
> >>
> >> -vu
> >>
> >> > 
> >> > ------------------------------------------------------------------------
> >> > 
> >> > _______________________________________________
> >> > general mailing list
> >> > general at lists.openfabrics.org <mailto:general at lists.openfabrics.org>
> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> > 
> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >>
> 

From ogerlitz at voltaire.com  Tue Jul 10 22:57:05 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 11 Jul 2007 08:57:05 +0300 (IDT)
Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name
Message-ID: <Pine.LNX.4.64.0707110840560.15887@zuben>

Roland,

This is the best I could come up with. It's still a problem
if you have multiple devices from different providers or
more than ten devices from the same provider... any other ideas?

--------------------------------------------------------------

The mad module creates a thread per active port, where the thread name is
derived from the port name. This causes different threads to have the same
names when there are multiple devices. Fix that by using both the device
and the port numbers to derive the name.

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: linux-2.6.22-rc2/drivers/infiniband/core/mad.c
===================================================================
--- linux-2.6.22-rc2.orig/drivers/infiniband/core/mad.c	2007-05-20 09:37:29.000000000 +0300
+++ linux-2.6.22-rc2/drivers/infiniband/core/mad.c	2007-07-11 08:38:59.000000000 +0300
@@ -2799,7 +2799,7 @@ static int ib_mad_port_open(struct ib_de
 	if (ret)
 		goto error7;

-	snprintf(name, sizeof name, "ib_mad%d", port_num);
+	snprintf(name, sizeof name, "ib_mad%d_%d", device->name[strlen(device->name)],port_num);
 	port_priv->wq = create_singlethread_workqueue(name);
 	if (!port_priv->wq) {
 		ret = -ENOMEM;
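
One detail worth checking in the snprintf() call above: indexing a string
at strlen() yields the terminating NUL, so the "%d" would print 0 for
every device.  Below is a small user-space sketch of deriving the number
from the last character instead.  The helper names are hypothetical, and
it assumes device names end in a single decimal digit (e.g. "mthca0"),
which is the same limitation mentioned at the top of this mail.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Trailing decimal digit of name (0 for "mthca0"), or -1 if there is none. */
static int trailing_digit(const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || !isdigit((unsigned char)name[len - 1]))
		return -1;
	return name[len - 1] - '0';
}

/* Build "ib_mad<dev>_<port>" into buf; returns what snprintf() returns. */
static int make_wq_name(char *buf, size_t size, const char *dev_name,
			int port_num)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "ib_mad%d_%d",
			trailing_digit(dev_name), port_num);
}
```

For "mthca0" and port 1 this produces "ib_mad0_1".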


From mst at dev.mellanox.co.il  Tue Jul 10 23:14:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 11 Jul 2007 09:14:44 +0300
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4694044D.8010208@hp.com>
References: <4694044D.8010208@hp.com>
Message-ID: <20070711061444.GG11320@mellanox.co.il>

> Quoting Rick Jones <rick.jones2 at hp.com>:
> Subject: should it be possible to run SDP over a T320?
> 
> Hi -
> 
> I was talking to someone about the numbers I'd gathered for IPoIB with 
> OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520 
> did some non-trivial things to bulk transfer performance.

Was this data posted on-list? I didn't see it.

-- 
MST


From vlad at dev.mellanox.co.il  Tue Jul 10 23:21:21 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 11 Jul 2007 09:21:21 +0300
Subject: [ofa-general] minor usability nit with 1.2GA?
In-Reply-To: <4693BE43.8070905@hp.com>
References: <4692CC7A.2050704@hp.com> <469372CC.5060207@dev.mellanox.co.il>
	<4693BE43.8070905@hp.com>
Message-ID: <469476E1.9060301@dev.mellanox.co.il>

>>
>> Hi,
>> OFED removes the previous software before installing the new one.
>> So, there shouldn't be a mix of different OFED versions on the same 
>> machine.
>>
>> Can you send me the output of the following commands:
>> # modinfo ib_sdp
>> # rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the 
>> previous command)
>> # rpm -q kernel-ib
>> # ofed_info
> 
> I can, but at this point I'm not sure what it would show since I went 
> back and did a "build me one with everything" install on both my 
> systems.  If you still want to see it I can do that though.
> 
> rick
> 

No, it doesn't make sense any more.

Thanks,
Vladimir


From ogerlitz at voltaire.com  Tue Jul 10 23:22:43 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 11 Jul 2007 09:22:43 +0300 (IDT)
Subject: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer
 ownership relaxation
In-Reply-To: <Pine.LNX.4.64.0707031144130.15147@zuben>
References: <Pine.LNX.4.64.0707031144130.15147@zuben>
Message-ID: <Pine.LNX.4.64.0707110919560.17892@zuben>

If the IBV_SEND_INLINE flag is set in the WR provided to ibv_post_send,
the data buffers can be reused immediately after the call returns. Document this.

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: libibverbs/include/infiniband/verbs.h
===================================================================
--- libibverbs.orig/include/infiniband/verbs.h
+++ libibverbs/include/infiniband/verbs.h
@@ -989,6 +989,9 @@ int ibv_destroy_qp(struct ibv_qp *qp);

 /**
  * ibv_post_send - Post a list of work requests to a send queue.
+ *
+ * if IBV_SEND_INLINE flag is set, the data buffers can be reused immediately
+ * after the call returns - low level libraries must conform to this rule.
  */
 static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
 				struct ibv_send_wr **bad_wr)
Index: libibverbs/man/ibv_post_send.3
===================================================================
--- libibverbs.orig/man/ibv_post_send.3
+++ libibverbs/man/ibv_post_send.3
@@ -109,7 +109,9 @@ behavior.
 .PP
 The buffers used by a WR can only be safely reused after WR the
 request is fully executed and a work completion has been retrieved
-from the corresponding completion queue (CQ).
+from the corresponding completion queue (CQ). However, if the
+IBV_SEND_INLINE flag was set, the buffer can be reused immediately
+after the call returns.
 .SH "SEE ALSO"
 .BR ibv_create_qp (3),
 .BR ibv_create_ah (3),


From tziporet at mellanox.co.il  Wed Jul 11 01:37:28 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 11 Jul 2007 11:37:28 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.3 timeline
In-Reply-To: <4693BF47.8070700@mellanox.co.il>
References: <4693BF47.8070700@mellanox.co.il>
Message-ID: <469496C8.9030005@mellanox.co.il>

Tziporet Koren wrote:

Fix Nov dates due to Thanksgiving holiday
> Hi All,
> Based on the requests to have OFED 1.3 release this year the release 
> schedule is the following:
>
>     * Feature freeze - Sep 4
>     * Alpha release - Sep 10
>     * Beta release - Sep 25
>     * RC1 - Oct 16
>     * RC2 - Oct 30
>     * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
>     * RC4 - Nov 20
>     * GA release - Nov 30 (or first week of Dec)
>
>
> To make this schedule we must implement all major changes for the 
> package during July so we have a stable package till middle of Aug.
> Also we must keep the new features in control and not insert 
> unnecessary changes that are not in the features list.
>
> Full features list will be published in a different mail
>
> Tziporet.
>
>


From vlad at lists.openfabrics.org  Wed Jul 11 02:45:38 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 11 Jul 2007 02:45:38 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070711-0200 daily build status
Message-ID: <20070711094539.126FAE6086B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From eitan at mellanox.co.il  Wed Jul 11 03:51:16 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 11 Jul 2007 13:51:16 +0300
Subject: [ofa-general] IB performance stats (revisited)
References: <46826370.4090602@hp.com><1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com><46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com><4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com><6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com><1182978496.28870.106214.camel@hal.voltaire.com><6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>

Hi Ira,

> Second, I have run some tests querying the fabric of our 
> large clusters here (~500 nodes) and the results were 
> promising for a single node implementation.
> I don't recall the numbers as this was a while ago but it was 
> on the order of
> <2 sec and I think <1 but I don't want to be misquoted.

Does PerfMgr query switch ports ?
If it does I am surprised by the short sweep time you got.

Does it have >1 query on the wire at a given time?
If not then I am even more surprised.

Was the cluster running a job at the time of the query ?

Thanks

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Ira Weiny [mailto:weiny2 at llnl.gov] 
> Sent: Tuesday, July 10, 2007 7:47 PM
> To: Eitan Zahavi
> Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> general at lists.openfabrics.org; Ed.Finn at FMR.COM
> Subject: Re: [ofa-general] IB performance stats (revisited)
> 
> On Thu, 28 Jun 2007 10:24:59 +0300
> "Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> 
> > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > > In the last months it is the second time I hear people
> > > complaining the
> > > > current monitoring solution in OFA is  integrated with OpenSM.
> > > 
> > > I must have missed this both times (didn't see this in Mark's
> > > post) and the statement itself is somewhat inaccurate as well.
> > Private talks - I hope they will speak up for themselves now...
> > > 
> > > > These people do not use OpenSM but do use OFED.
> > > 
> > > I'm not sure I'm following what you mean here.
> > > 
> > > If you mean that some people want to run PerfMgr without 
> the SM/SA 
> > > aspects (so that they can run a vendor based SM), that is 
> the next 
> > > thing we are adding to the implementation.
> > Exactly. OK when is that coming?
> 
> There is very little which ties the current PerfMgr to 
> OpenSM.  Basically it just gets the current fabric topology.  
> As Hal has said changes are coming.
> 
> >
> > > 
> > > >  Another drawback if that
> > > > no naming is provided and the reporting uses GUIDs.
> > > 
> > > Naming is provided via NodeDescription.
> > This might be good for hosts but is not covering  switches ...
> 
> It does include switches.  However, since most systems have 
> the same name for multiple switches this becomes ineffective. 
>  I have queried Voltaire for a way to change the 
> NodeDescription for switches, but at the time I asked, there 
> was no way to do it.  Perhaps there is now?  What about other 
> vendors?  This is why ibnetdiscover and other diags have 
> "switch map" support.  (A GUID->name mapping to override the 
> default NodeDescription.) Nothing would please me more than 
> to be able to remove that for a more "automatic" solution.
> 
> > > 
> > > > I also can't hold myself from saying again I think you 
> are going 
> > > > to hit the wall with the concept of doing the PMA from 
> a single node.
> > > 
> > > If you are referring to the fact the PerMgr is currently not 
> > > distributed, that will be done as has been stated before.
> > Good. When is it expected? Will it be OFED 1.3?
> 
> When Hal first sent out the PerfMgr design I thought we 
> should jump right to the distributed model as well.  But now 
> I am glad we have gone the way we did.
> First off, we have something which "works" and from which we 
> can expand.
> Second, I have run some tests querying the fabric of our 
> large clusters here (~500 nodes) and the results were 
> promising for a single node implementation.
> I don't recall the numbers as this was a while ago but it was 
> on the order of
> <2 sec and I think <1 but I don't want to be misquoted.
> 
> For sure, a distributed model offers many advantages and we 
> will get there.  But for many the current single node 
> approach should work just fine.
> 
> Thanks,
> Ira
> 
> > 
> > Thanks
> > > 
> > > -- Hal
> > > 
> > > > Eitan Zahavi
> > > > Senior Engineering Director, Software Architect Mellanox
> > > Technologies
> > > > LTD
> > > > Tel:+972-4-9097208
> > > > Fax:+972-4-9593245
> > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > 
> > > >  
> > > > 
> > > > > -----Original Message-----
> > > > > From: general-bounces at lists.openfabrics.org
> > > > > [mailto:general-bounces at lists.openfabrics.org] On 
> Behalf Of Hal 
> > > > > Rosenstock
> > > > > Sent: Wednesday, June 27, 2007 8:12 PM
> > > > > To: Mark Seger
> > > > > Cc: Finn, Ed; general at lists.openfabrics.org
> > > > > Subject: Re: [ofa-general] IB performance stats (revisited)
> > > > > 
> > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> > > > > > >The performance managers deal with the counter 
> stickiness (by 
> > > > > > >resetting them when they think they need to). They
> > > > > typically export
> > > > > > >their data although this is not specified by IBA so it is
> > > > > in a vendor
> > > > > > >proprietary manner.
> > > > > > >  
> > > > > > >
> > > > > > so I guess these guys are poor citizens as well...
> > > > > 
> > > > > Not sure what you mean.
> > > > > 
> > > > > > the real issue as I see it then means nobody can trust
> > > the data if
> > > > > > randon tools randomly reset the counters.  a real shame...
> > > > > 
> > > > > I consider this to be a real rather than random app for this. 
> > > > > Guess it depends on what one considers random.
> > > > > 
> > > > > -- Hal
> > > > > 
> > > > > > -mark
> > > > > > 
> > > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > 
> > > > > To unsubscribe, please visit
> > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > 
> > > 
> > > 
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > 
> > To unsubscribe, please visit 
> > http://openib.org/mailman/listinfo/openib-general
> > 
> 


From Thomas.Talpey at netapp.com  Wed Jul 11 04:50:37 2007
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 11 Jul 2007 07:50:37 -0400
Subject: [ofa-general] What should a ULP pass as ib_create_cq(...,
	comp_vector) ?
Message-ID: <EXNANE01xT3lEmFWqbT00000c57@exnane01.hq.netapp.com>

I notice the ib_create_cq() comp_vector support is merged in 2.6.22.
I don't completely understand what a ULP needs to pass as the argument.

I'm currently passing 0 in the NFS/RDMA client, what in general should I
consider using as a value? Or put another way, why is this exposed to
the ULP? Isn't this the MSI-X vector table index, a rather low-level thing
to hand to the ULP to manage?

Thanks,
Tom.


From jackm at dev.mellanox.co.il  Wed Jul 11 04:58:46 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 11 Jul 2007 14:58:46 +0300
Subject: [ofa-general] Re: [PATCH] mlx4: add device reset to Internal Error
	handling mechanism
In-Reply-To: <adaodilzhfc.fsf@cisco.com>
References: <200707091012.52418.jackm@dev.mellanox.co.il>
	<adaodilzhfc.fsf@cisco.com>
Message-ID: <200707111458.46564.jackm@dev.mellanox.co.il>

On Monday 09 July 2007 19:10, Roland Dreier wrote:
> 
> Why not just delete all the interrupt stuff completely?

I did this patch very quickly -- I'll delete all the interrupt stuff. 
 
> how about round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL) instead?

OK.


From halr at voltaire.com  Wed Jul 11 06:31:14 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 09:31:14 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
Message-ID: <1184160670.17622.92728.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
> Hi Ira,
> 
> > Second, I have run some tests querying the fabric of our 
> > large clusters here (~500 nodes) and the results were 
> > promising for a single node implementation.
> > I don't recall the numbers as this was a while ago but it was 
> > on the order of
> > <2 sec and I think <1 but I don't want to be misquoted.
> 
> Does PerfMgr query switch ports ?

Yes (of course it does).

> If it does I am surprised by the short sweep time you got.
> 
> Does it have >1 query on the wire at a given time?

Yes. The default currently appears to be 500 (maybe that needs dialing
back a bit), but it is settable via perfmgr_max_outstanding_queries in
the options file.
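
As a rough illustration of what a cap on outstanding queries means, here is a minimal sketch. It is not PerfMgr code: sweep(), fake_query() and run_sweep() are hypothetical names, and the query itself is a stand-in that returns a dummy value. The only idea taken from the thread is that a limit (500 by default, per the above) bounds how many queries are on the wire at once.

```python
import asyncio


async def sweep(ports, query, max_outstanding=500):
    """Query every port, keeping at most max_outstanding queries in flight."""
    gate = asyncio.Semaphore(max_outstanding)
    in_flight = 0
    peak = 0

    async def one(port):
        nonlocal in_flight, peak
        async with gate:  # wait here while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return port, await query(port)
            finally:
                in_flight -= 1

    results = dict(await asyncio.gather(*(one(p) for p in ports)))
    return results, peak


async def fake_query(port):
    # Hypothetical stand-in for a real performance query:
    # yields control once, then returns a dummy counter value.
    await asyncio.sleep(0)
    return port * 10


def run_sweep(num_ports, max_outstanding):
    """Run one sweep over ports 0..num_ports-1 and return (results, peak)."""
    return asyncio.run(sweep(range(num_ports), fake_query, max_outstanding))
```

Calling run_sweep(50, 8) returns one result per port plus the highest number of queries that were in flight at the same time, which never exceeds the cap.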

> If not then I am even more surprised.
> 
> Was the cluster running a job at the time of the query ?

Is this question related to VL0 contention ?

-- Hal

> Thanks
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: Ira Weiny [mailto:weiny2 at llnl.gov] 
> > Sent: Tuesday, July 10, 2007 7:47 PM
> > To: Eitan Zahavi
> > Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> > general at lists.openfabrics.org; Ed.Finn at FMR.COM
> > Subject: Re: [ofa-general] IB performance stats (revisited)
> > 
> > On Thu, 28 Jun 2007 10:24:59 +0300
> > "Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> > 
> > > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > > > In the last months it is the second time I hear people
> > > > complaining the
> > > > > current monitoring solution in OFA is  integrated with OpenSM.
> > > > 
> > > > I must have missed this both times (didn't see this in Mark's
> > > > post) and the statement itself is somewhat inaccurate as well.
> > > Private talks - I hope they will speak up for themselves now...
> > > > 
> > > > > These people do not use OpenSM but do use OFED.
> > > > 
> > > > I'm not sure I'm following what you mean here.
> > > > 
> > > > If you mean that some people want to run PerfMgr without 
> > the SM/SA 
> > > > aspects (so that they can run a vendor based SM), that is 
> > the next 
> > > > thing we are adding to the implementation.
> > > Exactly. OK when is that coming?
> > 
> > There is very little which ties the current PerfMgr to 
> > OpenSM.  Basically it just gets the current fabric topology.  
> > As Hal has said changes are coming.
> > 
> > >
> > > > 
> > > > >  Another drawback if that
> > > > > no naming is provided and the reporting uses GUIDs.
> > > > 
> > > > Naming is provided via NodeDescription.
> > > This might be good for hosts but is not covering  switches ...
> > 
> > It does include switches.  However, since most systems have 
> > the same name for multiple switches this becomes ineffective. 
> >  I have queried Voltaire for a way to change the 
> > NodeDescription for switches, but at the time I asked, there 
> > was no way to do it.  Perhaps there is now?  What about other 
> > vendors?  This is why ibnetdiscover and other diags have 
> > "switch map" support.  (A GUID->name mapping to override the 
> > default NodeDescription.) Nothing would please me more than 
> > to be able to remove that for a more "automatic" solution.
> > 
> > > > 
> > > > > I also can't hold myself from saying again I think you 
> > are going 
> > > > > to hit the wall with the concept of doing the PMA from 
> > a single node.
> > > > 
> > > > If you are referring to the fact the PerMgr is currently not 
> > > > distributed, that will be done as has been stated before.
> > > Good. When is it expected? Will it be OFED 1.3?
> > 
> > When Hal first sent out the PerfMgr design I thought we 
> > should jump right to the distributed model as well.  But now 
> > I am glad we have gone the way we did.
> > First off, we have something which "works" and from which we 
> > can expand.
> > Second, I have run some tests querying the fabric of our 
> > large clusters here (~500 nodes) and the results were 
> > promising for a single node implementation.
> > I don't recall the numbers as this was a while ago but it was 
> > on the order of
> > <2 sec and I think <1 but I don't want to be misquoted.
> > 
> > For sure, a distributed model offers many advantages and we 
> > will get there.  But for many the current single node 
> > approach should work just fine.
> > 
> > Thanks,
> > Ira
> > 
> > > 
> > > Thanks
> > > > 
> > > > -- Hal
> > > > 
> > > > > Eitan Zahavi
> > > > > Senior Engineering Director, Software Architect Mellanox
> > > > Technologies
> > > > > LTD
> > > > > Tel:+972-4-9097208
> > > > > Fax:+972-4-9593245
> > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > > 
> > > > >  
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: general-bounces at lists.openfabrics.org
> > > > > > [mailto:general-bounces at lists.openfabrics.org] On 
> > Behalf Of Hal 
> > > > > > Rosenstock
> > > > > > Sent: Wednesday, June 27, 2007 8:12 PM
> > > > > > To: Mark Seger
> > > > > > Cc: Finn, Ed; general at lists.openfabrics.org
> > > > > > Subject: Re: [ofa-general] IB performance stats (revisited)
> > > > > > 
> > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> > > > > > > >The performance managers deal with the counter 
> > stickiness (by 
> > > > > > > >resetting them when they think they need to). They
> > > > > > typically export
> > > > > > > >their data although this is not specified by IBA so it is
> > > > > > in a vendor
> > > > > > > >proprietary manner.
> > > > > > > >  
> > > > > > > >
> > > > > > > so I guess these guys are poor citizens as well...
> > > > > > 
> > > > > > Not sure what you mean.
> > > > > > 
> > > > > > > the real issue as I see it then means nobody can trust
> > > > the data if
> > > > > > > randon tools randomly reset the counters.  a real shame...
> > > > > > 
> > > > > > I consider this to be a real rather than random app for this. 
> > > > > > Guess it depends on what one considers random.
> > > > > > 
> > > > > > -- Hal
> > > > > > 
> > > > > > > -mark
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > _______________________________________________
> > > > > > general mailing list
> > > > > > general at lists.openfabrics.org
> > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > 
> > > > > > To unsubscribe, please visit
> > > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > > 
> > > > 
> > > > 
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > 
> > > To unsubscribe, please visit 
> > > http://openib.org/mailman/listinfo/openib-general
> > > 
> > 


From eitan at mellanox.co.il  Wed Jul 11 07:03:35 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 11 Jul 2007 17:03:35 +0300
Subject: [ofa-general] IB performance stats (revisited)
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com>

Hi Hal,
> > 
> > > Second, I have run some tests querying the fabric of our large 
> > > clusters here (~500 nodes) and the results were promising for a 
> > > single node implementation.
> > > I don't recall the numbers as this was a while ago but it 
> was on the 
> > > order of
> > > <2 sec and I think <1 but I don't want to be misquoted.
> > 
> > Does PerfMgr query switch ports ?
> 
> Yes (of course it does).
> 
> > If it does I am surprised by the short sweep time you got.
> > 
> > Does it have >1 query on the wire at a given time?
> 
> Yes, Default appears to be 500 currently (maybe that needs 
> dialing back a bit) but is settable via 
> perfmgr_max_outstanding_queries in options file.
This explains some of it.
> 
> > If not then I am even more surprised.
> > 
> > Was the cluster running a job at the time of the query ?
> 
> Is this question related to VL0 contention ?
Yes
> 
> -- Hal
> 
> > Thanks
> > 
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect Mellanox 
> Technologies 
> > LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: Ira Weiny [mailto:weiny2 at llnl.gov]
> > > Sent: Tuesday, July 10, 2007 7:47 PM
> > > To: Eitan Zahavi
> > > Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> > > general at lists.openfabrics.org; Ed.Finn at FMR.COM
> > > Subject: Re: [ofa-general] IB performance stats (revisited)
> > > 
> > > On Thu, 28 Jun 2007 10:24:59 +0300
> > > "Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> > > 
> > > > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> > > > > > In the last months it is the second time I hear people
> > > > > complaining the
> > > > > > current monitoring solution in OFA is  integrated 
> with OpenSM.
> > > > > 
> > > > > I must have missed this both times (didn't see this in Mark's
> > > > > post) and the statement itself is somewhat inaccurate as well.
> > > > Private talks - I hope they will speak up for themselves now...
> > > > > 
> > > > > > These people do not use OpenSM but do use OFED.
> > > > > 
> > > > > I'm not sure I'm following what you mean here.
> > > > > 
> > > > > If you mean that some people want to run PerfMgr without
> > > the SM/SA
> > > > > aspects (so that they can run a vendor based SM), that is
> > > the next
> > > > > thing we are adding to the implementation.
> > > > Exactly. OK when is that coming?
> > > 
> > > There is very little which ties the current PerfMgr to OpenSM.  
> > > Basically it just gets the current fabric topology.
> > > As Hal has said changes are coming.
> > > 
> > > >
> > > > > 
> > > > > >  Another drawback if that
> > > > > > no naming is provided and the reporting uses GUIDs.
> > > > > 
> > > > > Naming is provided via NodeDescription.
> > > > This might be good for hosts but is not covering  switches ...
> > > 
> > > It does include switches.  However, since most systems 
> have the same 
> > > name for multiple switches this becomes ineffective.
> > >  I have queried Voltaire for a way to change the 
> NodeDescription for 
> > > switches, but at the time I asked, there was no way to do it.  
> > > Perhaps there is now?  What about other vendors?  This is why 
> > > ibnetdiscover and other diags have "switch map" support.  (A 
> > > GUID->name mapping to override the default 
> NodeDescription.) Nothing 
> > > would please me more than to be able to remove that for a more 
> > > "automatic" solution.
> > > 
> > > > > 
> > > > > > I also can't hold myself from saying again I think you
> > > are going
> > > > > > to hit the wall with the concept of doing the PMA from
> > > a single node.
> > > > > 
> > > > > If you are referring to the fact the PerMgr is currently not 
> > > > > distributed, that will be done as has been stated before.
> > > > Good. When is it expected? Will it be OFED 1.3?
> > > 
> > > When Hal first sent out the PerfMgr design I thought we 
> should jump 
> > > right to the distributed model as well.  But now I am 
> glad we have 
> > > gone the way we did.
> > > First off, we have something which "works" and from which we can 
> > > expand.
> > > Second, I have run some tests querying the fabric of our large 
> > > clusters here (~500 nodes) and the results were promising for a 
> > > single node implementation.
> > > I don't recall the numbers as this was a while ago but it 
> was on the 
> > > order of
> > > <2 sec and I think <1 but I don't want to be misquoted.
> > > 
> > > For sure, a distributed model offers many advantages and 
> we will get 
> > > there.  But for many the current single node approach should work 
> > > just fine.
> > > 
> > > Thanks,
> > > Ira
> > > 
> > > > 
> > > > Thanks
> > > > > 
> > > > > -- Hal
> > > > > 
> > > > > > Eitan Zahavi
> > > > > > Senior Engineering Director, Software Architect Mellanox
> > > > > Technologies
> > > > > > LTD
> > > > > > Tel:+972-4-9097208
> > > > > > Fax:+972-4-9593245
> > > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > > > 
> > > > > >  
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: general-bounces at lists.openfabrics.org
> > > > > > > [mailto:general-bounces at lists.openfabrics.org] On
> > > Behalf Of Hal
> > > > > > > Rosenstock
> > > > > > > Sent: Wednesday, June 27, 2007 8:12 PM
> > > > > > > To: Mark Seger
> > > > > > > Cc: Finn, Ed; general at lists.openfabrics.org
> > > > > > > Subject: Re: [ofa-general] IB performance stats 
> (revisited)
> > > > > > > 
> > > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> > > > > > > > >The performance managers deal with the counter
> > > stickiness (by
> > > > > > > > >resetting them when they think they need to). They
> > > > > > > typically export
> > > > > > > > >their data although this is not specified by 
> IBA so it is
> > > > > > > in a vendor
> > > > > > > > >proprietary manner.
> > > > > > > > >  
> > > > > > > > >
> > > > > > > > so I guess these guys are poor citizens as well...
> > > > > > > 
> > > > > > > Not sure what you mean.
> > > > > > > 
> > > > > > > > the real issue as I see it then means nobody can trust
> > > > > the data if
> > > > > > > > randon tools randomly reset the counters.  a 
> real shame...
> > > > > > > 
> > > > > > > I consider this to be a real rather than random 
> app for this. 
> > > > > > > Guess it depends on what one considers random.
> > > > > > > 
> > > > > > > -- Hal
> > > > > > > 
> > > > > > > > -mark
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org 
> > > > > > > 
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/genera
> > > > > > > l
> > > > > > > 
> > > > > > > To unsubscribe, please visit 
> > > > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > > > 
> > > > > 
> > > > > 
> > > > _______________________________________________
> > > > general mailing list
> > > > general at lists.openfabrics.org
> > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > 
> > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > > 
> > > 
> 
> 


From Mark.Seger at hp.com  Wed Jul 11 07:15:59 2007
From: Mark.Seger at hp.com (Mark Seger)
Date: Wed, 11 Jul 2007 10:15:59 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <1184160670.17622.92728.camel@hal.voltaire.com>
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
Message-ID: <4694E61F.8000502@hp.com>

My basic philosophy, and I suspect there are those who might disagree, 
is that you can't use the network to monitor the network, at least not 
in times of trouble.  That's why I insist on having to query the HCAs 
directly since I can't always be sure the network is there and/or 
reliable.  If you are willing to concede that this can indeed happen 
then the question becomes one of how do you reliably get data from an 
HCA and that's the basis for my (re)starting this discussion.

As for querying the switch for counters, what do you do on a very large 
network, say 10s of thousands of nodes if you want to get performance 
data every second?  I also realize this is an extreme situation today 
(the node count not the frequency of monitoring) but I'm sure everyone 
would agree systems of these sizes are not that far off.

-mark

Hal Rosenstock wrote:

>Hi Eitan,
>
>On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
>  
>
>>Hi Ira,
>>
>>    
>>
>>>Second, I have run some tests querying the fabric of our 
>>>large clusters here (~500 nodes) and the results were 
>>>promising for a single node implementation.
>>>I don't recall the numbers as this was a while ago but it was 
>>>on the order of
>>><2 sec and I think <1 but I don't want to be misquoted.
>>>      
>>>
>>Does PerfMgr query switch ports ?
>>    
>>
>
>Yes (of course it does).
>
>  
>
>>If it does I am surprised by the short sweep time you got.
>>
>>Does it have >1 query on the wire at a given time?
>>    
>>
>
>Yes, Default appears to be 500 currently (maybe that needs dialing back
>a bit) but is settable via perfmgr_max_outstanding_queries in options
>file.
>
>  
>
>>If not then I am even more surprised.
>>
>>Was the cluster running a job at the time of the query ?
>>    
>>
>
>Is this question related to VL0 contention ?
>
>-- Hal
>
>  
>
>>Thanks
>>
>>Eitan Zahavi
>>Senior Engineering Director, Software Architect
>>Mellanox Technologies LTD
>>Tel:+972-4-9097208
>>Fax:+972-4-9593245
>>P.O. Box 586 Yokneam 20692 ISRAEL
>>
>> 
>>
>>    
>>
>>>-----Original Message-----
>>>From: Ira Weiny [mailto:weiny2 at llnl.gov] 
>>>Sent: Tuesday, July 10, 2007 7:47 PM
>>>To: Eitan Zahavi
>>>Cc: halr at voltaire.com; Mark.Seger at hp.com; 
>>>general at lists.openfabrics.org; Ed.Finn at FMR.COM
>>>Subject: Re: [ofa-general] IB performance stats (revisited)
>>>
>>>On Thu, 28 Jun 2007 10:24:59 +0300
>>>"Eitan Zahavi" <eitan at mellanox.co.il> wrote:
>>>
>>>      
>>>
>>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
>>>>>          
>>>>>
>>>>>>In the last months it is the second time I hear people
>>>>>>            
>>>>>>
>>>>>complaining the
>>>>>          
>>>>>
>>>>>>current monitoring solution in OFA is  integrated with OpenSM.
>>>>>>            
>>>>>>
>>>>>I must have missed this both times (didn't see this in Mark's
>>>>>post) and the statement itself is somewhat inaccurate as well.
>>>>>          
>>>>>
>>>>Private talks - I hope they will speak up for themselves now...
>>>>        
>>>>
>>>>>>These people do not use OpenSM but do use OFED.
>>>>>>            
>>>>>>
>>>>>I'm not sure I'm following what you mean here.
>>>>>
>>>>>If you mean that some people want to run PerfMgr without 
>>>>>          
>>>>>
>>>the SM/SA 
>>>      
>>>
>>>>>aspects (so that they can run a vendor based SM), that is 
>>>>>          
>>>>>
>>>the next 
>>>      
>>>
>>>>>thing we are adding to the implementation.
>>>>>          
>>>>>
>>>>Exactly. OK when is that coming?
>>>>        
>>>>
>>>There is very little which ties the current PerfMgr to 
>>>OpenSM.  Basically it just gets the current fabric topology.  
>>>As Hal has said changes are coming.
>>>
>>>      
>>>
>>>>>> Another drawback if that
>>>>>>no naming is provided and the reporting uses GUIDs.
>>>>>>            
>>>>>>
>>>>>Naming is provided via NodeDescription.
>>>>>          
>>>>>
>>>>This might be good for hosts but is not covering  switches ...
>>>>        
>>>>
>>>It does include switches.  However, since most systems have 
>>>the same name for multiple switches this becomes ineffective. 
>>> I have queried Voltaire for a way to change the 
>>>NodeDescription for switches, but at the time I asked, there 
>>>was no way to do it.  Perhaps there is now?  What about other 
>>>vendors?  This is why ibnetdiscover and other diags have 
>>>"switch map" support.  (A GUID->name mapping to override the 
>>>default NodeDescription.) Nothing would please me more than 
>>>to be able to remove that for a more "automatic" solution.
>>>
>>>      
>>>
>>>>>>I also can't hold myself from saying again I think you 
>>>>>>            
>>>>>>
>>>are going 
>>>      
>>>
>>>>>>to hit the wall with the concept of doing the PMA from 
>>>>>>            
>>>>>>
>>>a single node.
>>>      
>>>
>>>>>If you are referring to the fact the PerMgr is currently not 
>>>>>distributed, that will be done as has been stated before.
>>>>>          
>>>>>
>>>>Good. When is it expected? Will it be OFED 1.3?
>>>>        
>>>>
>>>When Hal first sent out the PerfMgr design I thought we 
>>>should jump right to the distributed model as well.  But now 
>>>I am glad we have gone the way we did.
>>>First off, we have something which "works" and from which we 
>>>can expand.
>>>Second, I have run some tests querying the fabric of our 
>>>large clusters here (~500 nodes) and the results were 
>>>promising for a single node implementation.
>>>I don't recall the numbers as this was a while ago but it was 
>>>on the order of
>>><2 sec and I think <1 but I don't want to be misquoted.
>>>
>>>For sure, a distributed model offers many advantages and we 
>>>will get there.  But for many the current single node 
>>>approach should work just fine.
>>>
>>>Thanks,
>>>Ira
>>>
>>>      
>>>
>>>>Thanks
>>>>        
>>>>
>>>>>-- Hal
>>>>>
>>>>>          
>>>>>
>>>>>>Eitan Zahavi
>>>>>>Senior Engineering Director, Software Architect Mellanox
>>>>>>            
>>>>>>
>>>>>Technologies
>>>>>          
>>>>>
>>>>>>LTD
>>>>>>Tel:+972-4-9097208
>>>>>>Fax:+972-4-9593245
>>>>>>P.O. Box 586 Yokneam 20692 ISRAEL
>>>>>>
>>>>>> 
>>>>>>
>>>>>>            
>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: general-bounces at lists.openfabrics.org
>>>>>>>[mailto:general-bounces at lists.openfabrics.org] On 
>>>>>>>              
>>>>>>>
>>>Behalf Of Hal 
>>>      
>>>
>>>>>>>Rosenstock
>>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM
>>>>>>>To: Mark Seger
>>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org
>>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited)
>>>>>>>
>>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
>>>>>>>              
>>>>>>>
>>>>>>>>>The performance managers deal with the counter 
>>>>>>>>>                  
>>>>>>>>>
>>>stickiness (by 
>>>      
>>>
>>>>>>>>>resetting them when they think they need to). They
>>>>>>>>>                  
>>>>>>>>>
>>>>>>>typically export
>>>>>>>              
>>>>>>>
>>>>>>>>>their data although this is not specified by IBA so it is
>>>>>>>>>                  
>>>>>>>>>
>>>>>>>in a vendor
>>>>>>>              
>>>>>>>
>>>>>>>>>proprietary manner.
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>>                  
>>>>>>>>>
>>>>>>>>so I guess these guys are poor citizens as well...
>>>>>>>>                
>>>>>>>>
>>>>>>>Not sure what you mean.
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>>>>the real issue as I see it then means nobody can trust
>>>>>>>>                
>>>>>>>>
>>>>>the data if
>>>>>          
>>>>>
>>>>>>>>randon tools randomly reset the counters.  a real shame...
>>>>>>>>                
>>>>>>>>
>>>>>>>I consider this to be a real rather than random app for this. 
>>>>>>>Guess it depends on what one considers random.
>>>>>>>
>>>>>>>-- Hal
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>>>>-mark
>>>>>>>>
>>>>>>>>
>>>>>>>>                
>>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>general mailing list
>>>>>>>general at lists.openfabrics.org
>>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>>>
>>>>>>>To unsubscribe, please visit
>>>>>>>http://openib.org/mailman/listinfo/openib-general
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>          
>>>>>
>>>>_______________________________________________
>>>>general mailing list
>>>>general at lists.openfabrics.org
>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>
>>>>To unsubscribe, please visit 
>>>>http://openib.org/mailman/listinfo/openib-general
>>>>
>>>>        
>>>>


From halr at voltaire.com  Wed Jul 11 07:22:30 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 10:22:30 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <4694E61F.8000502@hp.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
Message-ID: <1184163750.17622.96256.camel@hal.voltaire.com>

On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> My basic philosophy, and I suspect there are those who might disagree, 
> is that you can't use the network to monitor the network, at least not 
> in times of trouble.

Right, in times of certain troubles.

> That's why I insist on having to query the HCAs 
> directly since I can't always be sure the network is there and/or 
> reliable.  If you are willing to concede that this can indeed happen 
> then the question becomes one of how do you reliably get data from an 
> HCA and that's the basis for my (re)starting this discussion.

The reliability comes from timeout/retry mechanisms. If performance data
cannot be obtained on an IB network, the problem needs to be troubleshot
at a lower level (by SMPs).
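
A minimal sketch of the timeout/retry idea, under stated assumptions: query_with_retry() and FlakyPort are hypothetical names, the "xmit_data" value is dummy data, and a real implementation would send a MAD and wait for the response rather than call a Python function. When all attempts time out the caller gets None back, which is the point at which lower-level troubleshooting would start.

```python
def query_with_retry(send, retries=3, timeout=0.5):
    """Call send(timeout) up to `retries` times.

    Returns (reply, attempts_used); reply is None if every attempt timed out.
    """
    for attempt in range(1, retries + 1):
        try:
            return send(timeout), attempt
        except TimeoutError:
            continue  # no response in time, try again
    return None, retries


class FlakyPort:
    """Hypothetical port that times out a fixed number of times, then answers."""

    def __init__(self, failures):
        self.failures = failures

    def send(self, timeout):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("no response within %.1fs" % timeout)
        return {"xmit_data": 1234}  # dummy counter value
```

A port that drops two requests still answers on the third attempt; one that drops more than the retry budget yields None.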

In any case, a rearchitecture of the PMA was proposed and seems
reasonable to me in that it can accommodate either approach. All that is
needed now is for someone to step up and champion an implementation of
this. Unfortunately, I do not have time to do so.

> As for querying the switch for counters, what do you do on a very large 
> network, say 10s of thousands of nodes if you want to get performance 
> data every second?  I also realize this is an extreme situation today 
> (the node count not the frequency of monitoring) but I'm sure everyone 
> would agree systems of these sizes are not that far off.

You have a distributed performance manager to handle this. A hierarchy
of performance managers has been discussed on the list before.
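
To make the hierarchy idea concrete, here is a small sketch, assuming a two-level arrangement in which each sub-manager sweeps its own share of the nodes and a top-level manager only combines the summaries. All names (partition, sub_manager_sweep, top_level_sweep, read_counter) are hypothetical; this is not the design discussed earlier on the list, just one way the work could be split.

```python
def partition(nodes, num_managers):
    """Split nodes round-robin into num_managers roughly equal shares."""
    return [nodes[i::num_managers] for i in range(num_managers)]


def sub_manager_sweep(nodes, read_counter):
    """One sub-manager: read a counter from each of its nodes, return a summary."""
    return {"nodes": len(nodes), "total": sum(read_counter(n) for n in nodes)}


def top_level_sweep(nodes, num_managers, read_counter):
    """Top-level manager: combine the summaries reported by the sub-managers."""
    summaries = [
        sub_manager_sweep(share, read_counter)
        for share in partition(nodes, num_managers)
    ]
    return {
        "nodes": sum(s["nodes"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
    }
```

In this sketch the top level handles one summary per sub-manager instead of one reply per node, so its per-sweep load grows with the number of sub-managers rather than with the size of the fabric.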

-- Hal

> -mark
> 
> Hal Rosenstock wrote:
> 
> >Hi Eitan,
> >
> >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
> >  
> >
> >>Hi Ira,
> >>
> >>    
> >>
> >>>Second, I have run some tests querying the fabric of our 
> >>>large clusters here (~500 nodes) and the results were 
> >>>promising for a single node implementation.
> >>>I don't recall the numbers as this was a while ago but it was 
> >>>on the order of
> >>><2 sec and I think <1 but I don't want to be misquoted.
> >>>      
> >>>
> >>Does PerfMgr query switch ports ?
> >>    
> >>
> >
> >Yes (of course it does).
> >
> >  
> >
> >>If it does I am surprised by the short sweep time you got.
> >>
> >>Does it have >1 query on the wire at a given time?
> >>    
> >>
> >
> >Yes, Default appears to be 500 currently (maybe that needs dialing back
> >a bit) but is settable via perfmgr_max_outstanding_queries in options
> >file.
> >
> >  
> >
> >>If not then I am even more surprised.
> >>
> >>Was the cluster running a job at the time of the query ?
> >>    
> >>
> >
> >Is this question related to VL0 contention ?
> >
> >-- Hal
> >
> >  
> >
> >>Thanks
> >>
> >>Eitan Zahavi
> >>Senior Engineering Director, Software Architect
> >>Mellanox Technologies LTD
> >>Tel:+972-4-9097208
> >>Fax:+972-4-9593245
> >>P.O. Box 586 Yokneam 20692 ISRAEL
> >>
> >> 
> >>
> >>    
> >>
> >>>-----Original Message-----
> >>>From: Ira Weiny [mailto:weiny2 at llnl.gov] 
> >>>Sent: Tuesday, July 10, 2007 7:47 PM
> >>>To: Eitan Zahavi
> >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM
> >>>Subject: Re: [ofa-general] IB performance stats (revisited)
> >>>
> >>>On Thu, 28 Jun 2007 10:24:59 +0300
> >>>"Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> >>>
> >>>      
> >>>
> >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> >>>>>          
> >>>>>
> >>>>>>In the last months it is the second time I hear people
> >>>>>>            
> >>>>>>
> >>>>>complaining the
> >>>>>          
> >>>>>
> >>>>>>current monitoring solution in OFA is  integrated with OpenSM.
> >>>>>>            
> >>>>>>
> >>>>>I must have missed this both times (didn't see this in Mark's
> >>>>>post) and the statement itself is somewhat inaccurate as well.
> >>>>>          
> >>>>>
> >>>>Private talks - I hope they will speak up for themselves now...
> >>>>        
> >>>>
> >>>>>>These people do not use OpenSM but do use OFED.
> >>>>>>            
> >>>>>>
> >>>>>I'm not sure I'm following what you mean here.
> >>>>>
> >>>>>If you mean that some people want to run PerfMgr without 
> >>>>>          
> >>>>>
> >>>the SM/SA 
> >>>      
> >>>
> >>>>>aspects (so that they can run a vendor based SM), that is 
> >>>>>          
> >>>>>
> >>>the next 
> >>>      
> >>>
> >>>>>thing we are adding to the implementation.
> >>>>>          
> >>>>>
> >>>>Exactly. OK when is that coming?
> >>>>        
> >>>>
> >>>There is very little which ties the current PerfMgr to 
> >>>OpenSM.  Basically it just gets the current fabric topology.  
> >>>As Hal has said changes are coming.
> >>>
> >>>      
> >>>
> >>>>>> Another drawback if that
> >>>>>>no naming is provided and the reporting uses GUIDs.
> >>>>>>            
> >>>>>>
> >>>>>Naming is provided via NodeDescription.
> >>>>>          
> >>>>>
> >>>>This might be good for hosts but is not covering  switches ...
> >>>>        
> >>>>
> >>>It does include switches.  However, since most systems have 
> >>>the same name for multiple switches this becomes ineffective. 
> >>> I have queried Voltaire for a way to change the 
> >>>NodeDescription for switches, but at the time I asked, there 
> >>>was no way to do it.  Perhaps there is now?  What about other 
> >>>vendors?  This is why ibnetdiscover and other diags have 
> >>>"switch map" support.  (A GUID->name mapping to override the 
> >>>default NodeDescription.) Nothing would please me more than 
> >>>to be able to remove that for a more "automatic" solution.
> >>>
> >>>      
> >>>
> >>>>>>I also can't hold myself from saying again I think you 
> >>>>>>            
> >>>>>>
> >>>are going 
> >>>      
> >>>
> >>>>>>to hit the wall with the concept of doing the PMA from 
> >>>>>>            
> >>>>>>
> >>>a single node.
> >>>      
> >>>
> >>>>>If you are referring to the fact the PerMgr is currently not 
> >>>>>distributed, that will be done as has been stated before.
> >>>>>          
> >>>>>
> >>>>Good. When is it expected? Will it be OFED 1.3?
> >>>>        
> >>>>
> >>>When Hal first sent out the PerfMgr design I thought we 
> >>>should jump right to the distributed model as well.  But now 
> >>>I am glad we have gone the way we did.
> >>>First off, we have something which "works" and from which we 
> >>>can expand.
> >>>Second, I have run some tests querying the fabric of our 
> >>>large clusters here (~500 nodes) and the results were 
> >>>promising for a single node implementation.
> >>>I don't recall the numbers as this was a while ago but it was 
> >>>on the order of
> >>><2 sec and I think <1 but I don't want to be misquoted.
> >>>
> >>>For sure, a distributed model offers many advantages and we 
> >>>will get there.  But for many the current single node 
> >>>approach should work just fine.
> >>>
> >>>Thanks,
> >>>Ira
> >>>
> >>>      
> >>>
> >>>>Thanks
> >>>>        
> >>>>
> >>>>>-- Hal
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>Eitan Zahavi
> >>>>>>Senior Engineering Director, Software Architect Mellanox
> >>>>>>            
> >>>>>>
> >>>>>Technologies
> >>>>>          
> >>>>>
> >>>>>>LTD
> >>>>>>Tel:+972-4-9097208
> >>>>>>Fax:+972-4-9593245
> >>>>>>P.O. Box 586 Yokneam 20692 ISRAEL
> >>>>>>
> >>>>>> 
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>-----Original Message-----
> >>>>>>>From: general-bounces at lists.openfabrics.org
> >>>>>>>[mailto:general-bounces at lists.openfabrics.org] On 
> >>>>>>>              
> >>>>>>>
> >>>Behalf Of Hal 
> >>>      
> >>>
> >>>>>>>Rosenstock
> >>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM
> >>>>>>>To: Mark Seger
> >>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org
> >>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited)
> >>>>>>>
> >>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>The performance managers deal with the counter 
> >>>>>>>>>                  
> >>>>>>>>>
> >>>stickiness (by 
> >>>      
> >>>
> >>>>>>>>>resetting them when they think they need to). They
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>typically export
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>their data although this is not specified by IBA so it is
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>in a vendor
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>proprietary manner.
> >>>>>>>>> 
> >>>>>>>>>
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>>so I guess these guys are poor citizens as well...
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>Not sure what you mean.
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>>>the real issue as I see it then means nobody can trust
> >>>>>>>>                
> >>>>>>>>
> >>>>>the data if
> >>>>>          
> >>>>>
> >>>>>>>>randon tools randomly reset the counters.  a real shame...
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>I consider this to be a real rather than random app for this. 
> >>>>>>>Guess it depends on what one considers random.
> >>>>>>>
> >>>>>>>-- Hal
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>>>-mark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>_______________________________________________
> >>>>>>>general mailing list
> >>>>>>>general at lists.openfabrics.org
> >>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>>>>>>
> >>>>>>>To unsubscribe, please visit
> >>>>>>>http://openib.org/mailman/listinfo/openib-general
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>          
> >>>>>
> 


From eitan at mellanox.co.il  Wed Jul 11 07:29:56 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 11 Jul 2007 17:29:56 +0300
Subject: [ofa-general] IB performance stats (revisited)
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com>

Hi Mark,

I published an RFC and later had discussions regarding the distribution
of query ownership of switch counters.
Making this ownership purely dynamic, semi-dynamic or even static is an
implementation tradeoff.
However, it can be shown that the maximal number of switches a single
compute node would be responsible for is <= number of switch levels. So
no problem to get counters every second...

The issue is: what do you do with the volume of data collected?
This is only relevant if monitoring is run in "profiling mode"; otherwise
only link health errors should be reported.

My proposal is a reporting algorithm that reports only a "change of data
rate", with "change" defined adaptively. In other words:

A node should report a change of port activity upstream only if the data
rate changed by a factor of more than X.
Assuming we want a logarithmic scale, X == 2 would work like this:

At the first sample there is no traffic. All counters will need to make
their way to the "master" node.
When traffic starts, the change of data rate is infinite, so all the new
rates are sent.
From that moment on, a port is reported only when its data rate reaches
2x or 0.5x the rate it last reported.

Integration period should be configurable.
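
A minimal sketch of the reporting rule described above, in Python. The
names (RateReporter, sample) are illustrative assumptions and not part of
any existing PerfMgr code; treating a start or stop of traffic as always
reportable is one possible reading of the "infinite change" case.

```python
class RateReporter:
    """Report a port's data rate only when it changes by a factor of X or more."""

    def __init__(self, factor=2.0):
        if factor <= 1.0:
            raise ValueError("factor must be greater than 1")
        self.factor = factor
        self.last_reported = {}  # port id -> last rate sent upstream

    def sample(self, port, rate):
        """Return True if this sample should be sent upstream."""
        last = self.last_reported.get(port)
        if last is None:
            # First sample for this port: always reported.
            changed = True
        elif last == 0 or rate == 0:
            # Traffic starting or stopping counts as an "infinite" change.
            changed = rate != last
        else:
            ratio = rate / last
            changed = ratio >= self.factor or ratio <= 1.0 / self.factor
        if changed:
            self.last_reported[port] = rate
        return changed
```

With factor 2, a port that last reported 100 stays silent at 150 and is
reported again at 200 or at 50, so the number of reports grows with the
logarithm of the rate swing rather than with the number of samples.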

I wish I had time to implement it ...

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Mark Seger [mailto:Mark.Seger at hp.com] 
> Sent: Wednesday, July 11, 2007 5:16 PM
> To: Hal Rosenstock
> Cc: Eitan Zahavi; Ira Weiny; general at lists.openfabrics.org; 
> Ed.Finn at FMR.COM
> Subject: Re: [ofa-general] IB performance stats (revisited)
> 
> My basic philosophy, and I suspect there are those who might 
> disagree, is that you can't use the network to monitor the 
> network, at least not in times of trouble.  That's why I 
> insist on having to query the HCAs directly since I can't 
> always be sure the network is there and/or reliable.  If you 
> are willing to concede that this can indeed happen than the 
> question becomes one of how do you reliably get data from an 
> HCA and that's the basis for my (re)starting this discussion.
> 
> As for querying the switch for counters, what do you do on a 
> very large network, say 10s of thousands of nodes if you want 
> to get performance data every second?  I also realize this is 
> an extreme situation today (the node count not the frequency 
> of monitoring) but I'm sure everyone would agree systems of 
> these sizes are not that far off.
> 
> -mark
> 
> Hal Rosenstock wrote:
> 
> >Hi Eitan,
> >
> >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
> >  
> >
> >>Hi Ira,
> >>
> >>    
> >>
> >>>Second, I have run some tests querying the fabric of our large 
> >>>clusters here (~500 nodes) and the results were promising for a 
> >>>single node implementation.
> >>>I don't recall the numbers as this was a while ago but it 
> was on the 
> >>>order of
> >>><2 sec and I think <1 but I don't want to be misquoted.
> >>>      
> >>>
> >>Does PerfMgr query switch ports ?
> >>    
> >>
> >
> >Yes (of course it does).
> >
> >  
> >
> >>If it does I am surprised by the short sweep time you got.
> >>
> >>Does it have >1 query on the wire at a given time?
> >>    
> >>
> >
> >Yes, Default appears to be 500 currently (maybe that needs 
> dialing back 
> >a bit) but is settable via perfmgr_max_outstanding_queries 
> in options 
> >file.
> >
> >  
> >
> >>If not then I am even more surprised.
> >>
> >>Was the cluster running a job at the time of the query ?
> >>    
> >>
> >
> >Is this question related to VL0 contention ?
> >
> >-- Hal
> >
> >  
> >
> >>Thanks
> >>
> >>Eitan Zahavi
> >>Senior Engineering Director, Software Architect Mellanox 
> Technologies 
> >>LTD
> >>Tel:+972-4-9097208
> >>Fax:+972-4-9593245
> >>P.O. Box 586 Yokneam 20692 ISRAEL
> >>
> >> 
> >>
> >>    
> >>
> >>>-----Original Message-----
> >>>From: Ira Weiny [mailto:weiny2 at llnl.gov]
> >>>Sent: Tuesday, July 10, 2007 7:47 PM
> >>>To: Eitan Zahavi
> >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM
> >>>Subject: Re: [ofa-general] IB performance stats (revisited)
> >>>
> >>>On Thu, 28 Jun 2007 10:24:59 +0300
> >>>"Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> >>>
> >>>      
> >>>
> >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> >>>>>          
> >>>>>
> >>>>>>In the last months it is the second time I hear people
> >>>>>>            
> >>>>>>
> >>>>>complaining the
> >>>>>          
> >>>>>
> >>>>>>current monitoring solution in OFA is  integrated with OpenSM.
> >>>>>>            
> >>>>>>
> >>>>>I must have missed this both times (didn't see this in Mark's
> >>>>>post) and the statement itself is somewhat inaccurate as well.
> >>>>>          
> >>>>>
> >>>>Private talks - I hope they will speak up for themselves now...
> >>>>        
> >>>>
> >>>>>>These people do not use OpenSM but do use OFED.
> >>>>>>            
> >>>>>>
> >>>>>I'm not sure I'm following what you mean here.
> >>>>>
> >>>>>If you mean that some people want to run PerfMgr without
> >>>>>          
> >>>>>
> >>>the SM/SA
> >>>      
> >>>
> >>>>>aspects (so that they can run a vendor based SM), that is
> >>>>>          
> >>>>>
> >>>the next
> >>>      
> >>>
> >>>>>thing we are adding to the implementation.
> >>>>>          
> >>>>>
> >>>>Exactly. OK when is that coming?
> >>>>        
> >>>>
> >>>There is very little which ties the current PerfMgr to OpenSM.  
> >>>Basically it just gets the current fabric topology.
> >>>As Hal has said changes are coming.
> >>>
> >>>      
> >>>
> >>>>>> Another drawback if that
> >>>>>>no naming is provided and the reporting uses GUIDs.
> >>>>>>            
> >>>>>>
> >>>>>Naming is provided via NodeDescription.
> >>>>>          
> >>>>>
> >>>>This might be good for hosts but is not covering  switches ...
> >>>>        
> >>>>
> >>>It does include switches.  However, since most systems 
> have the same 
> >>>name for multiple switches this becomes ineffective.
> >>> I have queried Voltaire for a way to change the 
> NodeDescription for 
> >>>switches, but at the time I asked, there was no way to do it.  
> >>>Perhaps there is now?  What about other vendors?  This is why 
> >>>ibnetdiscover and other diags have "switch map" support.  (A 
> >>>GUID->name mapping to override the default 
> NodeDescription.) Nothing 
> >>>would please me more than to be able to remove that for a more 
> >>>"automatic" solution.
> >>>
> >>>      
> >>>
> >>>>>>I also can't hold myself from saying again I think you
> >>>>>>            
> >>>>>>
> >>>are going
> >>>      
> >>>
> >>>>>>to hit the wall with the concept of doing the PMA from
> >>>>>>            
> >>>>>>
> >>>a single node.
> >>>      
> >>>
> >>>>>If you are referring to the fact the PerMgr is currently not 
> >>>>>distributed, that will be done as has been stated before.
> >>>>>          
> >>>>>
> >>>>Good. When is it expected? Will it be OFED 1.3?
> >>>>        
> >>>>
> >>>When Hal first sent out the PerfMgr design I thought we 
> should jump 
> >>>right to the distributed model as well.  But now I am glad we have 
> >>>gone the way we did.
> >>>First off, we have something which "works" and from which we can 
> >>>expand.
> >>>Second, I have run some tests querying the fabric of our large 
> >>>clusters here (~500 nodes) and the results were promising for a 
> >>>single node implementation.
> >>>I don't recall the numbers as this was a while ago but it 
> was on the 
> >>>order of
> >>><2 sec and I think <1 but I don't want to be misquoted.
> >>>
> >>>For sure, a distributed model offers many advantages and 
> we will get 
> >>>there.  But for many the current single node approach should work 
> >>>just fine.
> >>>
> >>>Thanks,
> >>>Ira
> >>>
> >>>      
> >>>
> 
> 


From Mark.Seger at hp.com  Wed Jul 11 07:51:01 2007
From: Mark.Seger at hp.com (Mark Seger)
Date: Wed, 11 Jul 2007 10:51:01 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com>
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com>
Message-ID: <4694EE55.6050107@hp.com>


Eitan Zahavi wrote:

>Hi Marc,
>
>I published an RFC and later had discussions regarding the distribution
>of query ownership of switch counters.
>Making this ownership purely dynamic, semi-dynamic or even static is an
>implementation tradeoff.
>However, it can be shown that the maximal number of switches a single
>compute node would be responsible for is <= number of switch levels. So
>no problem to get counters every second...
>
>The issue is: what do you do with the size of data collected?
>This is only relevant if monitoring is run in "profiling mode" otherwise
>only link health errors should be reported.
>  
>
I typically use IB performance data for system/application 
diagnostics.  I run a tool I wrote (see 
http://sourceforge.net/projects/collectl/) as a service on most systems 
and it gathers hundreds of performance metrics/counters on 
everything from cpu load, memory, network, infiniband, disk, etc.  The 
philosophy here is that if something goes wrong, it may be too late to 
then run some diagnostic.  Rather you need to have already collected the 
data, especially if this is an intermittent problem.  When there is no 
need to look at the data, it just gets purged away after a week.

There have been situations where someone reports that a batch program they 
ran the other day was really slow and they didn't change anything.  Pulling 
up a monitoring log and seeing what the system was doing at 
the time of the run might reveal that their network was saturated and 
therefore their MPI job was impacted.  You can't very well turn on 
diagnostics and rerun the application because system conditions have 
probably changed.

Does that help?  Why don't you try installing collectl and see what it 
does...

-mark


From Mark.Seger at hp.com  Wed Jul 11 08:00:21 2007
From: Mark.Seger at hp.com (Mark Seger)
Date: Wed, 11 Jul 2007 11:00:21 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <1184163750.17622.96256.camel@hal.voltaire.com>
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>	
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<1184163750.17622.96256.camel@hal.voltaire.com>
Message-ID: <4694F085.4010502@hp.com>

Hal Rosenstock wrote:

>On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
>  
>
>>My basic philosophy, and I suspect there are those who might disagree, 
>>is that you can't use the network to monitor the network, at least not 
>>in times of trouble.
>>    
>>
>
>Right, in times of certain troubles.
>  
>
and that is the key.  since you can't know a priori when you're about to 
have troubles, you need to be collecting the data locally before they occur.

>>That's why I insist on having to query the HCAs 
>>directly since I can't always be sure the network is there and/or 
>>reliable.  If you are willing to concede that this can indeed happen 
>>than the question becomes one of how do you reliably get data from an 
>>HCA and that's the basis for my (re)starting this discussion.
>>    
>>
>
>The reliability comes from timeout/retry mechanisms. If performance data
>cannot be obtained on an IB network, it needs to be trouble shooted at a
>lower level (by SMPs).
>
>In any case, a rearchitecture of the PMA was proposed and seems
>reasonable to me in that it can accomodate either approach. All that is
>needed now is for someone to step up and champion an implementation of
>this. Unfortunately, I do not have time to do so.
>  
>
I don't know if what I've been proposing requires any rearchitecting, as 
I see it as something local to each node.  Specifically, and there is 
already an implementation of this in an earlier voltaire stack, the idea 
is to export wrapping HCA counters to /proc.  The module that does this 
reads and clears the counters on every access, but since no local 
applications are accessing the counters directly, clearing them doesn't 
hurt anyone.  
Alas, anyone else who wants to query the counters will find them reset.
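
As a rough illustration of the idea, a read-and-clear hardware counter can
be turned into an ever-increasing, wrapping counter by accumulating each
value read.  The sketch below is Python rather than kernel code, and
read_and_clear is a hypothetical stand-in for whatever actually queries
and resets the HCA counter; the 64-bit width is also an assumption.

```python
class WrappingCounter:
    """Accumulate read-and-clear samples into a counter that wraps at 2**bits."""

    def __init__(self, read_and_clear, bits=64):
        self._read_and_clear = read_and_clear  # hypothetical hardware accessor
        self._mask = (1 << bits) - 1
        self.total = 0

    def read(self):
        # Each hardware read returns the count since the previous read,
        # so adding it keeps a running total that wraps like a register.
        self.total = (self.total + self._read_and_clear()) & self._mask
        return self.total


def delta(prev, curr, bits=64):
    """Difference between two readings of a wrapping counter."""
    return (curr - prev) & ((1 << bits) - 1)
```

A collector then only needs two successive readings and delta() to get a
rate, and it never has to reset anything itself.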

The other side benefit of exporting these counters in such a way is that 
now lots of others can collect/report this info.  In other words, if 
someone chose to add IB stats to sar, it would become very easy to do!

If this is the type of thing people are interested in, I might be able 
to supply some code to do it.

>>As for querying the switch for counters, what do you do on a very large 
>>network, say 10s of thousands of nodes if you want to get performance 
>>data every second?  I also realize this is an extreme situation today 
>>(the node count not the frequency of monitoring) but I'm sure everyone 
>>would agree systems of these sizes are not that far off.
>>    
>>
>
>You have a distributed performance manager to handle this. A hierarchy
>of performance managers has been discussed on the list before.
>  
>
ahh, I see.
-mark


From halr at voltaire.com  Wed Jul 11 08:16:26 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 11:16:26 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <4694F085.4010502@hp.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<1184163750.17622.96256.camel@hal.voltaire.com>
	<4694F085.4010502@hp.com>
Message-ID: <1184166984.17622.100081.camel@hal.voltaire.com>

On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> Hal Rosenstock wrote:
> 
> >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> >  
> >
> >>My basic philosophy, and I suspect there are those who might disagree, 
> >>is that you can't use the network to monitor the network, at least not 
> >>in times of trouble.
> >>    
> >>
> >
> >Right, in times of certain troubles.
> >  
> >
> and that is the key.  since you can't know apriori when you're about to 
> have troubles, you need to be collecting the data locally before they occur.
> 
> >>That's why I insist on having to query the HCAs 
> >>directly since I can't always be sure the network is there and/or 
> >>reliable.  If you are willing to concede that this can indeed happen 
> >>than the question becomes one of how do you reliably get data from an 
> >>HCA and that's the basis for my (re)starting this discussion.
> >>    
> >>
> >
> >The reliability comes from timeout/retry mechanisms. If performance data
> >cannot be obtained on an IB network, it needs to be trouble shooted at a
> >lower level (by SMPs).
> >
> >In any case, a rearchitecture of the PMA was proposed and seems
> >reasonable to me in that it can accomodate either approach. All that is
> >needed now is for someone to step up and champion an implementation of
> >this. Unfortunately, I do not have time to do so.
> >  
> >
> I don't know if what I've been proposing requires any rearchitecting as 
> I see is as something local to each node.

There was some rearchitecting proposed to make it meet the needs of what
you have proposed in addition to those of the IB performance manager. I
think Jason had a good proposal for this.

-- Hal

>   Specificially, and there is 
> already an implementation of this in an earlier voltaire stack, is to 
> export wrapping HCA counters to /proc.  The module that does this 
> read/clears the counters on every access but since no local applications 
> are accessing the counters directly, clearing them doesn't hurt anyone.  
> Alas, anyone else who wants to query the counters will find them reset.
> 
> The other side benefit of exporting these counters is such a way is now 
> lots of others can collect/report this info.  In other words is someone 
> chose to add IB stats to sar, it would become very easy to do!
> 
> If this is the type of thing people are interested in, I might be able 
> to supply some code to do it.
> 
> >>As for querying the switch for counters, what do you do on a very large 
> >>network, say 10s of thousands of nodes if you want to get performance 
> >>data every second?  I also realize this is an extreme situation today 
> >>(the node count not the frequency of monitoring) but I'm sure everyone 
> >>would agree systems of these sizes are not that far off.
> >>    
> >>
> >
> >You have a distributed performance manager to handle this. A hierarchy
> >of performance managers has been discussed on the list before.
> >  
> >
> ahh, I see.
> -mark
> 
> 


From eitan at mellanox.co.il  Wed Jul 11 08:30:04 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 11 Jul 2007 18:30:04 +0300
Subject: [ofa-general] IB performance stats (revisited)
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com>
	<4694EE55.6050107@hp.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com>

Hi Mark,

I wish I had a large enough fabric worth testing collectl on...

I did the math for how much data would be collected for a 10K-node
cluster. It is ~7MB for each iteration: 
10K ports 
* 6 (3 level fabric * 2 ports on each link)
* 32 bytes (data/pkts tx/rx) + 22 bytes (err counters) + 64 bytes (cong
counters) = 116 bytes

Seems reasonable, but it adds up to a large amount of data over a day,
assuming a collection every second:
24*60*60 *116*10000*6 = 6.01344e+11 Bytes of storage
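
The estimate can be reproduced with a few lines of Python. The function
and parameter names are illustrative only. The per-port size is passed in
explicitly because the three counter sizes listed above add up to 118
bytes, while the estimate uses 116.

```python
def bytes_per_sweep(end_ports, ports_per_path, bytes_per_port):
    """Data collected in one sweep of the whole fabric."""
    return end_ports * ports_per_path * bytes_per_port


def bytes_per_day(end_ports, ports_per_path, bytes_per_port, interval_s=1):
    """Data collected in 24 hours when sweeping every interval_s seconds."""
    sweeps = (24 * 60 * 60) // interval_s
    return sweeps * bytes_per_sweep(end_ports, ports_per_path, bytes_per_port)


# Figures used in the estimate: 10K end ports, 6 switch ports each, 116 bytes.
per_sweep = bytes_per_sweep(10_000, 6, 116)  # 6,960,000 bytes, about 7 MB
per_day = bytes_per_day(10_000, 6, 116)      # 601,344,000,000 bytes

# Same estimate with the listed sizes summed (32 + 22 + 64 = 118 bytes).
per_day_118 = bytes_per_day(10_000, 6, 32 + 22 + 64)  # 611,712,000,000 bytes
```

Either way the result is roughly 0.6 TB per day at a one second interval.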

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 


From halr at voltaire.com  Wed Jul 11 08:54:06 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 11:54:06 -0400
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <1184094759.17622.15371.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
	<1184091830.17622.12007.camel@hal.voltaire.com>
	<200707102111.28374.cap@nsc.liu.se>
	<1184094759.17622.15371.camel@hal.voltaire.com>
Message-ID: <1184169244.17622.102683.camel@hal.voltaire.com>

On Tue, 2007-07-10 at 15:12, Hal Rosenstock wrote:
> On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote:
> > On Tuesday 10 July 2007, Hal Rosenstock wrote:
> > ...
> > > > Management:
> > > >       * Multiple partitions
> > > >       * OpenSM
> > > >               * More routing performance improvements
> > > >               * Even more speedups
> > > >               * Better packaging/installation
> > > >               * “Native” daemon mode
> > > >               * Performance management
> > > >               * Quality of Service manager: Based on IBTA annex
> > >
> > > enhancements for fat tree routing (non pure tree support)
> > > more console commands and telnet access to console
> > 
> > Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and 
> > in which way OFED-1.2 opensm performs badly for these?

The following patch contains some of the answers to the above:

-----Forwarded Message-----

From: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
To: Hal Rosenstock <halr at voltaire.com>
Cc: OpenIB <general at lists.openfabrics.org>
Subject: [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree
Date: 09 Jul 2007 11:32:49 +0300

Hi Hal.

Updating doc and osm manpage with the 
recent enhancement of fat-tree routing.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/doc/current-routing.txt |   28 ++++++++++++++++++++++------
 opensm/man/opensm.8            |   33 ++++++++++++++++++++++++++-------
 2 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
index 9852ef0..76f91ba 100644
--- a/opensm/doc/current-routing.txt
+++ b/opensm/doc/current-routing.txt
@@ -174,11 +174,14 @@ Fat-tree Routing Algorithm
 Purpose:
 
 The fat-tree algorithm optimizes routing for "shift" communication pattern.
-It should be chosen if a subnet is a symmetrical fat-tree of various types.
+It should be chosen if a subnet is a symmetrical or almost symmetrical
+fat-tree of various types.
 It supports not just K-ary-N-Trees, by handling for non-constant K,
 cases where not all leafs (CAs) are present, any CBB ratio.
 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
-Fat-tree algorithm supports topologies that comply with the following rules:
+
+If the root guid file is not provided ('-a' or '--root_guid_file' options),
+the topology has to be pure fat-tree that complies with the following rules:
   - Tree rank should be between two and eight (inclusively)
   - Switches of the same rank should have the same number
     of UP-going port groups*, unless they are root switches,
@@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules:
     of ports in each UP-going port group.
   - Switches of the same rank should have the same number
     of ports in each DOWN-going port group.
-*ports that are connected to the same remote switch are referenced as
+  - All the CAs have to be at the same tree level (rank).
+
+If the root guid file is provided, the topology doesn't have to be pure
+fat-tree, and it should only comply with the following rules:
+  - Tree rank should be between two and eight (inclusively)
+  - All the Compute Nodes** have to be at the same tree level (rank).
+    Note that non-compute node CAs are allowed here to be at different
+    tree ranks.
+
+* ports that are connected to the same remote switch are referenced as
 'port group'.
+** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file'
+OpenSM options.
 
 Note that although fat-tree algorithm supports trees with non-integer CBB
 ratio, the routing will not be as balanced as in case of integer CBB ratio.
 In addition to this, although the algorithm allows leaf switches to have any
 number of CAs, the closer the tree is to be fully populated, the more effective
 the "shift" communication pattern will be.
+In general, even if the root list is provided, the closer the topology to a
+pure and symmetrical fat-tree, the more optimal the routing will be.
 
-The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
-same directory where the OpenSM log resides. This ordering file provides the
-CA order that may be used to create efficient communication pattern, that
+The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
+in the same directory where the OpenSM log resides. This ordering file provides
+the CN order that may be used to create efficient communication pattern, that
 will match the routing tables.
 

diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
index 5f34cd1..5472faf 100644
--- a/opensm/man/opensm.8
+++ b/opensm/man/opensm.8
@@ -603,7 +603,7 @@ UPDN Algorithm Usage
 Activation through OpenSM
 
 Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
-Use '-a <guid_list_file>' for adding an UPDN guid file that contains the
+Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
 root nodes for ranking.
 If the `-a' option is not used, OpenSM uses its auto-detect root nodes
 algorithm.
@@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node.
 Fat-tree Routing Algorithm
 
 The fat-tree algorithm optimizes routing for "shift" communication pattern.
-It should be chosen if a subnet is a symmetrical fat-tree of various types.
+It should be chosen if a subnet is a symmetrical or almost symmetrical
+fat-tree of various types.
 It supports not just K-ary-N-Trees, by handling for non-constant K,
 cases where not all leafs (CAs) are present, any CBB ratio.
 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
 
-The Fat-tree algorithm supports topologies that comply with the following rules:
+If the root guid file is not provided ('-a' or '--root_guid_file' options),
+the topology has to be pure fat-tree that complies with the following rules:
   - Tree rank should be between two and eight (inclusively)
   - Switches of the same rank should have the same number
     of UP-going port groups*, unless they are root switches,
@@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules:
     of ports in each UP-going port group.
   - Switches of the same rank should have the same number
     of ports in each DOWN-going port group.
+  - All the CAs have to be at the same tree level (rank).
 
-Note: ports that are connected to the same remote switch are referenced as
+If the root guid file is provided, the topology doesn't have to be pure
+fat-tree, and it should only comply with the following rules:
+  - Tree rank should be between two and eight (inclusively)
+  - All the Compute Nodes** have to be at the same tree level (rank).
+    Note that non-compute node CAs are allowed here to be at different
+    tree ranks.
+
+* ports that are connected to the same remote switch are referenced as
 \'port group\'.
 
+** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
+OpenSM options.
+
 Topologies that do not comply cause a fallback to min hop routing.
 Note that this can also occur on link failures which cause the topology
 to no longer be "pure" fat-tree.
@@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio.
 In addition to this, although the algorithm allows leaf switches to have any
 number of CAs, the closer the tree is to be fully populated, the more
 effective the "shift" communication pattern will be.
+In general, even if the root list is provided, the closer the topology to a
+pure and symmetrical fat-tree, the more optimal the routing will be.
 
-The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
-same directory where the OpenSM log resides. This ordering file provides the
-CA order that may be used to create efficient communication pattern, that
+The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
+in the same directory where the OpenSM log resides. This ordering file provides
+the CN order that may be used to create efficient communication pattern, that
 will match the routing tables.
 
 Activation through OpenSM
 
 Use '-R ftree' option to activate the fat-tree algorithm.
+Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
+is not used, routing algorithm will detect roots automatically.
+Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
+is not used, all the CAs are considered as compute nodes.
 
 Note: LMC > 0 is not supported by fat-tree routing. If this is
 specified, the default routing algorithm is invoked instead.
-- 
1.5.1.4

> Yevgeny,
> 
> Could you elaborate on this ? Thanks.
> 
> -- Hal
> 
> > Or maybe there are some nice docs for me to sink my teeth into...
> > 
> > /Peter
> > 
> > ______________________________________________________________________
> > 
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > 
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From Mark.Seger at hp.com  Wed Jul 11 08:56:31 2007
From: Mark.Seger at hp.com (Mark Seger)
Date: Wed, 11 Jul 2007 11:56:31 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com>
References: <46826370.4090602@hp.com>	
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>	
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>	
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>	
	<1182978496.28870.106214.camel@hal.voltaire.com>	
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>	
	<20070710094659.50df9b39.weiny2@llnl.gov>	
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com>
	<4694EE55.6050107@hp.com>
	<6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com>
Message-ID: <4694FDAF.2080001@hp.com>


>Hi Marc,
>
>I wish I had a large enough fabric worth testing collectl on...
>  
>
there may be a disconnect here as collectl collects data locally.  on a 
typical system, taking 10 second samples for all the different 
subsystems it supports (though you can certainly turn up the frequency if 
you like) takes about 2MB/day, and the data is retained for a week.  It 
supports OFED out-of-the-box, using perfquery to read/clear the counters.  
Just install it and type:

collectl -scmx -oTm (lots of other combinations of choices)

and you'll see cpu, memory and interconnect data with millisecond 
timestamps as follows:

#             <--------CPU--------><-----------Memory----------><----------InfiniBand---------->
#Time         cpu sys inter  ctxsw free buff cach inac slab  map   KBin  pktIn  KBOut pktOut Errs
11:55:06.004    0   0   261     44   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:07.004    0   0   275     61   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:08.004    0   0   251     18   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:09.004    0   0   251     23   7G  46M 268M 151M 249M  21M      0      0      0      0    0

>I did the math for how much data would be collected for 10Knodes
>cluster. It is ~7MB for each iteration: 
>10K ports 
>* 6 (3 level fabric * 2 ports on each link)
>* 32 byte (data/pkts tx/rx) + 22byte (err counters) + 64byte (cong
>counters) = 116bytes
>
>Seems reasonable - but adds up to large amount of data over a day period
>assuming a collect every second:
>24*60*60 *116*10000*6 = 6.01344e+11 Bytes of storage
>  
>
no disagreement.  that's why I chose NOT to try to solve the distributed 
data collection problem.  collectl runs locally with <0.1% cpu overhead.
-mark
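
The storage estimate quoted above can be checked with a few lines of
Python. This is only a sketch of the arithmetic: the function names are
made up for illustration, and the per-port sizes are the ones listed in
the quoted message. Note that the three listed sizes add up to 118 bytes
while the message works with 116, so both figures are computed.

```python
# Per-port counter sizes in bytes, as listed in the quoted message.
DATA_BYTES = 32   # data/pkts tx/rx
ERR_BYTES = 22    # error counters
CONG_BYTES = 64   # congestion counters

SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_sweep(nodes, ports_per_node, bytes_per_port):
    """Bytes collected by one pass over every port in the fabric."""
    return nodes * ports_per_node * bytes_per_port


def bytes_per_day(nodes, ports_per_node, bytes_per_port, interval_s=1):
    """Bytes collected in 24 hours when sweeping every interval_s seconds."""
    sweeps = SECONDS_PER_DAY // interval_s
    return sweeps * bytes_per_sweep(nodes, ports_per_node, bytes_per_port)


QUOTED_TOTAL = 116                                   # figure used in the message
SUMMED_TOTAL = DATA_BYTES + ERR_BYTES + CONG_BYTES   # 118

if __name__ == "__main__":
    # 10K nodes, 6 ports each (3-level fabric, 2 ports per link)
    print(bytes_per_sweep(10_000, 6, QUOTED_TOTAL))      # 6960000 (~7MB)
    print(bytes_per_day(10_000, 6, QUOTED_TOTAL))        # 601344000000
    print(bytes_per_day(10_000, 6, SUMMED_TOTAL))        # 611712000000
    print(bytes_per_day(10_000, 6, QUOTED_TOTAL, 10))    # 60134400000
```

Sampling every 10 seconds instead of every second divides the daily
total by ten, which is still about 60GB for a fabric of this size.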

>Eitan Zahavi
>Senior Engineering Director, Software Architect
>Mellanox Technologies LTD
>Tel:+972-4-9097208
>Fax:+92-4-9593245
>P.O. Box 586 Yokneam 20692 ISRAEL
>
> 
>
>  
>
>>-----Original Message-----
>>From: Mark Seger [mailto:Mark.Seger at hp.com] 
>>Sent: Wednesday, July 11, 2007 5:51 PM
>>To: Eitan Zahavi
>>Cc: Hal Rosenstock; Ira Weiny; general at lists.openfabrics.org; 
>>Ed.Finn at FMR.COM
>>Subject: Re: [ofa-general] IB performance stats (revisited)
>>
>>
>>
>>Eitan Zahavi wrote:
>>
>>>Hi Marc,
>>>
>>>I published an RFC and later had discussions regarding the distribution
>>>of query ownership of switch counters.
>>>Making this ownership purely dynamic, semi-dynamic or even static is an
>>>implementation tradeoff.
>>>However, it can be shown that the maximal number of switches a single
>>>compute node would be responsible for is <= number of switch levels. So
>>>no problem to get counters every second...
>>>
>>>The issue is: what do you do with the size of data collected?
>>>This is only relevant if monitoring is run in "profiling mode"
>>>otherwise only link health errors should be reported.
>>>
>>I typically use IB performance data for system/application 
>>diagnostics.  I run a tool I wrote (see
>>http://sourceforge.net/projects/collectl/) as a service on 
>>most systems and it gathers hundreds of performance 
>>metrics/counters on everything from cpu load, memory, 
>>network, infiniband, disk, etc.  The philosophy here is that 
>>if something goes wrong, it may be too late to then run some 
>>diagnostic.  Rather you need to have already collected the 
>>data, especially if this is an intermittent problem.  When 
>>there is no need to look at the data, it just gets purged 
>>after a week.
>>
>>There have been situations where someone reports a batch 
>>program they ran the other day was really slow and they 
>>didn't change anything.  By being able to pull up a 
>>monitoring log and seeing what the system was doing at the 
>>time of the run might reveal their network was saturated and 
>>therefore their MPI job was impacted.  You can't very well 
>>turn on diagnostics and rerun the application because system 
>>conditions have probably changed.
>>
>>Does that help?  Why don't you try installing collectl and 
>>see what it does...
>>
>>-mark

From cap at nsc.liu.se  Wed Jul 11 09:17:09 2007
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 11 Jul 2007 18:17:09 +0200
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <4694F085.4010502@hp.com>
References: <46826370.4090602@hp.com>
	<1184163750.17622.96256.camel@hal.voltaire.com>
	<4694F085.4010502@hp.com>
Message-ID: <200707111817.09205.cap@nsc.liu.se>

On Wednesday 11 July 2007, Mark Seger wrote:
> I don't know if what I've been proposing requires any rearchitecting as
> I see it as something local to each node.  Specifically, the idea (and there is
> already an implementation of this in an earlier voltaire stack) is to
> export wrapping HCA counters to /proc.  The module that does this
> read/clears the counters on every access but since no local applications
> are accessing the counters directly, clearing them doesn't hurt anyone.  
> Alas, anyone else who wants to query the counters will find them reset.
>
> The other side benefit of exporting these counters in such a way is that now
> lots of others can collect/report this info.  In other words, if someone
> chose to add IB stats to sar, it would become very easy to do!

I for one would be very happy to have this option. To be able to get simple 
but real-time data on a specific node. I'm amazed that the counters are 
non-wrapping or even that they are 32-bit...

/Peter

> If this is the type of thing people are interested in, I might be able
> to supply some code to do it.

From weiny2 at llnl.gov  Wed Jul 11 09:19:21 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 11 Jul 2007 09:19:21 -0700
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com>
	<46826FB8.10904@hp.com> <46827BA0.6070008@hp.com>
	<1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com>
	<1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com>
Message-ID: <20070711091921.2ef4ef2e.weiny2@llnl.gov>

On Wed, 11 Jul 2007 17:03:35 +0300
"Eitan Zahavi" <eitan at mellanox.co.il> wrote:

> > > 
> > > Was the cluster running a job at the time of the query ?

No, that testing was not completed.

Ira


From halr at voltaire.com  Wed Jul 11 09:21:51 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 12:21:51 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <4694F085.4010502@hp.com>
References: <46826370.4090602@hp.com>
	<1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com>
	<46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com>
	<4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<1184163750.17622.96256.camel@hal.voltaire.com>
	<4694F085.4010502@hp.com>
Message-ID: <1184170906.17622.104663.camel@hal.voltaire.com>

On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> Hal Rosenstock wrote:
> 
> >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> >  
> >
> >>My basic philosophy, and I suspect there are those who might disagree, 
> >>is that you can't use the network to monitor the network, at least not 
> >>in times of trouble.
> >>    
> >>
> >
> >Right, in times of certain troubles.
> >  
> >
> and that is the key.  since you can't know a priori when you're about to 
> have troubles, you need to be collecting the data locally before they occur.
> 
> >>That's why I insist on having to query the HCAs 
> >>directly since I can't always be sure the network is there and/or 
> >>reliable.  If you are willing to concede that this can indeed happen 
> >>then the question becomes one of how do you reliably get data from an 
> >>HCA and that's the basis for my (re)starting this discussion.
> >>    
> >>
> >
> >The reliability comes from timeout/retry mechanisms. If performance data
> >cannot be obtained on an IB network, it needs to be troubleshot at a
> >lower level (by SMPs).
> >
> >In any case, a rearchitecture of the PMA was proposed and seems
> >reasonable to me in that it can accommodate either approach. All that is
> >needed now is for someone to step up and champion an implementation of
> >this. Unfortunately, I do not have time to do so.
> >  
> >
> I don't know if what I've been proposing requires any rearchitecting as 
> I see it as something local to each node.  Specifically, the idea (and there is 
> already an implementation of this in an earlier voltaire stack) is to 
> export wrapping HCA counters to /proc.  The module that does this 
> read/clears the counters on every access but since no local applications 
> are accessing the counters directly, clearing them doesn't hurt anyone.  
> Alas, anyone else who wants to query the counters will find them reset.

No local application but perhaps a remote one. This is the reason for
the proposed rearchitecture (along with synthesizing the wider
counters).
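
To make the read/clear idea concrete, here is a rough Python sketch of a
single local owner that folds read-and-clear samples from a 32-bit
hardware counter into a wider software counter that wraps instead of
sticking. The class and method names are hypothetical; this is neither
the Voltaire /proc module nor the proposed PMA rearchitecture, just an
illustration of the accumulation step under the assumption that the
hardware counter stops at its maximum value.

```python
HW_MAX = 0xFFFFFFFF           # 32-bit hardware counter, assumed to stick at max
WIDE_MASK = (1 << 64) - 1     # software counter wraps at 64 bits


class WideCounter:
    """Accumulates read-and-clear samples into a wrapping 64-bit total."""

    def __init__(self):
        self.total = 0
        self.saturated_reads = 0   # samples that had already hit HW_MAX

    def absorb(self, hw_value):
        """Add one sample taken just before the hardware counter was cleared."""
        if not 0 <= hw_value <= HW_MAX:
            raise ValueError("sample does not fit a 32-bit counter")
        if hw_value == HW_MAX:
            # The hardware stopped counting at some point before this read,
            # so the true delta is unknown; record that data was lost.
            self.saturated_reads += 1
        self.total = (self.total + hw_value) & WIDE_MASK
        return self.total


if __name__ == "__main__":
    rx_bytes = WideCounter()
    for sample in (1200, 0, HW_MAX, 350):
        rx_bytes.absorb(sample)
    print(rx_bytes.total, rx_bytes.saturated_reads)
```

Every consumer, local or remote, would then read the wide total rather
than the hardware register, so nobody finds the counter unexpectedly
reset. The sampling interval has to be short enough that the hardware
counter does not saturate between reads.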

-- Hal

> The other side benefit of exporting these counters in such a way is that now 
> lots of others can collect/report this info.  In other words, if someone 
> chose to add IB stats to sar, it would become very easy to do!
> 
> If this is the type of thing people are interested in, I might be able 
> to supply some code to do it.
> 
> >>As for querying the switch for counters, what do you do on a very large 
> >>network, say 10s of thousands of nodes if you want to get performance 
> >>data every second?  I also realize this is an extreme situation today 
> >>(the node count not the frequency of monitoring) but I'm sure everyone 
> >>would agree systems of these sizes are not that far off.
> >>    
> >>
> >
> >You have a distributed performance manager to handle this. A hierarchy
> >of performance managers has been discussed on the list before.
> >  
> >
> ahh, I see.
> -mark
> 
> 


From rick.jones2 at hp.com  Wed Jul 11 09:37:42 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 11 Jul 2007 09:37:42 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <20070711061444.GG11320@mellanox.co.il>
References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il>
Message-ID: <46950756.5090501@hp.com>

Michael S. Tsirkin wrote:
>>Quoting Rick Jones <rick.jones2 at hp.com>:
>>Subject: should it be possible to run SDP over a T320?
>>
>>Hi -
>>
>>I was talking to someone about the numbers I'd gathered for IPoIB with 
>>OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520 
>>did some non-trivial things to bulk transfer performance.
> 
> 
> Was this data posted on-list? I didn't see it.
> 

Hasn't been.  I presume that folks are curious?-)

rick


From rdreier at cisco.com  Wed Jul 11 09:57:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jul 2007 09:57:53 -0700
Subject: [ofa-general] What should a ULP pass as ib_create_cq(...,
	comp_vector) ?
In-Reply-To: <EXNANE01xT3lEmFWqbT00000c57@exnane01.hq.netapp.com> (Thomas
	Talpey's message of "Wed, 11 Jul 2007 07:50:37 -0400")
References: <EXNANE01xT3lEmFWqbT00000c57@exnane01.hq.netapp.com>
Message-ID: <adaejje98tq.fsf@cisco.com>

 > I notice the ib_create_cq() comp_vector support is merged in 2.6.22.
 > I don't completely understand what a ULP needs to pass as the argument.
 > 
 > I'm currently passing 0 in the NFS/RDMA client, what in general should I
 > consider using as a value? Or put another way, why is this exposed to
 > the ULP? Isn't this the MSI-X vector table index, a rather low-level thing
 > to hand to the ULP to manage?

You need to pass a value in the range 0 ... num_comp_vectors-1.  Since
every driver currently sets num_comp_vectors to 1, hard-coding your
value to 0 is a reasonable thing to do -- it's what every other ULP
does at the moment.

This value is *NOT* the MSI-X vector table index.  It's basically the
"completion event handler identifier" that the IB spec v. 1.2 talks
about.  It would be perfectly valid for a non-PCI device such as ehca
(for which the concept of MSI-X does not apply at all) to support
multiple completion vectors.  And the consumer is really the only
entity that can make a good choice of how to divide up CQs, since only
the consumer really knows which CQ event handlers might want to run in
parallel.

However on another level your question gets to the reason why we
haven't implemented support for multiple completion event vectors.
Namely, it's not clear how consumers, kernel or userspace, can make a
good choice of which vector to assign a given CQ to.  For example an
MPI implementation would probably want one vector per CPU so that it
can direct events for a given process to the CPU that the process is
running on; but there's no simple way to implement that policy.

 - R.
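
As a rough illustration of the range constraint above, the sketch below
maps a CPU index onto a completion vector index in Python. The function
name and the modulo policy are assumptions made for illustration only;
this is not kernel code and not what any existing ULP does. With
num_comp_vectors equal to 1 it always yields 0, which matches the
hard-coded value used today.

```python
def pick_comp_vector(cpu, num_comp_vectors):
    """Return an index in the range 0 ... num_comp_vectors-1 for this CPU."""
    if num_comp_vectors < 1:
        raise ValueError("device must expose at least one completion vector")
    if cpu < 0:
        raise ValueError("cpu index must be non-negative")
    # Simple round-robin spread of CPUs over the available vectors.
    return cpu % num_comp_vectors


if __name__ == "__main__":
    for cpu in range(6):
        print(cpu, pick_comp_vector(cpu, 1), pick_comp_vector(cpu, 4))
```

This only shows how a consumer could stay inside the valid range. It
does not solve the harder problem described above, namely knowing which
CPU a given vector's events are actually delivered on.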


From caitlinb at broadcom.com  Wed Jul 11 10:41:53 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 11 Jul 2007 10:41:53 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <469290C5.6010709@Sun.COM>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com>

Don.Kerr at Sun.COM wrote:
> I am working on a uDAPL layer for Open MPI.  The situation is
> if I have more than one port/HCA my users may want to be
> selective in what is used and to do this they would need to
> provide some information regarding which port/HCA to use. So
> my thought is that the users are more familiar with the output
> from "ifconfig", for example ib0, ib1, etc, and I was trying
> to find a way to correlate that to what is available from the
> uDAPL API. Maybe I need to reprogram them to look at dat.conf.
> 
> -DON
> 

You definitely do not want to parse dat.conf, you want to see
what the dat_registry has loaded. dat.conf is static, Providers
are allowed to dynamically adapt how they register themselves.
I don't believe that is an active concern, but it's simpler to
take advantage of the existing code and be safe in case somebody
comes along later and decides to do dynamic registration only.

But you hit the nail on the head in terms of needing to correlate
devices as reported by "ifconfig" and the Interface Adapter that
you try to open.

Basically, the intent has always been that the correlation between
an Interface Adapter and an "ifconfig" entry should be so obvious
that a complete idiot could figure out which went with which.
Once that linkage is clear then you merely use the RDMA device/port
implied by the routing of the device listed by ifconfig.

To the best of my knowledge, for every DAPL provider ever created
the correlation with the IP layer device has indeed been so obvious
that any idiot could figure it out -- unfortunately software can only
hope to someday reach that degree of intelligence, and other than 
configuring the links there really isn't much that can be done.

Once there is a link between the RDMA device and the IP layer device,
you could use the routing tables to determine which port a connection
request could be received on, which ports could originate a packet with
a given IP address and which ports could send a packet to a given IP
destination. Given that, you want the matching RDMA device.

Such a linkage would allow the application to correctly determine
the exact DAPL Provider that needed to be opened, and only
that one. Without it the application has to scan the registry list
and essentially do a serial search. The good news is that it won't
be a very long serial search and it doesn't have to be performed
that often.


From swise at opengridcomputing.com  Wed Jul 11 11:04:35 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Jul 2007 13:04:35 -0500
Subject: [ofa-general] [PATCH 2.6.23] iw_cxgb3: remove the cm_id reference on
	listen failures.
Message-ID: <20070711180435.11665.71117.stgit@dell3.ogc.int>


iw_cxgb3: remove the cm_id reference on listen failures.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 3b41dc0..5dc68cd 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1914,6 +1914,7 @@ int iwch_create_listen(struct iw_cm_id *
 fail3:
 	cxgb3_free_stid(ep->com.tdev, ep->stid);
 fail2:
+	cm_id->rem_ref(cm_id);
 	put_ep(&ep->com);
 fail1:
 out:


From swise at opengridcomputing.com  Wed Jul 11 11:11:43 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 11 Jul 2007 13:11:43 -0500
Subject: [ofa-general] GIT PULL ofed_1_2] iw_cxgb3: remove the cm_id
 reference on listen failures.
Message-ID: <46951D5F.3090208@opengridcomputing.com>

Vlad,

Please pull the fix for bug 686 from

git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2

Thanks,

Steve.

-----

iw_cxgb3: remove the cm_id reference on listen failures.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

  drivers/infiniband/hw/cxgb3/iwch_cm.c |    1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 4175991..08986fb 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -1912,6 +1912,7 @@ int iwch_create_listen(struct iw_cm_id *
  fail3:
  	cxgb3_free_stid(ep->com.tdev, ep->stid);
  fail2:
+	cm_id->rem_ref(cm_id);
  	put_ep(&ep->com);
  fail1:
  out:


From Thomas.Talpey at netapp.com  Wed Jul 11 11:57:09 2007
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 11 Jul 2007 14:57:09 -0400
Subject: [ofa-general] What should a ULP pass as ib_create_cq(...,
	comp_vector) ?
In-Reply-To: <adaejje98tq.fsf@cisco.com>
References: <EXNANE01xT3lEmFWqbT00000c57@exnane01.hq.netapp.com>
	<adaejje98tq.fsf@cisco.com>
Message-ID: <EXNANE01K0pWHItq6eO00000c5e@exnane01.hq.netapp.com>

At 12:57 PM 7/11/2007, Roland Dreier wrote:
>However on another level your question gets to the reason why we
>haven't implemented support for multiple completion event vectors.
>Namely, it's not clear how consumers, kernel or userspace, can make a
>good choice of which vector to assign a given CQ to.

Got it, thanks. But aren't the vectors shared across all consumers on
an HCA? As such, it seems problematic to expect consumers to make
optimal choices, since they have no way of knowing what other consumers
are doing.

In any case, all NFS/RDMA does is to check the completion status, queue
the event and schedule a tasklet, so there is little or no parallelism to be
gained in the upcall. I'd prefer to not have to wait for other ULPs on the
same vector, of course.

Tom.


From rick.jones2 at hp.com  Wed Jul 11 12:01:35 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Wed, 11 Jul 2007 12:01:35 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <46950756.5090501@hp.com>
References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il>
	<46950756.5090501@hp.com>
Message-ID: <4695290F.7090005@hp.com>

>> Was this data posted on-list? I didn't see it.
>>
> 
> Hasn't been.  I presume that folks are curious?-)

		      RedHat Enterprise Linux 5
		      Single-Stream Performance

                              Bulk Transfer                "Latency"
                          Unidir            Bidir
     Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
-------------------------------------------------------------------------
  AD313A  IPoIB 1.1 2970 4.418 4.544  3530  3.59  3.95  19290 n/a   n/a
  AD313A  SDP   1.1 7810 0.453 1.048 12820  0.69  0.68  38030 26.29 26.29
  AD313A  SDP p0    7810 0.346 0.527 12670  0.42  0.43  19380 n/a   n/a
  AD313A  IPoIB 1.2 5510 0.426 1.593  5730  n/a   n/a   18990 n/a   n/a
  AD313A  SDP   1.2 7820 0.409 1.047 12890  0.64  0.68  41988 25.89 26.32
  AD313A SDP p0 1.2 7820 0.309 0.517 12760  0.36  0.36  19800 15.47 15.72

netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM, 
SDP_STREAM), -s 1M -S 1M -r 64K -b 12 for the bidirectional [SDP|TCP]_RR 
test, -r 1 for the [TCP|SDP]_RR test.

1.1 - OFED 1.1 bits
1.2 - OFED 1.2 bits
p0  - send_poll and recv_poll set to 0
SD - service demand in microseconds of CPU time consumed per unit of 
work - per KB transferred for the bulk tests, per transaction on the 
latency test.  'x' is transmit 'r' is receive

lspci for the AD313A shows:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex 
(Tavor compatibility mode) (rev 20)

rick jones


From HNGUYEN at de.ibm.com  Wed Jul 11 12:27:26 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 11 Jul 2007 21:27:26 +0200
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <1184097931.3020.73.camel@localhost.localdomain>
Message-ID: <OF79FB2152.6DE55EBA-ONC1257315.006AAB0A-C1257315.006AE01D@de.ibm.com>

> With this patch, idr.c should work as advertised allocating id
> values in the range 0...0x7fffffff.  Andrew had speculated that
> it should allow the full range 0...0xffffffff to be used.  I was
> tempted to make changes to allow this, but it would require changes
> to API, e.g. making the starting id value and the return value
> unsigned.
Hi Jim, thanks much for this patch. It should work fine as far
as I can read. Will give it a try in next couple of days.
Nam


From kliteyn at mellanox.co.il  Wed Jul 11 12:56:22 2007
From: kliteyn at mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 11 Jul 2007 22:56:22 +0300
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <1184169244.17622.102683.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>	<1184091830.17622.12007.camel@hal.voltaire.com>	<200707102111.28374.cap@nsc.liu.se>	<1184094759.17622.15371.camel@hal.voltaire.com>
	<1184169244.17622.102683.camel@hal.voltaire.com>
Message-ID: <469535E6.8080705@mellanox.co.il>


Hal Rosenstock wrote:
> On Tue, 2007-07-10 at 15:12, Hal Rosenstock wrote:
>   
>> On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote:
>>     
>>> On Tuesday 10 July 2007, Hal Rosenstock wrote:
>>> ...
>>>       
>>>>> Management:
>>>>>       * Multiple partitions
>>>>>       * OpenSM
>>>>>               * More routing performance improvements
>>>>>               * Even more speedups
>>>>>               * Better packaging/installation
>>>>>               * “Native” daemon mode
>>>>>               * Performance management
>>>>>               * Quality of Service manager: Based on IBTA annex
>>>>>           
>>>> enhancements for fat tree routing (non pure tree support)
>>>> more console commands and telnet access to console
>>>>         
>>> Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and 
>>> in which way OFED-1.2 opensm performs badly for these?
>>>       
>
> The following patch contains some of the answers to the above:
>   
Hi guys. Sorry for the delay.

Anyway, the patch does answer the question, but I'll add my two cents
anyway.

The fat-tree algorithm optimizes routing for "shift" communication pattern.
Before the latest change, the topology that the fat-tree routing engine
could handle had to be a pure fat-tree, and by "pure" I mean completely
symmetrical tree that complies with the following rules:
  - Switches of the same rank should have the same number
    of UP-going port groups*, unless they are root switches,
    in which case they shouldn't have UP-going ports at all.
  - Switches of the same rank should have the same number
    of DOWN-going port groups, unless they are leaf switches.
  - Switches of the same rank should have the same number
    of ports in each UP-going port group.
  - Switches of the same rank should have the same number
    of ports in each DOWN-going port group.
  - *All* the CAs have to be at the same tree level (rank),
    doesn't matter if they are compute nodes or management nodes.

Any other topology will cause fat-tree routing to fail and OpenSM
would fall back to default routing. Note that this also means that
in a symmetrical fat-tree any link failure (except for the links
between CAs and leaf switches) will break the fabric symmetry and
the routing will fall back to default.

With the recent changes, the user can supply a list of root guids and
a list of compute node guids, and fat-tree routing is then able to
handle trees that are not symmetrical. The topology only has to comply
with this (very) reduced set of constraints:
  - All the Compute Nodes have to be at the same tree level (rank).
    Note that non-compute node CAs are allowed here to be at different
    tree ranks.

But of course, the less symmetrical the tree is, the worse the routing
results will be.
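
As a rough illustration of the reduced constraint above, the check could
be sketched as follows. This is a minimal sketch and not OpenSM code:
struct ca_info and compute_nodes_share_rank() are hypothetical names
invented for the example, and the ranks are assumed to have been computed
already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a channel adapter; not an OpenSM type. */
struct ca_info {
	unsigned rank;        /* tree level at which the CA is attached */
	bool is_compute_node; /* true if listed in the compute node guid file */
};

/*
 * Returns true if every compute node sits at the same rank.
 * Non-compute CAs are skipped, so they may sit at any rank.
 */
bool compute_nodes_share_rank(const struct ca_info *cas, size_t n)
{
	bool seen = false;
	unsigned rank = 0;

	assert(cas != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!cas[i].is_compute_node)
			continue;
		if (!seen) {
			/* First compute node fixes the reference rank. */
			rank = cas[i].rank;
			seen = true;
		} else if (cas[i].rank != rank) {
			return false;
		}
	}
	return true;
}
```

A management node attached at a different level does not make the check
fail; only a compute node at a different level does.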

-- Yevgeny

> -----Forwarded Message-----
>
> From: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> To: Hal Rosenstock <halr at voltaire.com>
> Cc: OpenIB <general at lists.openfabrics.org>
> Subject: [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree
> Date: 09 Jul 2007 11:32:49 +0300
>
> Hi Hal.
>
> Updating doc and osm manpage with the 
> recent enhancement of fat-tree routing.
>
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/doc/current-routing.txt |   28 ++++++++++++++++++++++------
>  opensm/man/opensm.8            |   33 ++++++++++++++++++++++++++-------
>  2 files changed, 48 insertions(+), 13 deletions(-)
>
> diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
> index 9852ef0..76f91ba 100644
> --- a/opensm/doc/current-routing.txt
> +++ b/opensm/doc/current-routing.txt
> @@ -174,11 +174,14 @@ Fat-tree Routing Algorithm
>  Purpose:
>  
>  The fat-tree algorithm optimizes routing for "shift" communication pattern.
> -It should be chosen if a subnet is a symmetrical fat-tree of various types.
> +It should be chosen if a subnet is a symmetrical or almost symmetrical
> +fat-tree of various types.
>  It supports not just K-ary-N-Trees, by handling for non-constant K,
>  cases where not all leafs (CAs) are present, any CBB ratio.
>  As in UPDN, fat-tree also prevents credit-loop-deadlocks.
> -Fat-tree algorithm supports topologies that comply with the following rules:
> +
> +If the root guid file is not provided ('-a' or '--root_guid_file' options),
> +the topology has to be pure fat-tree that complies with the following rules:
>    - Tree rank should be between two and eight (inclusively)
>    - Switches of the same rank should have the same number
>      of UP-going port groups*, unless they are root switches,
> @@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules:
>      of ports in each UP-going port group.
>    - Switches of the same rank should have the same number
>      of ports in each DOWN-going port group.
> -*ports that are connected to the same remote switch are referenced as
> +  - All the CAs have to be at the same tree level (rank).
> +
> +If the root guid file is provided, the topology doesn't have to be pure
> +fat-tree, and it should only comply with the following rules:
> +  - Tree rank should be between two and eight (inclusively)
> +  - All the Compute Nodes** have to be at the same tree level (rank).
> +    Note that non-compute node CAs are allowed here to be at different
> +    tree ranks.
> +
> +* ports that are connected to the same remote switch are referenced as
>  'port group'.
> +** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file'
> +OpenSM options.
>  
>  Note that although fat-tree algorithm supports trees with non-integer CBB
>  ratio, the routing will not be as balanced as in case of integer CBB ratio.
>  In addition to this, although the algorithm allows leaf switches to have any
>  number of CAs, the closer the tree is to be fully populated, the more effective
>  the "shift" communication pattern will be.
> +In general, even if the root list is provided, the closer the topology to a
> +pure and symmetrical fat-tree, the more optimal the routing will be.
>  
> -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
> -same directory where the OpenSM log resides. This ordering file provides the
> -CA order that may be used to create efficient communication pattern, that
> +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
> +in the same directory where the OpenSM log resides. This ordering file provides
> +the CN order that may be used to create efficient communication pattern, that
>  will match the routing tables.
>  
>
> diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
> index 5f34cd1..5472faf 100644
> --- a/opensm/man/opensm.8
> +++ b/opensm/man/opensm.8
> @@ -603,7 +603,7 @@ UPDN Algorithm Usage
>  Activation through OpenSM
>  
>  Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
> -Use '-a <guid_list_file>' for adding an UPDN guid file that contains the
> +Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
>  root nodes for ranking.
>  If the `-a' option is not used, OpenSM uses its auto-detect root nodes
>  algorithm.
> @@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node.
>  Fat-tree Routing Algorithm
>  
>  The fat-tree algorithm optimizes routing for "shift" communication pattern.
> -It should be chosen if a subnet is a symmetrical fat-tree of various types.
> +It should be chosen if a subnet is a symmetrical or almost symmetrical
> +fat-tree of various types.
>  It supports not just K-ary-N-Trees, by handling for non-constant K,
>  cases where not all leafs (CAs) are present, any CBB ratio.
>  As in UPDN, fat-tree also prevents credit-loop-deadlocks.
>  
> -The Fat-tree algorithm supports topologies that comply with the following rules:
> +If the root guid file is not provided ('-a' or '--root_guid_file' options),
> +the topology has to be pure fat-tree that complies with the following rules:
>    - Tree rank should be between two and eight (inclusively)
>    - Switches of the same rank should have the same number
>      of UP-going port groups*, unless they are root switches,
> @@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules:
>      of ports in each UP-going port group.
>    - Switches of the same rank should have the same number
>      of ports in each DOWN-going port group.
> +  - All the CAs have to be at the same tree level (rank).
>  
> -Note: ports that are connected to the same remote switch are referenced as
> +If the root guid file is provided, the topology doesn't have to be pure
> +fat-tree, and it should only comply with the following rules:
> +  - Tree rank should be between two and eight (inclusively)
> +  - All the Compute Nodes** have to be at the same tree level (rank).
> +    Note that non-compute node CAs are allowed here to be at different
> +    tree ranks.
> +
> +* ports that are connected to the same remote switch are referenced as
>  \'port group\'.
>  
> +** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
> +OpenSM options.
> +
>  Topologies that do not comply cause a fallback to min hop routing.
>  Note that this can also occur on link failures which cause the topology
>  to no longer be "pure" fat-tree.
> @@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio.
>  In addition to this, although the algorithm allows leaf switches to have any
>  number of CAs, the closer the tree is to be fully populated, the more
>  effective the "shift" communication pattern will be.
> +In general, even if the root list is provided, the closer the topology to a
> +pure and symmetrical fat-tree, the more optimal the routing will be.
>  
> -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the
> -same directory where the OpenSM log resides. This ordering file provides the
> -CA order that may be used to create efficient communication pattern, that
> +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
> +in the same directory where the OpenSM log resides. This ordering file provides
> +the CN order that may be used to create efficient communication pattern, that
>  will match the routing tables.
>  
>  Activation through OpenSM
>  
>  Use '-R ftree' option to activate the fat-tree algorithm.
> +Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
> +is not used, routing algorithm will detect roots automatically.
> +Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
> +is not used, all the CAs are considered as compute nodes.
>  
>  Note: LMC > 0 is not supported by fat-tree routing. If this is
>  specified, the default routing algorithm is invoked instead.
>   


From caitlinb at broadcom.com  Wed Jul 11 13:48:49 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 11 Jul 2007 13:48:49 -0700
Subject: [ofa-general] What should a ULP pass as ib_create_cq(...,
	comp_vector) ?
In-Reply-To: <EXNANE01K0pWHItq6eO00000c5e@exnane01.hq.netapp.com>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475D35E@NT-IRVA-0750.brcm.ad.broadcom.com>

general-bounces at lists.openfabrics.org wrote:
> At 12:57 PM 7/11/2007, Roland Dreier wrote:
>> However on another level your question gets to the reason why we
>> haven't implemented support for multiple completion event vectors.
>> Namely, it's not clear how consumers, kernel or userspace, can make a
>> good choice of which vector to assign a given CQ to.
> 
> Got it, thanks. But aren't the vectors shared across all
> consumers on an HCA? As such, it seems problematic to expect
> consumers to make optimal choices, since they have no way of
> knowing what other consumers are doing.
> 
> In any case, all NFS/RDMA does is to check the completion
> status, queue the event and schedule a tasklet, so there is
> little or no parallelism to be gained in the upcall. I'd
> prefer to not have to wait for other ULPs on the same vector, of
> course. 
> 

What a single Consumer could do is to clump as many of their CQs
as possible into a single "bag" where serialization of notifications
for these CQs would have little detrimental impact on the application.
As you point out, for most applications this is all of their CQs.

This would presume that when the Consumer supplied too many, the
lower layers would simply say "tough" and combine some of them (achieving
less than optimal results, but better than having the OS assign notification
queues on a totally arbitrary basis).
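
One way to picture the "combine some of them" behaviour is a simple
fold of the consumer's bag index onto whatever number of vectors the
device really has. This is only a sketch under that assumption:
fold_bag_to_vector() is a hypothetical helper, not part of any verbs
API, and a real lower layer might use a different policy.

```c
#include <assert.h>

/*
 * Hypothetical helper: fold a consumer-chosen "bag" index onto one of the
 * num_vectors completion vectors the device actually offers. Bags beyond
 * the available count end up sharing a vector.
 */
unsigned fold_bag_to_vector(unsigned bag, unsigned num_vectors)
{
	assert(num_vectors > 0);
	return bag % num_vectors;
}
```

With four vectors, bags 1 and 5 would land on the same vector, so their
notifications would be serialized with respect to each other.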

To use the actual number implies that it would be meaningful for *each*
application to divide its CQs over that set, without any mechanism to
balance applications themselves. That would seem to imply that a typical
Consumer would have a large number of CQs, when I've never understood
the need for more than one per core per application.

At the minimum, if the actual number were published by the device, would
the kernel consumers actually be able to distribute their CQs over the
set?

Tom, I definitely agree that userland consumers have absolutely no way
to do that reasonably, but do you think it is plausible for the kernel to
do so for kernel-resident consumers? If not, what would be needed to bridge
that gap? Or is the need for parallelism so small amongst kernel completion
handlers that the kernel does not need this feature?


From arlin.r.davis at intel.com  Wed Jul 11 14:36:41 2007
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Wed, 11 Jul 2007 14:36:41 -0700
Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
Message-ID: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>

Sean,

OFED 1.2 removed the rdma_set_option call used to adjust response
timeout. We are running into some cases on larger clusters that require
longer timeouts than the default. Can you consider this rdma_cm patch
for OFED 1.2.1 that adds a module parameter for the response timeout?
Thanks.

Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

--- a/drivers/infiniband/core/cma.c	2007-07-11 10:46:48.000000000 -0700
+++ b/drivers/infiniband/core/cma.c	2007-07-11 10:54:16.000000000 -0700
@@ -58,6 +58,10 @@ MODULE_PARM_DESC(tavor_quirk, "Tavor per
 #define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_MAX_CM_RETRIES 15
 
+static int cma_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+module_param_named(cma_response_timeout, cma_response_timeout, int, 0644);
+MODULE_PARM_DESC(cma_response_timeout, "CMA_CM_RESPONSE_TIMEOUT default=20");
+
 static void cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device);
 
@@ -2157,7 +2161,7 @@ static int cma_resolve_ib_udp(struct rdm
 	req.path = route->path_rec;
 	req.service_id = cma_get_service_id(id_priv->id.ps,
 					    &route->addr.dst_addr);
-	req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8);
+	req.timeout_ms = 1 << (cma_response_timeout - 8);
 	req.max_cm_retries = CMA_MAX_CM_RETRIES;
 
 	ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req);
@@ -2216,8 +2220,8 @@ static int cma_connect_ib(struct rdma_id
 	req.flow_control = conn_param->flow_control;
 	req.retry_count = conn_param->retry_count;
 	req.rnr_retry_count = conn_param->rnr_retry_count;
-	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.remote_cm_response_timeout = cma_response_timeout;
+	req.local_cm_response_timeout = cma_response_timeout;
 	req.max_cm_retries = CMA_MAX_CM_RETRIES;
 	req.srq = id_priv->srq ? 1 : 0;
 
@@ -2344,7 +2348,7 @@ static int cma_accept_ib(struct rdma_id_
 	rep.private_data_len = conn_param->private_data_len;
 	rep.responder_resources = conn_param->responder_resources;
 	rep.initiator_depth = conn_param->initiator_depth;
-	rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT;
+	rep.target_ack_delay = cma_response_timeout;
 	rep.failover_accepted = 0;
 	rep.flow_control = conn_param->flow_control;
 	rep.rnr_retry_count = conn_param->rnr_retry_count;
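
Since the patch feeds the parameter straight into 1 << (timeout - 8), a
value below 8 would give a negative shift count. The sketch below shows
one way the conversion could be clamped. It is an assumption-laden
illustration, not part of the patch: sketch_timeout_ms() and the
SKETCH_* bounds are hypothetical names, and the range 8..31 is chosen
here only to keep the shift well defined.

```c
#include <assert.h>

/* Bounds chosen for this sketch only; they are not taken from the patch. */
#define SKETCH_MIN_TIMEOUT 8	/* below this, (timeout - 8) is negative */
#define SKETCH_MAX_TIMEOUT 31	/* keeps the shift inside a 32-bit int */

/* Convert a timeout exponent to milliseconds the way cma.c does:
 * 1 << (timeout - 8), after clamping the exponent to a safe range. */
int sketch_timeout_ms(int timeout)
{
	if (timeout < SKETCH_MIN_TIMEOUT)
		timeout = SKETCH_MIN_TIMEOUT;
	if (timeout > SKETCH_MAX_TIMEOUT)
		timeout = SKETCH_MAX_TIMEOUT;

	assert(timeout - 8 >= 0 && timeout - 8 < 31);
	return 1 << (timeout - 8);
}
```

With the default of 20 this yields 1 << 12 = 4096 ms, and each
increment of the parameter doubles the timeout.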


From sean.hefty at intel.com  Wed Jul 11 14:41:51 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 11 Jul 2007 14:41:51 -0700
Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
In-Reply-To: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
Message-ID: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com>

>OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We
>are running into some cases on larger clusters that require longer timeouts
>than the default. Can you consider this rdma_cm patch for OFED 1.2.1 that adds
>a module parameter for the response timeout? Thanks.

What's in it for me?  :)

>
>Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

Acked-by: Sean Hefty <sean.hefty at intel.com>

Vlad, can you add this for OFED 1.2.1?

- Sean


From halr at voltaire.com  Wed Jul 11 15:07:43 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Jul 2007 18:07:43 -0400
Subject: [ofa-general] Moving On
Message-ID: <1184191631.17622.128348.camel@hal.voltaire.com>

Hi,

After more than three years of having the pleasure of being involved
with PathForward, OpenIB, and now OpenFabrics, I have decided to move on
and to be involved from a different perspective and will be unable to
continue my current maintainership responsibility for IB management
(OpenSM and diagnostics). I hope to resurface sooner rather than later
:-) It's been a lot of fun to see how far this project has come in that
time, and I want to thank everyone for their support in improving OpenSM
and the management tools.

Sasha Khapyorsky from Voltaire will be taking over my maintainership of
management. He has been doing a lot of the "heavy lifting" for some time
now and I am confident it couldn't be left in better hands. He will be
taking over this starting on Friday 7/13 COB. As such, the git tree will
change from my tree to Sasha's. Stay tuned for specifics on this.

I will still be available for questions if needed @
hal.rosenstock at gmail.com

-- Hal


From arlin.r.davis at intel.com  Wed Jul 11 16:06:43 2007
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Wed, 11 Jul 2007 16:06:43 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <72F9F35DE96242A08DDC701F9A8EBC00@Gaucho>
Message-ID: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>


OFA maintainers,

Can we all agree on this process for download locations of
packages/libraries and the mechanism to pickup changes? If so, Jeff will
go ahead and update the download web page to pick up the links and
descriptions automatically.

Thanks,

-arlin

>-----Original Message-----
>From: Jeffrey Scott [mailto:jeff at splitrockpr.com]
>Sent: Tuesday, July 03, 2007 4:36 PM
>To: Davis, Arlin R
>Cc: 'Thad Omura'; Hefty, Sean; Smith, Stan; 'Vladimir Sokolovsky'; 'Tziporet Koren'
>Subject: RE: OFA website edits
>
>OK.  I think your idea is fine.  I'll wait for you to confirm when the
>format is agreed upon, and when the links are ready.
>
>
>-----Original Message-----
>From: Davis, Arlin R [mailto:arlin.r.davis at intel.com]
>Sent: Tuesday, July 03, 2007 11:29 AM
>To: Jeffrey Scott
>Cc: Thad Omura; Hefty, Sean; Smith, Stan; Vladimir Sokolovsky; Tziporet Koren
>Subject: RE: OFA website edits
>
>Jeff,
>
>After looking at this I think we need to agree on a standard mechanism
>and location for downloads similar to what we do for git.
>
>Maybe we could have maintainers that want an individual download link
>provide a public_html directory along with a description? We could then
>have the download page automatically set up links to all
>/home/user/public_html/ directories that exist, along with the
>description.
>
>For example the download page would look something like the following:
>
>Individual library releases:
>
>Link                                    Project Description
>
>http://www.openfabrics.org/~ardavis/    uDAPL libraries and Documentation: 1.2-1 and 2.0
>http://www.openfabrics.org/~shefty/     rdma_cm library: 1.0.1
>
>etc...
>
>
>OFED Releases and Binary Packages:
>
>Link                                    Project Description
>
>Download Binary RPMS
>Download Old Releases
>
>	Maybe Vlad could provide these and set this up under his
>public_html directory?
>
>
>OFED Development
>
>Link                                    Project Description
>
>http://www.openfabrics.org/git/         Linux git development tree
>http://openib.tc.cornell.edu            Windows WIKI, svn development tree
>
>
>Is something like this possible?
>
>Comments?
>
>-arlin


From rdreier at cisco.com  Wed Jul 11 16:12:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 11 Jul 2007 16:12:07 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	(Arlin R. Davis's message of "Wed, 11 Jul 2007 16:06:43 -0700")
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
Message-ID: <adamyy27cxk.fsf@cisco.com>

 > Can we all agree on this process for download locations of
 > packages/libraries and the mechanism to pickup changes? If so, Jeff will
 > go ahead and update the download web page to pick up the links and
 > descriptions automatically.

What's the process we're agreeing to exactly?  I couldn't figure it
out from the email thread you quoted.

I like the current style of just being able to have a simple link like

    http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz

Is the proposal to change that?

Putting people's login names into the download URL seems like a step
backwards, as we've seen just recently with maintainer changes now
that Bryan and Hal have moved on -- breaking everyone's links just
because of someone changing jobs seems silly.  (And these URLs do get
embedded in RPMs etc. so it's worth having a canonical location for
each library)

 - R.


From mshefty at ichips.intel.com  Wed Jul 11 16:37:32 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 11 Jul 2007 16:37:32 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <adamyy27cxk.fsf@cisco.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	<adamyy27cxk.fsf@cisco.com>
Message-ID: <469569BC.60405@ichips.intel.com>

> I like the current style of just being able to have a simple link like
> 
>     http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz
> 
> Is the proposal to change that?

I think the intent is to find a way for the developer to publish new 
libraries and automatically update the downloads page to link to it.

- Sean


From vu at mellanox.com  Wed Jul 11 16:51:44 2007
From: vu at mellanox.com (Vu Pham)
Date: Wed, 11 Jul 2007 16:51:44 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <1184117324.22408.0.camel@gentoo-linux.localdomain>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>	
	<46926868.8000704@mellanox.com>	
	<1184042252.15067.8.camel@gentoo-linux.localdomain>	
	<4693B9E4.1070001@mellanox.com>
	<1184117324.22408.0.camel@gentoo-linux.localdomain>
Message-ID: <46956D10.4030904@mellanox.com>

Stanley Sufficool wrote:
> Is this the same as the README in the srpt_inc branch? That is the
> document I based the Wiki on (with a few embellishments).
>
>   

It's slightly different, with updates/corrections.
I need to update the readme in the srpt_inc branch with this one.


> On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote:
>
>   
>>> Added a new wiki page based on Vu Pham's readme and issues with recent 
>>> kernels. I hope to keep it current as I get our targets up and running.
>>>
>>>       
>> Thanks for doing this.
>> Please use the latest readme from this link - 
>> http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt
>>
>>
>>     
>>> http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation 
>>> <https://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation>
>>>
>>> WinIB initiators --> Gentoo Linux SRP Target.
>>>
>>>       
>> I mainly test linux initiators with gen2 srp-target. I have 
>> not tested win srp initiator with the target.
>>
>>     
>>> Anything wrong with the above approach, I would be interested in a best 
>>> practices if there is one. I saw a CentOS target post, is this more 
>>> stable or better performing?
>>>       
>> There is no difference when you run the same srp target / 
>> scst codes in CentOS or RH/SuSe linux distributions. The 
>> storage back-end will determine the performance
>>
>> -vu
>>
>>     
>>> Thanks.
>>>
>>> On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote:
>>>       
>>>> Stanley Sufficool wrote:
>>>>         
>>>>>   Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
>>>>>
>>>>> Got the latest srpt from the git repository on OpenFabrics and had the 
>>>>> following issues.
>>>>>
>>>>> ib_srpt.c    Line 1997, missing second argument, should be?   
>>>>> sdev->scst_tgt = scst_register(tp, NULL);
>>>>>
>>>>>           
>>>> Yes. You need the change if you test with top of scst svn 
>>>> trunk (or from version 0.9.6-pre2)
>>>> If you test with scst before 0.9.6-pre2 (ie. version <= 
>>>> 0.9.6-pre1) you don't need the second argument for 
>>>> scst_register()
>>>>
>>>>
>>>>         
>>>>> SCST was built successfully after fixing an issue in scst_vdisk.c 
>>>>> (missing #include <linux/sched.h>)
>>>>>           
>>>> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX 
>>>> - you should send the patch to scst devel
>>>>
>>>>         
>>>>> Just thought this would be nice to have documented, took me half a day 
>>>>> to track down as a novice in C programming.
>>>>>
>>>>>           
>>>> there is *lean and mean* srpt's README in srpt_inc
>>>> SCST also has some document
>>>> You can add some wiki/notes for the problems in openfabrics 
>>>> wiki page https://wiki.openfabrics.org/tiki-index.php
>>>>
>>>> -vu
>>>>
>>>>         
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> general at lists.openfabrics.org <mailto:general at lists.openfabrics.org>
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>>>           
>
>   


From vu at mellanox.com  Wed Jul 11 16:57:29 2007
From: vu at mellanox.com (Vu Pham)
Date: Wed, 11 Jul 2007 16:57:29 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <1184130141.22408.7.camel@gentoo-linux.localdomain>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>	
	<46926868.8000704@mellanox.com>	
	<1184042252.15067.8.camel@gentoo-linux.localdomain>	
	<4693B9E4.1070001@mellanox.com>
	<1184130141.22408.7.camel@gentoo-linux.localdomain>
Message-ID: <46956E69.6010908@mellanox.com>

Stanley Sufficool wrote:
> Do you have any reservations that the WinIB (Mellanox) SRP initiators
> will not work with SRPT? 
>   
There are two versions of SRPT: ofed/gen2 srpt and ibgold srpt.
You are working with ofed/gen2 srpt now.

The WinIB srp initiator works well with ibgold srpt.
I quickly tested the WinIB srp initiator with ofed/gen2 srpt. It sees the
target but not its lun - some debugging is required.

ibgold srpt only works with suse/sles 9 and rhel 4. It does not work with
rhel 5, sles10, or a vanilla kernel > 2.6.11.

Do you have any restriction on kernel, version of IB driver/srpt driver 
on the target machine?

> If there is any doubt, I need to know so that I can fall back to iSCSI
> over IPoIB (iSIPIB??? ;) )  . This has lots more overhead, but it's a
> sure bet until this can be worked out.
>


From ardavis at ichips.intel.com  Wed Jul 11 17:04:09 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 11 Jul 2007 17:04:09 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <adamyy27cxk.fsf@cisco.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	<adamyy27cxk.fsf@cisco.com>
Message-ID: <46956FF9.50102@ichips.intel.com>

Roland Dreier wrote:

> > Can we all agree on this process for download locations of
> > packages/libraries and the mechanism to pickup changes? If so, Jeff will
> > go ahead and update the download web page to pick up the links and
> > descriptions automatically.
>
>What's the process we're agreeing to exactly?  I couldn't figure it
>out from the email thread you quoted.
>
>I like the current style of just being able to have a simple link like
>
>    http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz
>
>Is the proposal to change that?
>  
>
The proposal was attempting to come up with a method to automatically
link to a package and description file from the download webpage. I have
no problem targeting http://openfabrics.org/downloads as long as we come
up with a way for the webpage to correlate a description with a package
without hand coding the links every time. We need a method for automatic
links to keep our download webpage updated and complete.

What if we add a directory for each project under downloads and provide 
a README for a description? Other suggestions?

-arlin


From jsquyres at cisco.com  Wed Jul 11 17:29:40 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 11 Jul 2007 20:29:40 -0400
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
Message-ID: <A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>

Just a ping again to make sure that this request doesn't get lost...

On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote:

> I notice that http://git.openfabrics.org/ shows the main OFA web  
> site, but http://git.openfabrics.org/git/ shows all the git  
> repositories.
>
> Can a redirect be installed such that http://git.openfabrics.org/  
> is automatically sent to http://git.openfabrics.org/git/?
>
> I think that would be a little more intuitive.
>
> Thanks!
>
> -- 
> Jeff Squyres
> Cisco Systems
>
>


-- 
Jeff Squyres
Cisco Systems


From sashak at voltaire.com  Wed Jul 11 19:47:17 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 12 Jul 2007 05:47:17 +0300
Subject: [ofa-general] [PATCH] opensm/updn: root detector function
	simplification
Message-ID: <20070712024716.GA2248@sashak.voltaire.com>


These are mostly cosmetic simplifications of the up/down root auto detector
function - reducing some vars and flows.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_updn.c |  142 ++++++++--------------------------------
 1 files changed, 28 insertions(+), 114 deletions(-)

diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
index c8d5a7f..faf4249 100644
--- a/opensm/opensm/osm_ucast_updn.c
+++ b/opensm/opensm/osm_ucast_updn.c
@@ -66,13 +66,6 @@ typedef enum _updn_switch_dir
   DOWN
 } updn_switch_dir_t;
 
-/* Histogram element - the number of occurences of the same hop value */
-typedef struct _updn_hist
-{
-  cl_map_item_t map_item;
-  uint32_t bar_value;
-} updn_hist_t;
-
 /* guids list */
 typedef struct _updn_input
 {
@@ -711,15 +704,12 @@ __osm_updn_find_root_nodes_by_min_hop(
   osm_switch_t *p_next_sw, *p_sw;
   osm_port_t   *p_next_port, *p_port;
   osm_physp_t  *p_physp;
-  uint32_t      numCas = 0;
-  uint32_t      numSws = cl_qmap_count(&p_osm->subn.sw_guid_tbl);
-  cl_qmap_t     min_hop_hist; /* Histogram container */
-  updn_hist_t  *p_updn_hist, *p_up_ht;
-  uint8_t       maxHops = 0; /* contain the max histogram index */
   uint64_t     *p_guid;
   cl_list_t    *p_root_nodes_list = p_updn->p_root_nodes;
+  double thd1, thd2;
+  unsigned i, cas_num = 0;
   unsigned *cas_per_sw;
-  uint16_t sw_lid_ho;
+  uint16_t lid_ho;
 
   OSM_LOG_ENTER( &p_osm->log, osm_updn_find_root_nodes_by_min_hop );
 
@@ -727,8 +717,6 @@ __osm_updn_find_root_nodes_by_min_hop(
            "__osm_updn_find_root_nodes_by_min_hop: "
            "Current number of ports in the subnet is %d\n",
            cl_qmap_count(&p_osm->subn.port_guid_tbl) );
-  /* Init the required vars */
-  cl_qmap_init( &min_hop_hist );
 
   cas_per_sw = malloc((IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw));
   if (!cas_per_sw) {
@@ -739,18 +727,6 @@ __osm_updn_find_root_nodes_by_min_hop(
   }
   memset(cas_per_sw, 0, (IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw));
 
-  /* EZ:
-     p_ca_list = (cl_list_t*)malloc(sizeof(cl_list_t)); 
-#if 0
-     if (!p_ca_list)
-     {
-
-     }
-#endif
-     cl_list_construct( p_ca_list ); 
-     cl_list_init( p_ca_list, 10 );
-  */
-
   /* Find the Maximum number of CAs (and routers) for histogram normalization */
   osm_log( &p_osm->log, OSM_LOG_VERBOSE,
            "__osm_updn_find_root_nodes_by_min_hop: "
@@ -764,128 +740,66 @@ __osm_updn_find_root_nodes_by_min_hop(
       p_physp = p_port->p_physp->p_remote_physp;
       if (!p_physp || !p_physp->p_node->sw)
         continue;
-      sw_lid_ho = osm_node_get_base_lid(p_physp->p_node, 0);
-      sw_lid_ho = cl_ntoh16(sw_lid_ho);
+      lid_ho = osm_node_get_base_lid(p_physp->p_node, 0);
+      lid_ho = cl_ntoh16(lid_ho);
       osm_log( &p_osm->log, OSM_LOG_DEBUG,
                "__osm_updn_find_root_nodes_by_min_hop: "
                "Inserting GUID 0x%" PRIx64 ", sw lid: 0x%X into array\n",
-               cl_ntoh64(osm_port_get_guid(p_port)), sw_lid_ho );
-      cas_per_sw[sw_lid_ho]++;
-      numCas++;
+               cl_ntoh64(osm_port_get_guid(p_port)), lid_ho );
+      cas_per_sw[lid_ho]++;
+      cas_num++;
     }
   }
+
+  thd1 = cas_num * 0.9;
+  thd2 = cas_num * 0.05;
   osm_log( &p_osm->log, OSM_LOG_DEBUG,
            "__osm_updn_find_root_nodes_by_min_hop: "
-           "Found %u CAs and RTRs, %u SWs in the subnet\n", numCas, numSws );
+           "Found %u CAs and RTRs, %u SWs in the subnet. "
+           "Thresholds are thd1 = %f && thd2 = %f\n",
+           cas_num, cl_qmap_count(&p_osm->subn.sw_guid_tbl), thd1, thd2);
+
   p_next_sw = (osm_switch_t*)cl_qmap_head( &p_osm->subn.sw_guid_tbl );
   osm_log( &p_osm->log, OSM_LOG_VERBOSE,
            "__osm_updn_find_root_nodes_by_min_hop: "
            "Passing through all switches to collect Min Hop info\n" );
   while( p_next_sw != (osm_switch_t*)cl_qmap_end( &p_osm->subn.sw_guid_tbl ) )
   {
-    uint16_t max_lid_ho, lid_ho;
+    unsigned hop_hist[IB_SUBNET_PATH_HOPS_MAX];
+    uint16_t max_lid_ho;
     uint8_t hop_val;
     uint16_t numHopBarsOverThd1 = 0;
     uint16_t numHopBarsOverThd2 = 0;
-    double thd1, thd2;
 
     p_sw = p_next_sw;
     /* Roll to the next switch */
     p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item );
 
-    /* Clear Min Hop Table && FWD Tbls - This should cause opensm to
-       rebuild its FWD tables, post setting Min Hop Tables */
+    memset(hop_hist, 0, sizeof(hop_hist));
+
     max_lid_ho = p_sw->max_lid_ho;
     /* Get base lid of switch by retrieving port 0 lid of node pointer */
-    sw_lid_ho = cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) );
     osm_log( &p_osm->log, OSM_LOG_DEBUG,
              "__osm_updn_find_root_nodes_by_min_hop: "
-             "Passing through switch lid 0x%X\n", sw_lid_ho );
+             "Passing through switch lid 0x%X\n",
+	     cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) ) );
     for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++)
-    {
-      /* Skip lids which are not CAs or RTRs - 
-         for histogram purposes we only care about CAs and RTRs */
-      
-      /* EZ:
-         boolean_t LidFound = FALSE;
-         cl_list_iterator_t ca_lid_iterator= cl_list_head(p_ca_list);
-         while( (ca_lid_iterator != cl_list_end(p_ca_list)) && !LidFound )
-         {
-         uint16_t *p_lid;
-         
-         p_lid = (uint16_t*)cl_list_obj(ca_lid_iterator);
-         if ( *p_lid == lid_ho )
-         LidFound = TRUE;
-         ca_lid_iterator = cl_list_next(ca_lid_iterator);
-         
-         }
-         if ( LidFound )
-      */
       if (cas_per_sw[lid_ho])
       {
         hop_val = osm_switch_get_least_hops( p_sw, lid_ho );
-        if (hop_val > maxHops)
-          maxHops = hop_val;
-        p_updn_hist = 
-          (updn_hist_t*)cl_qmap_get( &min_hop_hist, (uint64_t)hop_val );
-        if ( p_updn_hist == (updn_hist_t*)cl_qmap_end( &min_hop_hist ))
-        {
-          /* New entry in the histogram, first create it */
-          p_updn_hist = (updn_hist_t*) malloc(sizeof(updn_hist_t));
-          CL_ASSERT(p_updn_hist);
-          p_updn_hist->bar_value = 0;
-          cl_qmap_insert(&min_hop_hist, (uint64_t)hop_val, &p_updn_hist->map_item);
-          osm_log( &p_osm->log, OSM_LOG_DEBUG,
-                   "__osm_updn_find_root_nodes_by_min_hop: "
-                   "Creating new entry in histogram %u\n",
-                   hop_val );
-        }
-        /* Entry exists in the table, just increment the value */
-        p_updn_hist->bar_value += cas_per_sw[lid_ho];
-        osm_log( &p_osm->log, OSM_LOG_DEBUG,
-                 "__osm_updn_find_root_nodes_by_min_hop: "
-                 "Updating entry in histogram %u with bar value %d\n",
-                 hop_val, p_updn_hist->bar_value );
+        if (hop_val >= IB_SUBNET_PATH_HOPS_MAX)
+          continue;
+
+        hop_hist[hop_val] += cas_per_sw[lid_ho];
       }
-    }
 
     /* Now recognize the spines by requiring one bar to be above 90% of the
        number of CAs and RTRs */
-    thd1 = numCas * 0.9;
-    thd2 = numCas * 0.05;
-    osm_log( &p_osm->log, OSM_LOG_DEBUG,
-             "__osm_updn_find_root_nodes_by_min_hop: "
-             "Pass over the histogram value and found only one root node above "
-             "thd1 = %f && thd2 = %f\n", thd1, thd2 );
-
-    p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist );
-    while( p_updn_hist != (updn_hist_t*)cl_qmap_end( &min_hop_hist ) )
-    {
-      p_up_ht = p_updn_hist;
-      p_updn_hist = (updn_hist_t*)cl_qmap_next( &p_updn_hist->map_item ) ;
-      if ( p_up_ht->bar_value > thd1 )
+    for (i = 0 ; i < IB_SUBNET_PATH_HOPS_MAX; i++) {
+      if (hop_hist[i] > thd1)
         numHopBarsOverThd1++;
-      if ( p_up_ht->bar_value > thd2 )
+      if (hop_hist[i] > thd2)
         numHopBarsOverThd2++;
-      osm_log( &p_osm->log, OSM_LOG_DEBUG,
-               "__osm_updn_find_root_nodes_by_min_hop: "
-               "Passing through histogram - Hop Index %u: "
-               "numHopBarsOverThd1 = %u, numHopBarsOverThd2 = %u\n",
-               (uint16_t)cl_qmap_key((cl_map_item_t*)p_up_ht),
-               numHopBarsOverThd1, numHopBarsOverThd2 );
-    }
-
-    /* destroy the qmap table and all its content - no longer needed */
-    osm_log( &p_osm->log, OSM_LOG_DEBUG,
-             "__osm_updn_find_root_nodes_by_min_hop: "
-             "Cleanup: delete histogram "
-             "UPDN - Root nodes fetching by auto detect\n" );
-    p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist );
-    while ( p_updn_hist != (updn_hist_t*)cl_qmap_end( &min_hop_hist ) )
-    {
-      cl_qmap_remove_item( &min_hop_hist, (cl_map_item_t*)p_updn_hist );
-      free( p_updn_hist );
-      p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist );
     }
 
     /* If thd conditions are valid insert the root node to the list */
-- 
1.5.3.rc0.121.gfdbc
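
The core idea of the patch above, replacing the qmap-based histogram with a
fixed-size array indexed by hop count, can be sketched in standalone C. This
is only an illustration, not the opensm code: HOPS_MAX stands in for
IB_SUBNET_PATH_HOPS_MAX (its value here is an assumption), the helper names
are hypothetical, and the final acceptance test (exactly one bar over each
threshold) is an assumption, since the hunk ends before that condition.

```c
#include <assert.h>
#include <stddef.h>

#define HOPS_MAX 64	/* stand-in for IB_SUBNET_PATH_HOPS_MAX; value assumed */

/* Add the CAs behind one switch to the bar for its hop count. */
static void hist_add(unsigned hist[HOPS_MAX], unsigned hop_val, unsigned cas)
{
	if (hop_val >= HOPS_MAX)	/* out-of-range hop counts are skipped */
		return;
	hist[hop_val] += cas;
}

/* Count the histogram bars that are strictly above a threshold. */
static unsigned bars_over(const unsigned hist[HOPS_MAX], double thd)
{
	unsigned i, n = 0;

	for (i = 0; i < HOPS_MAX; i++)
		if (hist[i] > thd)
			n++;
	return n;
}

/* Assumed acceptance rule: exactly one bar over 90% and one over 5%. */
static int looks_like_root(const unsigned hist[HOPS_MAX], unsigned cas_num)
{
	double thd1 = cas_num * 0.9;
	double thd2 = cas_num * 0.05;

	assert(hist != NULL);
	return bars_over(hist, thd1) == 1 && bars_over(hist, thd2) == 1;
}
```

With a fixed array there is nothing to allocate or free per switch; a memset
before each pass is enough, which is where most of the deleted lines come
from.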


From yangdong at ncic.ac.cn  Thu Jul 12 00:06:51 2007
From: yangdong at ncic.ac.cn (yangdong)
Date: Thu, 12 Jul 2007 15:06:51 +0800
Subject: [ofa-general] How can i use the interface "rdma_xx" in linux kernel
Message-ID: <4695D30B.4090300@ncic.ac.cn>

So far, everything I have found is an introduction to the ib interface in the
Linux kernel, e.g. "Introduction to the InfiniBand Core Software",
Bob Woodruff, Sean Hefty, 2005 Linux Symposium. But the Linux kernel also has
the rdma_xxx interface, e.g. rdma_connect, rdma_listen, etc. How can
I use this interface? Please give me a tip.


From ogerlitz at voltaire.com  Thu Jul 12 00:09:28 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 10:09:28 +0300
Subject: [ofa-general] How can i use the interface "rdma_xx" in linux
	kernel
In-Reply-To: <4695D30B.4090300@ncic.ac.cn>
References: <4695D30B.4090300@ncic.ac.cn>
Message-ID: <4695D3A8.1080300@voltaire.com>

yangdong wrote:
> So far, what i see is all about introduction of ib interface in linux
> kernel, e.g. Introduction to the InfiniBand Core Software,
> Bob Woodruff,Sean Hefty, 2005 Linux Symposium. But in linux kernel there
> are also rdma_xxx interface, e.g. rdma_connect, rdma_listen,etc. How can
> i use these interface? Please give me a tip.

see include/rdma/rdma_cm.h

Or.


From ogerlitz at voltaire.com  Thu Jul 12 00:30:30 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 10:30:30 +0300
Subject: [ofa-general] How can i use the interface "rdma_xx" in linux
	kernel
In-Reply-To: <4695D48B.6050508@ncic.ac.cn>
References: <4695D30B.4090300@ncic.ac.cn> <4695D3A8.1080300@voltaire.com>
	<4695D48B.6050508@ncic.ac.cn>
Message-ID: <4695D896.70607@voltaire.com>

yangdong wrote:
> ok. However, just as rdma_conn_param structure, there are not enough
> info to tell me its member' meanings?
> How can i get these info?

If you have librdmacm installed through OFED, run

$ man rdma_connect

If not, go to

http://git.openfabrics.org/git/?p=~shefty/librdmacm.git;a=tree;f=man;h=c70c237c6e527dda4c6432f662a0331baffd4658;hb=HEAD

take rdma_connect.3, and run

$ nroff -man rdma_connect.3

Or.


> 
> Or Gerlitz wrote:
>> yangdong wrote:
>>   
>>> So far, what i see is all about introduction of ib interface in linux
>>> kernel, e.g. Introduction to the InfiniBand Core Software,
>>> Bob Woodruff,Sean Hefty, 2005 Linux Symposium. But in linux kernel there
>>> are also rdma_xxx interface, e.g. rdma_connect, rdma_listen,etc. How can
>>> i use these interface? Please give me a tip.
>>>     
>> see include/rdma/rdma_cm.h
>>
>> Or.
>>
>>
>>
>>
>>   
> 


From ogerlitz at voltaire.com  Thu Jul 12 00:44:35 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 10:44:35 +0300
Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
In-Reply-To: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
Message-ID: <4695DBE3.1070002@voltaire.com>

Davis, Arlin R wrote:
> Sean,
> 
> OFED 1.2 removed the rdma_set_option call used to adjust response
> timeout. We are running into some cases on larger clusters that require
> longer timeouts then the default. Can you consider this rdma_cm patch
> for OFED 1.2.1 that adds a module parameter for the response timeout?
> Thanks.
> 
> Signed-off by: Arlin Davis <ardavis at ichips.intel.com>

Sean,

You have approved this patch for OFED 1.2.1. Is it also suitable for 
upstream, and if not, how do you think we should proceed?

thanks,

Or.

> 
> --- a/drivers/infiniband/core/cma.c	2007-07-11 10:46:48.000000000
> -0700
> +++ b/drivers/infiniband/core/cma.c	2007-07-11 10:54:16.000000000
> -0700
> @@ -58,6 +58,10 @@ MODULE_PARM_DESC(tavor_quirk, "Tavor per
>  #define CMA_CM_RESPONSE_TIMEOUT 20
>  #define CMA_MAX_CM_RETRIES 15
>  
> +static int cma_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
> +module_param_named(cma_response_timeout, cma_response_timeout, int,
> 0644);
> +MODULE_PARM_DESC(cma_response_timeout, "CMA_CM_RESPONSE_TIMEOUT
> default=20");
> +
>  static void cma_add_one(struct ib_device *device);
>  static void cma_remove_one(struct ib_device *device);
>  
> @@ -2157,7 +2161,7 @@ static int cma_resolve_ib_udp(struct rdm
>  	req.path = route->path_rec;
>  	req.service_id = cma_get_service_id(id_priv->id.ps,
>  					    &route->addr.dst_addr);
> -	req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8);
> +	req.timeout_ms = 1 << (cma_response_timeout - 8);
>  	req.max_cm_retries = CMA_MAX_CM_RETRIES;
>  
>  	ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req);
> @@ -2216,8 +2220,8 @@ static int cma_connect_ib(struct rdma_id
>  	req.flow_control = conn_param->flow_control;
>  	req.retry_count = conn_param->retry_count;
>  	req.rnr_retry_count = conn_param->rnr_retry_count;
> -	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
> -	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
> +	req.remote_cm_response_timeout = cma_response_timeout;
> +	req.local_cm_response_timeout = cma_response_timeout;
>  	req.max_cm_retries = CMA_MAX_CM_RETRIES;
>  	req.srq = id_priv->srq ? 1 : 0;
>  
> @@ -2344,7 +2348,7 @@ static int cma_accept_ib(struct rdma_id_
>  	rep.private_data_len = conn_param->private_data_len;
>  	rep.responder_resources = conn_param->responder_resources;
>  	rep.initiator_depth = conn_param->initiator_depth;
> -	rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT;
> +	rep.target_ack_delay = cma_response_timeout;
>  	rep.failover_accepted = 0;
>  	rep.flow_control = conn_param->flow_control;
>  	rep.rnr_retry_count = conn_param->rnr_retry_count;
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
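
For readers wondering where the "- 8" in the quoted patch comes from: as far
as I know, the IB CM encodes timeouts as 4.096 us * 2^n, and 1 << (n - 8) is
the usual millisecond approximation of that (2^(n+12) ns divided by roughly
2^20). The sketch below shows both forms. The function names and the clamping
are hypothetical and not part of the patch; the clamp is there because a
module parameter below 8 would make the shift count negative, which is
undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Exact timeout in nanoseconds: 4.096 us * 2^exp == 4096 ns << exp. */
static uint64_t cm_timeout_ns(unsigned exp)
{
	assert(exp <= 31);	/* the CM field is 5 bits wide */
	return (uint64_t)4096 << exp;
}

/*
 * Millisecond approximation in the style of the patch: 1 << (exp - 8).
 * The clamping to [8, 31] is an assumption added for safety.
 */
static int cm_timeout_ms_approx(int exp)
{
	if (exp < 8)
		exp = 8;
	if (exp > 31)
		exp = 31;
	return 1 << (exp - 8);
}
```

With the default of 20 this gives 4096 ms against an exact value of about
4295 ms, so the approximation is slightly shorter than the encoded timeout.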


From vlad at dev.mellanox.co.il  Thu Jul 12 02:32:55 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 12 Jul 2007 12:32:55 +0300
Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
In-Reply-To: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
Message-ID: <4695F547.5040001@dev.mellanox.co.il>

Davis, Arlin R wrote:
> Sean,
> 
> OFED 1.2 removed the rdma_set_option call used to adjust response
> timeout. We are running into some cases on larger clusters that require
> longer timeouts then the default. Can you consider this rdma_cm patch
> for OFED 1.2.1 that adds a module parameter for the response timeout?
> Thanks.
> 
> Signed-off by: Arlin Davis <ardavis at ichips.intel.com>
> 

Hi,
This patch was added as kernel_patches/fixes/cma_response_timeout.patch to 
git://git.openfabrics.org/ofed_1_2/linux-2.6.git
Branches: ofed_1_2 and ofed_1_2_c

Regards,
Vladimir


From vlad at lists.openfabrics.org  Thu Jul 12 02:44:38 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 12 Jul 2007 02:44:38 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070712-0200 daily build status
Message-ID: <20070712094438.9F8D6E60871@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:
Build failed on i686 with linux-2.6.22-rc7


From ogerlitz at voltaire.com  Thu Jul 12 03:13:48 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:13:48 +0300 (IDT)
Subject: [ofa-general] ipoib attempting to join on junk MGID for child
	interface
Message-ID: <Pine.LNX.4.64.0707121308570.1897@zuben>

With ipoib debug prints enabled on OFED 1.2 (RH4 U3 i386), I see prints like these:

ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22

Any idea what caused this junk MGID to be used by ipoib?

Or.

Below is a more complete dmesg output. To reproduce this, I just do:


$ ifconfig ib0 up
$ echo 0x8007 > /sys/class/net/ib0/create_child
$ echo 0x8007 > /sys/class/net/ib0/delete_child

waiting a few seconds between each command.

ib0: bringing up interface
ib0: starting multicast thread
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff
ib0: restarting multicast task
ib0: stopping multicast thread
ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0001:ff98:2e61
ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0000:0000:0001
ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: starting multicast thread
ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0)
ib0: Created ah f45e7840
ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV f45e7840, LID 0xc000, SL 0
ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff98:2e61
ib0: join completion for ff12:601b:ffff:0000:0000:0001:ff98:2e61 (status 0)
ib0: Created ah f6a62120
ib0: MGID ff12:601b:ffff:0000:0000:0001:ff98:2e61 AV f6a62120, LID 0xc00a, SL 0
ib0: joining MGID ff12:601b:ffff:0000:0000:0000:0000:0001
ib0: join completion for ff12:601b:ffff:0000:0000:0000:0000:0001 (status 0)
ib0: Created ah f4bd35c0
ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0001 AV f4bd35c0, LID 0xc00b, SL 0
ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001
ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0)
ib0: Created ah f4bd3560
ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV f4bd3560, LID 0xc001, SL 0
ib0: successfully joined all multicast groups
ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0002
ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join
ib0: Created ah f6a62560
ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0002 AV f6a62560, LID 0xc00d, SL 0
ib0: no IPv6 routers present
divert: not allocating divert_blk for non-ethernet device ib0.8007
ip_tables: (C) 2000-2002 Netfilter core team
ib0.8007: bringing up interface
ib0.8007: starting multicast thread
ib0.8007: joining MGID ff12:401b:8007:0000:0000:0000:ffff:ffff
ib0.8007: restarting multicast task
ib0.8007: stopping multicast thread
ib0.8007: adding multicast entry for mgid ff12:601b:8007:0000:0000:0001:ff98:2e61
ib0.8007: adding multicast entry for mgid ff12:601b:8007:0000:0000:0000:0000:0001
ib0.8007: starting multicast thread
ib0.8007: join completion for ff12:401b:8007:0000:0000:0000:ffff:ffff (status 0)
ib0.8007: Created ah f45e7500
ib0.8007: MGID ff12:401b:8007:0000:0000:0000:ffff:ffff AV f45e7500, LID 0xc005, SL 0
ib0.8007: joining MGID ff12:601b:8007:0000:0000:0001:ff98:2e61
ib0.8007: setting up send only multicast group for ff12:601b:8007:0000:0000:0000:0000:0016
ib0.8007: no multicast record for ff12:601b:8007:0000:0000:0000:0000:0016, starting join
ib0.8007: join completion for ff12:601b:8007:0000:0000:0001:ff98:2e61 (status 0)
ib0.8007: Created ah f34fece0
ib0.8007: MGID ff12:601b:8007:0000:0000:0001:ff98:2e61 AV f34fece0, LID 0xc00e, SL 0
ib0.8007: joining MGID ff12:601b:8007:0000:0000:0000:0000:0001
ib0.8007: Created ah f34fee60
ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0016 AV f34fee60, LID 0xc010, SL 0
ib0.8007: join completion for ff12:601b:8007:0000:0000:0000:0000:0001 (status 0)
ib0.8007: Created ah f34fec60
ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0001 AV f34fec60, LID 0xc00c, SL 0
ib0.8007: successfully joined all multicast groups
ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7
ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22
ib0.8007: setting up send only multicast group for ff12:601b:8007:0000:0000:0000:0000:0002
ib0.8007: no multicast record for ff12:601b:8007:0000:0000:0000:0000:0002, starting join
ib0.8007: Created ah f34fec40
ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22
ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0002 AV f34fec40, LID 0xc011, SL 0
ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22
ib0.8007: restarting multicast task
ib0.8007: stopping multicast thread
ib0.8007: adding multicast entry for mgid ff12:401b:8007:0000:0000:0000:0000:0001
ib0.8007: starting multicast thread
ib0.8007: joining MGID ff12:401b:8007:0000:0000:0000:0000:0001
ib0.8007: join completion for ff12:401b:8007:0000:0000:0000:0000:0001 (status 0)
ib0.8007: Created ah f554d6c0
ib0.8007: MGID ff12:401b:8007:0000:0000:0000:0000:0001 AV f554d6c0, LID 0xc006, SL 0
ib0.8007: successfully joined all multicast groups
ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:e89f:e9bf:8011:0406
ib0.8007: no multicast record for ffff:ffff:8007:0000:e89f:e9bf:8011:0406, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:e89f:e9bf:8011:0406, status -22
ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:70ba:e200:8011:0406
ib0.8007: no multicast record for ffff:ffff:8007:0000:70ba:e200:8011:0406, starting join
ib0.8007: multicast join failed for ffff:ffff:8007:0000:70ba:e200:8011:0406, status -22
ib0.8007: no IPv6 routers present
ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0001:ff98:2e61
ib0.8007: stopping interface
ib0.8007: downing ib_dev
ib0.8007: stopping multicast thread
ib0.8007: flushing multicast list
ib0.8007: leaving MGID ff12:601b:8007:0000:0000:0001:ff98:2e61
ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0001:ff98:2e61
ib0.8007: leaving MGID ff12:601b:8007:0000:0000:0000:0000:0001
ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0000:0000:0001
ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0000:0000:0016
ib0.8007: deleting multicast group ffff:ffff:8007:0000:df5b:10c0:24c6:1af7
ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0000:0000:0002
ib0.8007: leaving MGID ff12:401b:8007:0000:0000:0000:0000:0001
ib0.8007: deleting multicast group ff12:401b:8007:0000:0000:0000:0000:0001
ib0.8007: deleting multicast group ffff:ffff:8007:0000:e89f:e9bf:8011:0406
ib0.8007: deleting multicast group ffff:ffff:8007:0000:70ba:e200:8011:0406
ib0.8007: leaving MGID ff12:401b:8007:0000:0000:0000:ffff:ffff
ib0.8007: deleting multicast group ff12:401b:8007:0000:0000:0000:ffff:ffff
ib0.8007: All sends and receives done.
divert: no divert_blk to free, ib0.8007 not ethernet
ib0.8007: cleaning up ib_dev
ib0.8007: stopping multicast thread
ib0.8007: flushing multicast list
ib0.8007: Cleanup ipoib connected mode.
ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0000:0000:0002
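
To make the difference between the good and the junk MGIDs in the log above
explicit, here is a small standalone C sketch. It assumes the IPoIB MGID
layout as I understand it from RFC 4391: 0xff, a flags/scope byte, a 16-bit
signature (0x401b for IPv4, 0x601b for IPv6), the 16-bit P_Key, then the group
ID. The function names are hypothetical and this is not the ipoib driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the IPv4 broadcast MGID for a P_Key, e.g. ff12:401b:8007:...:ffff:ffff. */
static void ipoib_bcast_mgid(uint16_t pkey, uint8_t mgid[16])
{
	static const uint8_t tmpl[16] = {
		0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
	};

	assert(mgid != NULL);
	memcpy(mgid, tmpl, sizeof(tmpl));
	mgid[4] = (uint8_t)(pkey >> 8);		/* P_Key, high byte */
	mgid[5] = (uint8_t)(pkey & 0xff);	/* P_Key, low byte */
}

/* Return 1 if the MGID starts with 0xff and carries an IPoIB signature. */
static int ipoib_mgid_plausible(const uint8_t mgid[16])
{
	uint16_t sig = (uint16_t)((mgid[2] << 8) | mgid[3]);

	if (mgid[0] != 0xff)
		return 0;
	return sig == 0x401b || sig == 0x601b;
}
```

By this check, ffff:ffff:8007:0000:df5b:10c0:24c6:1af7 fails on the signature
field (0xffff) even though the P_Key field holds the expected 0x8007.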


From mst at dev.mellanox.co.il  Thu Jul 12 03:23:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 13:23:37 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <Pine.LNX.4.64.0707121308570.1897@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
Message-ID: <20070712102326.GC12325@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: ipoib attempting to join on junk MGID for child interface
> 
> Opening ipoib debug prints, with OFED 1.2 (RH4 U3 i386) I see such prints:
> 
> ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join
> ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22
> 
> any idea what caused this jusk mgid to be used by ipoib?

What does "ip maddr show" give you?

-- 
MST


From ogerlitz at voltaire.com  Thu Jul 12 03:29:55 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:29:55 +0300 (IDT)
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <Pine.LNX.4.64.0707121308570.1897@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
Message-ID: <Pine.LNX.4.64.0707121327590.2106@zuben>

Below is the system view. Note that the distro /sbin/ip util has
a bug and does not display the correct hw address (i.e. the one that the device
sees, which is present in /proc/net/dev_mcast).

[root at rain1 ogerlitz]# cat /proc/net/dev_mcast
2    eth0            1     0     01005e000001
2    eth0            1     0     3333ff287e76
2    eth0            1     0     333300000001
5    ib0             1     0     00ffffffff12601b0000000000000001ff982e61
5    ib0             1     0     00ffffffff12601b000000000000000000000001
5    ib0             1     0     00ffffffff12401b000000000000000000000001
14   ib0.8007        1     0     00ffffffff12601b0000000000000001ff982e61
14   ib0.8007        1     0     00ffffffff12601b000000000000000000000001

[root at rain1 ogerlitz]# ip maddr show ib0
5:      ib0
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:39:00:00:00
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:39:00:00:00
        link  00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:39:00:00:00
        inet  224.0.0.1
        inet6 ff02::1:ff98:2e61
        inet6 ff02::1

[root at rain1 ogerlitz]# ip maddr show ib0.8007
14:     ib0.8007
        link  00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:39:00:00:00
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:39:00:00:00
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:39:00:00:00
        inet  224.0.0.1
        inet6 ff02::1:ff98:2e61
        inet6 ff02::1
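
For reference, the 40 hex digits that /proc/net/dev_mcast prints for an IPoIB
device can be split by hand or with a few lines of C. The sketch below assumes
the 20-byte IPoIB link-layer address layout as I understand it (1 flags byte,
3 bytes of QPN, 16 bytes of GID); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or return -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse a 40-digit hex string into a 20-byte IPoIB hw address.
 * Returns 0 on success, -1 on malformed input.
 * On success out[1..3] is the QPN and out[4..19] is the MGID.
 */
static int parse_ipoib_hwaddr(const char *hex, uint8_t out[20])
{
	size_t i;

	assert(hex != NULL && out != NULL);
	if (strlen(hex) != 40)
		return -1;
	for (i = 0; i < 20; i++) {
		int hi = hex_nibble(hex[2 * i]);
		int lo = hex_nibble(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i] = (uint8_t)((hi << 4) | lo);
	}
	return 0;
}
```

Applied to 00ffffffff12601b0000000000000001ff982e61, this yields QPN 0xffffff
and MGID ff12:601b:0000:0000:0000:0001:ff98:2e61, with the P_Key field still
zero at this level.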


From mst at dev.mellanox.co.il  Thu Jul 12 03:34:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 13:34:27 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <Pine.LNX.4.64.0707121327590.2106@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
Message-ID: <20070712103427.GD12325@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: ipoib attempting to join on junk MGID for child interface
> 
> below is the system view, note that the disto /sbin/ip util has
> a bug displaying the correct hw address (ie the one that the device
> sees - which is present in /proc/net/dev_mcast)

Which distro is this?

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 12 03:42:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 13:42:05 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <Pine.LNX.4.64.0707121327590.2106@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
Message-ID: <20070712104205.GE12325@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: ipoib attempting to join on junk MGID for child interface
> 
> below is the system view, note that the disto /sbin/ip util has
> a bug displaying the correct hw address (ie the one that the device
> sees - which is present in /proc/net/dev_mcast)

You can use the one supplied with ofed instead.

-- 
MST


From ogerlitz at voltaire.com  Thu Jul 12 03:52:36 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:52:36 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <20070712104205.GE12325@mellanox.co.il>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>	<Pine.LNX.4.64.0707121327590.2106@zuben>
	<20070712104205.GE12325@mellanox.co.il>
Message-ID: <469607F4.808@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
>> Subject: Re: ipoib attempting to join on junk MGID for child interface
>>
>> below is the system view, note that the distro /sbin/ip util has
>> a bug displaying the correct hw address (ie the one that the device
>> sees - which is present in /proc/net/dev_mcast)
> 
> You can use the one supplied with ofed instead.

Whatever; /proc/net/dev_mcast gives you the full picture from the
kernel's point of view.

Also, is there any chance you would be pushing the /sbin/ip changes to
the maintainer of the package that contains it?

Or.


From ogerlitz at voltaire.com  Thu Jul 12 03:56:21 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:56:21 +0300 (IDT)
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6
	crash dump
In-Reply-To: <Pine.LNX.4.64.0707121327590.2106@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
Message-ID: <Pine.LNX.4.64.0707121350570.2373@zuben>

OK, I did some checks with the upstream kernel: the junk mkey for child interface
phenomenon does not reproduce, which probably means it's either an ofed or an RH4
kernel issue.

However, I started on 2.6.21-rc6, under which I saw the crash below. It
does not reproduce now under 2.6.22. Was there any fix that you are aware
of around this area of the code?

Or.


ib0.8007: bringing up interface
ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<6>ADDRCONF(NETDEV_UP): ib0.8007: link is not ready
ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: stopping interface
ib0.8007: downing ib_dev
ib0.8007: stopping multicast thread
ib0.8007: flushing multicast list
Unable to handle kernel NULL pointer dereference at 0000000000000070 RIP:
 [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
PGD 36305067 PUD 3b8fb067 PMD 0
Oops: 0002 [1] SMP
CPU 1
Modules linked in: ib_ipoib ib_cm ib_sa ipv6 ib_mthca ib_mad ib_core sg st sd_mod sr_mod scsi_mod e100 i2c_amd8111 i2c_amd756 i2c_core
Pid: 12633, comm: ifconfig Not tainted 2.6.21-rc6 #2
RIP: 0010:[<ffffffff8045d84e>]  [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
RSP: 0018:ffff810026dcbc50  EFLAGS: 00010092
RAX: 0000000000000292 RBX: ffff810016425000 RCX: ffff810016425750
RDX: ffff810026dcbd48 RSI: 0000000000000000 RDI: 0000000000000070
RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000
R10: ffff810000e6b2c0 R11: 0000000000000001 R12: 0000000000000070
R13: 0000000000000000 R14: ffff81003f8c6c00 R15: ffff810016425000
FS:  00002abdfeb7b740(0000) GS:ffff81003f8a7a40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000070 CR3: 000000003897a000 CR4: 00000000000006e0
Process ifconfig (pid: 12633, threadinfo ffff810026dca000, task ffff81003d7537f0)
Stack:  ffffffff880e48f7 0000003000000010 ffff810026dcbd38 ffff810026dcbc78
 ffff81003f8c6c00 ffff810016425000 ffff810016425000 ffff810026dcbd30
 ffff810016425000 ffff810016425700 ffff810016425700 ffff810016425000
Call Trace:
 [<ffffffff880e48f7>] :ib_cm:cm_destroy_id+0x1c/0x25c
 [<ffffffff880f28a3>] :ib_ipoib:ipoib_cm_dev_stop+0x27/0xc5
 [<ffffffff880ed53c>] :ib_ipoib:ipoib_ib_dev_stop+0x25/0x2c3
 [<ffffffff8023f504>] flush_cpu_workqueue+0xb3/0xc1
 [<ffffffff802426cc>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80238f8e>] lock_timer_base+0x1b/0x3c
 [<ffffffff880eea8c>] :ib_ipoib:ipoib_mcast_dev_flush+0x10e/0x159
 [<ffffffff880ece93>] :ib_ipoib:ipoib_flush_paths+0x34/0x15a
 [<ffffffff880eb927>] :ib_ipoib:ipoib_stop+0x63/0xef
 [<ffffffff8040d237>] dev_close+0x58/0x77
 [<ffffffff8040c2f6>] dev_change_flags+0x57/0x119
 [<ffffffff80443f24>] devinet_ioctl+0x265/0x5cd
 [<ffffffff80444f8e>] inet_ioctl+0x3f/0x5e
 [<ffffffff80402cae>] sock_ioctl+0x16c/0x189
 [<ffffffff80284ecd>] do_ioctl+0x29/0x6f
 [<ffffffff80285187>] vfs_ioctl+0x274/0x285
 [<ffffffff802851d4>] sys_ioctl+0x3c/0x60
 [<ffffffff802093ce>] system_call+0x7e/0x83


Code: f0 ff 0f 79 1b a9 00 02 00 00 74 0b fb f3 90 83 3f 00 7e f9
RIP  [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
 RSP <ffff810026dcbc50>
CR2: 0000000000000070
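
A note on reading the dump above: the faulting address (CR2) and RDI are both
0x70, and the fault is inside _spin_lock_irqsave called from cm_destroy_id.
That pattern is what one would expect if a lock embedded in a structure was
reached through a NULL base pointer, so that the address is simply the field
offset. This is only a hypothesis from the registers, not a confirmed root
cause. A minimal user-space sketch of the arithmetic follows; the structure
layout and all demo_* names are made up for illustration and are not the real
ib_cm definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a spinlock; not the kernel type. */
struct demo_lock {
	int raw;
};

/*
 * Hypothetical layout, NOT the real ib_cm private structure. The padding
 * is chosen so that the lock lands at offset 0x70, matching CR2 above.
 */
struct demo_cm_id_priv {
	unsigned char other_fields[0x70];
	struct demo_lock lock;
};

/*
 * Address the CPU would touch when taking &priv->lock, given the numeric
 * value of the base pointer. With base == 0 this is just the field offset.
 */
static uintptr_t demo_lock_address(uintptr_t base)
{
	/* Guard against wrap-around in the illustration itself. */
	assert(base <= UINTPTR_MAX - offsetof(struct demo_cm_id_priv, lock));
	return base + offsetof(struct demo_cm_id_priv, lock);
}
```

With a base of 0 the helper returns 0x70, which is the value seen in CR2, RDI
and R12 in the dump.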


From mst at dev.mellanox.co.il  Thu Jul 12 04:01:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 14:01:12 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6
	crash dump
In-Reply-To: <Pine.LNX.4.64.0707121350570.2373@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
	<Pine.LNX.4.64.0707121350570.2373@zuben>
Message-ID: <20070712110111.GF12325@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump
> 
> OK, I did some checks with the upstream kernel: the junk mkey for child interface
> phenomenon does not reproduce, which probably means it's either an ofed or an RH4
> kernel issue.
> 
> However, I started on 2.6.21-rc6, under which I saw the crash below. It
> does not reproduce now under 2.6.22. Was there any fix that you are aware
> of around this area of the code?

Not directly here, but 841adfca9c5fc0fec6b1f0b2e5eb7a3b239a7730
fixed a bug that might conceivably trigger a double free/memory corruption.

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 12 04:02:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 14:02:06 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child
	interface
In-Reply-To: <469607F4.808@voltaire.com>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
	<20070712104205.GE12325@mellanox.co.il> <469607F4.808@voltaire.com>
Message-ID: <20070712110206.GG12325@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: ipoib attempting to join on junk MGID for child interface
> 
> Also is there any chance you would be pushing the /sbin/ip changes to 
> the maintainer of the package that contains it?

There are no changes; it's just that Red Hat includes an old version of the tool
and ofed packages a newer one.

-- 
MST


From halr at voltaire.com  Thu Jul 12 04:07:34 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 07:07:34 -0400
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID /
	2.6.21-rc6 crash dump
In-Reply-To: <Pine.LNX.4.64.0707121350570.2373@zuben>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>
	<Pine.LNX.4.64.0707121327590.2106@zuben>
	<Pine.LNX.4.64.0707121350570.2373@zuben>
Message-ID: <1184238443.17622.180620.camel@hal.voltaire.com>

On Thu, 2007-07-12 at 06:56, Or Gerlitz wrote:
> OK, I did some checks with the upstream kernel: the junk mkey for child interface
> phenomenon does not reproduce, which probably means it's either an ofed or an RH4
> kernel issue.

FWIW (probably just as a data point to keep in mind), this problem has
been seen and reported on the list quite a while ago. It is extremely
hard to reproduce. No clue as to what causes it.

-- Hal

> However, I started on 2.6.21-rc6, under which I saw the crash below. It
> does not reproduce now under 2.6.22. Was there any fix that you are aware
> of around this area of the code?
> 
> Or.
> 
> 
> ib0.8007: bringing up interface
> ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<6>ADDRCONF(NETDEV_UP): ib0.8007: link is not ready
> ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: stopping interface
> ib0.8007: downing ib_dev
> ib0.8007: stopping multicast thread
> ib0.8007: flushing multicast list
> Unable to handle kernel NULL pointer dereference at 0000000000000070 RIP:
>  [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
> PGD 36305067 PUD 3b8fb067 PMD 0
> Oops: 0002 [1] SMP
> CPU 1
> Modules linked in: ib_ipoib ib_cm ib_sa ipv6 ib_mthca ib_mad ib_core sg st sd_mod sr_mod scsi_mod e100 i2c_amd8111 i2c_amd756 i2c_core
> Pid: 12633, comm: ifconfig Not tainted 2.6.21-rc6 #2
> RIP: 0010:[<ffffffff8045d84e>]  [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
> RSP: 0018:ffff810026dcbc50  EFLAGS: 00010092
> RAX: 0000000000000292 RBX: ffff810016425000 RCX: ffff810016425750
> RDX: ffff810026dcbd48 RSI: 0000000000000000 RDI: 0000000000000070
> RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000
> R10: ffff810000e6b2c0 R11: 0000000000000001 R12: 0000000000000070
> R13: 0000000000000000 R14: ffff81003f8c6c00 R15: ffff810016425000
> FS:  00002abdfeb7b740(0000) GS:ffff81003f8a7a40(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000070 CR3: 000000003897a000 CR4: 00000000000006e0
> Process ifconfig (pid: 12633, threadinfo ffff810026dca000, task ffff81003d7537f0)
> Stack:  ffffffff880e48f7 0000003000000010 ffff810026dcbd38 ffff810026dcbc78
>  ffff81003f8c6c00 ffff810016425000 ffff810016425000 ffff810026dcbd30
>  ffff810016425000 ffff810016425700 ffff810016425700 ffff810016425000
> Call Trace:
>  [<ffffffff880e48f7>] :ib_cm:cm_destroy_id+0x1c/0x25c
>  [<ffffffff880f28a3>] :ib_ipoib:ipoib_cm_dev_stop+0x27/0xc5
>  [<ffffffff880ed53c>] :ib_ipoib:ipoib_ib_dev_stop+0x25/0x2c3
>  [<ffffffff8023f504>] flush_cpu_workqueue+0xb3/0xc1
>  [<ffffffff802426cc>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80238f8e>] lock_timer_base+0x1b/0x3c
>  [<ffffffff880eea8c>] :ib_ipoib:ipoib_mcast_dev_flush+0x10e/0x159
>  [<ffffffff880ece93>] :ib_ipoib:ipoib_flush_paths+0x34/0x15a
>  [<ffffffff880eb927>] :ib_ipoib:ipoib_stop+0x63/0xef
>  [<ffffffff8040d237>] dev_close+0x58/0x77
>  [<ffffffff8040c2f6>] dev_change_flags+0x57/0x119
>  [<ffffffff80443f24>] devinet_ioctl+0x265/0x5cd
>  [<ffffffff80444f8e>] inet_ioctl+0x3f/0x5e
>  [<ffffffff80402cae>] sock_ioctl+0x16c/0x189
>  [<ffffffff80284ecd>] do_ioctl+0x29/0x6f
>  [<ffffffff80285187>] vfs_ioctl+0x274/0x285
>  [<ffffffff802851d4>] sys_ioctl+0x3c/0x60
>  [<ffffffff802093ce>] system_call+0x7e/0x83
> 
> 
> Code: f0 ff 0f 79 1b a9 00 02 00 00 74 0b fb f3 90 83 3f 00 7e f9
> RIP  [<ffffffff8045d84e>] _spin_lock_irqsave+0x3/0x24
>  RSP <ffff810026dcbc50>
> CR2: 0000000000000070
> 


From ogerlitz at voltaire.com  Thu Jul 12 04:52:09 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 14:52:09 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID
	/	2.6.21-rc6 crash dump
In-Reply-To: <1184238443.17622.180620.camel@hal.voltaire.com>
References: <Pine.LNX.4.64.0707121308570.1897@zuben>	
	<Pine.LNX.4.64.0707121327590.2106@zuben>	
	<Pine.LNX.4.64.0707121350570.2373@zuben>
	<1184238443.17622.180620.camel@hal.voltaire.com>
Message-ID: <469615E9.5060907@voltaire.com>

Hal Rosenstock wrote:
> On Thu, 2007-07-12 at 06:56, Or Gerlitz wrote:
>> OK, I did some checks with the upstream kernel: the junk mkey for child interface
>> phenomenon does not reproduce, which probably means it's either an ofed or an RH4
>> kernel issue.
> 
> FWIW (probably just as a data point to keep in mind), this problem has
> been seen and reported on the list quite a while ago. It is extremely
> hard to reproduce. No clue as to what causes it.

It reproduces 100% of the time on my system with RH4 U3; it just goes
silent unless the multicast debug flag is set.

Or.


From halr at voltaire.com  Thu Jul 12 06:18:44 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 09:18:44 -0400
Subject: [ofa-general] [ANNOUCE] New management release
Message-ID: <1184246305.17622.188634.camel@hal.voltaire.com>

Hi,

There are new releases of the management libraries, diags, and OpenSM
built off master (rather than OFED 1.2 branch) available in:

http://www.openfabrics.org/~halr/

md5sum
0f9ec94d981ab381fb123550b4733d83  libibumad-1.1.2.tgz
6e33e38d7a8bdebe7960b057899483f6  libibmad-1.1.1.tgz
512b3d766220d3f757fe6fc4d10e78fe  infiniband-diags-1.3.1.tgz
04b25a2bf782955b3d01214756121f17  opensm-3.1.1.tgz

The existing libibcommon can be used with this (no changes with master):
a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz

-- Hal


From vlad at dev.mellanox.co.il  Thu Jul 12 06:23:55 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 12 Jul 2007 16:23:55 +0300
Subject: [ofa-general] OFED-1.2 release download link
Message-ID: <46962B6B.8050903@dev.mellanox.co.il>

Hi,
OFED-1.2 is currently available at
http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz

OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and RHEL 5.0 
can be downloaded from:
http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/

Note:
On http://www.openfabrics.org/downloads.htm
OFED 1.2 GA link points to the wrong place.


Regards,
Vladimir


From jackm at dev.mellanox.co.il  Thu Jul 12 07:50:45 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 12 Jul 2007 17:50:45 +0300
Subject: [ofa-general] [PATCH v2] mlx4: add device reset to error handling
	mechanism
Message-ID: <200707121750.45629.jackm@dev.mellanox.co.il>

Add device reset to mlx4 Internal Error handling. Also, detect errors
via polling the device error buffer (rather than via interrupt), because
this is more reliable, and we do not wish to support two detection
mechanisms.

This version incorporates suggestions made by Roland:
- the error interrupt is entirely removed.
- this patch uses round_jiffies_relative to reschedule polling timer.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

Index: connectx_kernel/drivers/net/mlx4/catas.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/catas.c	2007-07-12 10:11:34.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/catas.c	2007-07-12 10:11:55.000000000 +0300
@@ -30,15 +30,31 @@
  * SOFTWARE.
  */
 
+#include <linux/timer.h>
+#include <linux/workqueue.h>
 #include "mlx4.h"
 
+enum {
+	MLX4_CATAS_POLL_INTERVAL	= 5 * HZ,
+};
+
+static DEFINE_SPINLOCK(catas_lock);
+
+static LIST_HEAD(catas_list);
+static struct workqueue_struct *catas_wq;
+static struct work_struct catas_work;
+
+static int ierr_reset_disable;
+module_param_named(ierr_reset_disable, ierr_reset_disable, int, 0644);
+MODULE_PARM_DESC(ierr_reset_disable, "disable reset on Internal Error event if nonzero");
+
 void mlx4_handle_catas_err(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
 	int i;
 
-	mlx4_err(dev, "Catastrophic error detected:\n");
+	mlx4_err(dev, "Internal error detected:\n");
 	for (i = 0; i < priv->fw.catas_size; ++i)
 		mlx4_err(dev, "  buf[%02x]: %08x\n",
 			 i, swab32(readl(priv->catas_err.map + i)));
@@ -46,25 +63,119 @@ void mlx4_handle_catas_err(struct mlx4_d
 	mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0);
 }
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev)
+static void catas_reset(struct work_struct *work)
+{
+	struct mlx4_priv *priv, *tmppriv;
+	struct mlx4_dev *dev;
+
+	LIST_HEAD(tlist);
+	int ret;
+
+	spin_lock_irq(&catas_lock);
+	list_splice_init(&catas_list, &tlist);
+	spin_unlock_irq(&catas_lock);
+
+	list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) {
+		ret = mlx4_restart_one(priv->dev.pdev);
+		dev = &priv->dev;
+		if (ret)
+			mlx4_err(dev, "Reset failed (%d)\n", ret);
+		else
+			mlx4_dbg(dev, "Reset succeeded\n");
+	}
+}
+
+static void handle_catas(struct mlx4_dev *dev)
+{
+	unsigned long flags;
+	struct mlx4_priv *priv = mlx4_priv(dev);
+
+	mlx4_handle_catas_err(dev);
+
+	if (ierr_reset_disable)
+		return;
+
+	spin_lock_irqsave(&catas_lock, flags);
+	list_add(&priv->catas_err.list, &catas_list);
+	queue_work(catas_wq, &catas_work);
+	spin_unlock_irqrestore(&catas_lock, flags);
+}
+
+static void poll_catas(unsigned long dev_ptr)
+{
+	struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr;
+	struct mlx4_priv *priv = mlx4_priv(dev);
+	unsigned long flags;
+
+	if (readl(priv->catas_err.map)) {
+		handle_catas(&priv->dev);
+		return;
+	}
+
+	spin_lock_irqsave(&catas_lock, flags);
+	if (!priv->catas_err.stop)
+		mod_timer(&priv->catas_err.timer,
+			  round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL));
+	spin_unlock_irqrestore(&catas_lock, flags);
+
+	return;
+}
+
+void mlx4_start_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	unsigned long addr;
 
+	init_timer(&priv->catas_err.timer);
+	priv->catas_err.stop = 0;
+	priv->catas_err.map  = NULL;
+
 	addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) +
 		priv->fw.catas_offset;
 
 	priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4);
 	if (!priv->catas_err.map)
-		mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n",
+		mlx4_warn(dev, "Failed to map Internal Error buffer at 0x%lx\n",
 			  addr);
 
+	priv->catas_err.timer.data     = (unsigned long) dev;
+	priv->catas_err.timer.function = poll_catas;
+	priv->catas_err.timer.expires  =
+		round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL);
+	INIT_LIST_HEAD(&priv->catas_err.list);
+	add_timer(&priv->catas_err.timer);
 }
 
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev)
+void mlx4_stop_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
+	spin_lock_irq(&catas_lock);
+	priv->catas_err.stop = 1;
+	spin_unlock_irq(&catas_lock);
+
+	del_timer_sync(&priv->catas_err.timer);
+
 	if (priv->catas_err.map)
 		iounmap(priv->catas_err.map);
+
+	spin_lock_irq(&catas_lock);
+	list_del(&priv->catas_err.list);
+	spin_unlock_irq(&catas_lock);
+}
+
+int __init mlx4_catas_init(void)
+{
+	INIT_WORK(&catas_work, catas_reset);
+
+	catas_wq = create_singlethread_workqueue("mlx4_err");
+	if (!catas_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void mlx4_catas_cleanup(void)
+{
+	destroy_workqueue(catas_wq);
 }
Index: connectx_kernel/drivers/net/mlx4/eq.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/eq.c	2007-07-12 10:11:34.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/eq.c	2007-07-12 10:11:55.000000000 +0300
@@ -89,14 +89,12 @@ struct mlx4_eq_context {
 			       (1ull << MLX4_EVENT_TYPE_PATH_MIG_FAILED)    | \
 			       (1ull << MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \
 			       (1ull << MLX4_EVENT_TYPE_WQ_ACCESS_ERROR)    | \
-			       (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR)  | \
 			       (1ull << MLX4_EVENT_TYPE_PORT_CHANGE)	    | \
 			       (1ull << MLX4_EVENT_TYPE_ECC_DETECT)	    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_CATAS_ERROR)    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE)    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_LIMIT)	    | \
 			       (1ull << MLX4_EVENT_TYPE_CMD))
-#define MLX4_CATAS_EVENT_MASK  (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR)
 
 struct mlx4_eqe {
 	u8			reserved1;
@@ -264,7 +262,7 @@ static irqreturn_t mlx4_interrupt(int ir
 
 	writel(priv->eq_table.clr_mask, priv->eq_table.clr_int);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]);
 
 	return IRQ_RETVAL(work);
@@ -281,14 +279,6 @@ static irqreturn_t mlx4_msi_x_interrupt(
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr)
-{
-	mlx4_handle_catas_err(dev_ptr);
-
-	/* MSI-X vectors always belong to us */
-	return IRQ_HANDLED;
-}
-
 static int mlx4_MAP_EQ(struct mlx4_dev *dev, u64 event_mask, int unmap,
 			int eq_num)
 {
@@ -490,11 +480,9 @@ static void mlx4_free_irqs(struct mlx4_d
 
 	if (eq_table->have_irq)
 		free_irq(dev->pdev->irq, dev);
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		if (eq_table->eq[i].have_irq)
 			free_irq(eq_table->eq[i].irq, eq_table->eq + i);
-	if (eq_table->eq[MLX4_EQ_CATAS].have_irq)
-		free_irq(eq_table->eq[MLX4_EQ_CATAS].irq, dev);
 }
 
 static int __devinit mlx4_map_clr_int(struct mlx4_dev *dev)
@@ -598,32 +586,19 @@ int __devinit mlx4_init_eq_table(struct 
 	if (dev->flags & MLX4_FLAG_MSI_X) {
 		static const char *eq_name[] = {
 			[MLX4_EQ_COMP]  = DRV_NAME " (comp)",
-			[MLX4_EQ_ASYNC] = DRV_NAME " (async)",
-			[MLX4_EQ_CATAS] = DRV_NAME " (catas)"
+			[MLX4_EQ_ASYNC] = DRV_NAME " (async)"
 		};
 
-		err = mlx4_create_eq(dev, 1, MLX4_EQ_CATAS,
-				     &priv->eq_table.eq[MLX4_EQ_CATAS]);
-		if (err)
-			goto err_out_async;
-
-		for (i = 0; i < MLX4_EQ_CATAS; ++i) {
+		for (i = 0; i < MLX4_NUM_EQ; ++i) {
 			err = request_irq(priv->eq_table.eq[i].irq,
 					  mlx4_msi_x_interrupt,
 					  0, eq_name[i], priv->eq_table.eq + i);
 			if (err)
-				goto err_out_catas;
+				goto err_out_async;
 
 			priv->eq_table.eq[i].have_irq = 1;
 		}
 
-		err = request_irq(priv->eq_table.eq[MLX4_EQ_CATAS].irq,
-				  mlx4_catas_interrupt, 0,
-				  eq_name[MLX4_EQ_CATAS], dev);
-		if (err)
-			goto err_out_catas;
-
-		priv->eq_table.eq[MLX4_EQ_CATAS].have_irq = 1;
 	} else {
 		err = request_irq(dev->pdev->irq, mlx4_interrupt,
 				  IRQF_SHARED, DRV_NAME, dev);
@@ -639,22 +614,11 @@ int __devinit mlx4_init_eq_table(struct 
 		mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n",
 			   priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		eq_set_ci(&priv->eq_table.eq[i], 1);
 
-	if (dev->flags & MLX4_FLAG_MSI_X) {
-		err = mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 0,
-				  priv->eq_table.eq[MLX4_EQ_CATAS].eqn);
-		if (err)
-			mlx4_warn(dev, "MAP_EQ for catas EQ %d failed (%d)\n",
-				  priv->eq_table.eq[MLX4_EQ_CATAS].eqn, err);
-	}
-
 	return 0;
 
-err_out_catas:
-	mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]);
-
 err_out_async:
 	mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]);
 
@@ -675,19 +639,13 @@ void mlx4_cleanup_eq_table(struct mlx4_d
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	int i;
 
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 1,
-			    priv->eq_table.eq[MLX4_EQ_CATAS].eqn);
-
 	mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 1,
 		    priv->eq_table.eq[MLX4_EQ_ASYNC].eqn);
 
 	mlx4_free_irqs(dev);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		mlx4_free_eq(dev, &priv->eq_table.eq[i]);
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]);
 
 	mlx4_unmap_clr_int(dev);
 
Index: connectx_kernel/drivers/net/mlx4/intf.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/intf.c	2007-07-12 10:11:34.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/intf.c	2007-07-12 10:11:55.000000000 +0300
@@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev
 		mlx4_add_device(intf, priv);
 
 	mutex_unlock(&intf_mutex);
+	mlx4_start_catas_poll(dev);
 
 	return 0;
 }
@@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_interface *intf;
 
+	mlx4_stop_catas_poll(dev);
 	mutex_lock(&intf_mutex);
 
 	list_for_each_entry(intf, &intf_list, list)
Index: connectx_kernel/drivers/net/mlx4/main.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/main.c	2007-07-12 10:11:34.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/main.c	2007-07-12 10:11:55.000000000 +0300
@@ -583,13 +583,11 @@ static int __devinit mlx4_setup_hca(stru
 		goto err_pd_table_free;
 	}
 
-	mlx4_map_catas_buf(dev);
-
 	err = mlx4_init_eq_table(dev);
 	if (err) {
 		mlx4_err(dev, "Failed to initialize "
 			 "event queue table, aborting.\n");
-		goto err_catas_buf;
+		goto err_mr_table_free;
 	}
 
 	err = mlx4_cmd_use_events(dev);
@@ -659,8 +657,7 @@ err_cmd_poll:
 err_eq_table_free:
 	mlx4_cleanup_eq_table(dev);
 
-err_catas_buf:
-	mlx4_unmap_catas_buf(dev);
+err_mr_table_free:
 	mlx4_cleanup_mr_table(dev);
 
 err_pd_table_free:
@@ -849,9 +846,6 @@ err_cleanup:
 	mlx4_cleanup_cq_table(dev);
 	mlx4_cmd_use_polling(dev);
 	mlx4_cleanup_eq_table(dev);
-
-	mlx4_unmap_catas_buf(dev);
-
 	mlx4_cleanup_mr_table(dev);
 	mlx4_cleanup_pd_table(dev);
 	mlx4_cleanup_uar_table(dev);
@@ -899,9 +893,6 @@ static void __devexit mlx4_remove_one(st
 		mlx4_cleanup_cq_table(dev);
 		mlx4_cmd_use_polling(dev);
 		mlx4_cleanup_eq_table(dev);
-
-		mlx4_unmap_catas_buf(dev);
-
 		mlx4_cleanup_mr_table(dev);
 		mlx4_cleanup_pd_table(dev);
 
@@ -922,6 +913,12 @@ static void __devexit mlx4_remove_one(st
 	}
 }
 
+int mlx4_restart_one(struct pci_dev *pdev)
+{
+	mlx4_remove_one(pdev);
+	return mlx4_init_one(pdev, NULL);
+}
+
 static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
@@ -944,6 +941,10 @@ static int __init mlx4_init(void)
 {
 	int ret;
 
+	ret = mlx4_catas_init();
+	if (ret)
+		return ret;
+
 	ret = pci_register_driver(&mlx4_driver);
 	return ret < 0 ? ret : 0;
 }
@@ -951,6 +952,7 @@ static int __init mlx4_init(void)
 static void __exit mlx4_cleanup(void)
 {
 	pci_unregister_driver(&mlx4_driver);
+	mlx4_catas_cleanup();
 }
 
 module_init(mlx4_init);
Index: connectx_kernel/drivers/net/mlx4/mlx4.h
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/mlx4.h	2007-07-12 10:11:34.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/mlx4.h	2007-07-12 10:11:55.000000000 +0300
@@ -67,7 +67,6 @@ enum {
 enum {
 	MLX4_EQ_ASYNC,
 	MLX4_EQ_COMP,
-	MLX4_EQ_CATAS,
 	MLX4_NUM_EQ
 };
 
@@ -248,7 +247,9 @@ struct mlx4_mcg_table {
 
 struct mlx4_catas_err {
 	u32 __iomem	       *map;
-	int			size;
+	u32			stop;
+	struct timer_list	timer;
+	struct list_head	list;
 };
 
 struct mlx4_priv {
@@ -311,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_d
 void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev);
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev);
-
+void mlx4_start_catas_poll(struct mlx4_dev *dev);
+void mlx4_stop_catas_poll(struct mlx4_dev *dev);
+int mlx4_catas_init(void);
+void mlx4_catas_cleanup(void);
+int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type,
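
For readers following the control flow of the polling path in the patch above
(poll_catas and handle_catas), here is a small user-space model of the
decision it makes on each timer tick. It is a sketch only: the demo_* names
and the struct are hypothetical stand-ins, there is no locking, timer, or
workqueue here, and the model just records which branch would be taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of one simulated poll tick (hypothetical names). */
enum demo_poll_result {
	DEMO_REARMED,      /* no error seen; timer would be re-armed */
	DEMO_STOPPED,      /* no error seen; teardown requested, no re-arm */
	DEMO_ERROR_LOGGED, /* error seen; reset disabled by module parameter */
	DEMO_RESET_QUEUED  /* error seen; reset work would be queued */
};

struct demo_dev {
	uint32_t err_buf_word0; /* stands in for readl(priv->catas_err.map) */
	int stop;               /* stands in for priv->catas_err.stop */
	int rearm_count;        /* times the timer would have been re-armed */
	int reset_queued;       /* stands in for list_add() + queue_work() */
};

/*
 * One poll tick. A nonzero first word of the error buffer is treated as an
 * internal error: it is reported and, unless reset_disable is set, a reset
 * is queued. In that case the timer is not re-armed. Otherwise the timer is
 * re-armed unless a stop was requested.
 */
static enum demo_poll_result demo_poll(struct demo_dev *dev, int reset_disable)
{
	assert(dev != NULL);

	if (dev->err_buf_word0) {
		if (reset_disable)
			return DEMO_ERROR_LOGGED;
		dev->reset_queued = 1;
		return DEMO_RESET_QUEUED;
	}

	if (dev->stop)
		return DEMO_STOPPED;

	dev->rearm_count++;
	return DEMO_REARMED;
}
```

The point of the model is that an error tick never re-arms the timer; in the
patch, polling resumes only when the device is registered again after
mlx4_restart_one.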


From tziporet at mellanox.co.il  Thu Jul 12 08:01:08 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 12 Jul 2007 18:01:08 +0300
Subject: [ofa-general] OFED 1.2.c-9 is available
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>

Hi All,

OFED 1.2.c-9 is available now on the OFA server under:
http://www.openfabrics.org/builds/connectx/release/
Note: this release was tested with FW 2.1.000, which will soon be
available on the Mellanox web site for download.

Supported Platforms and Operating Systems
=================================
o CPU architectures:
        - x86_64
        - x86
        - ppc64
        - ia64

o Linux Operating Systems:
        - RedHat EL4 up3: 2.6.9-34.ELsmp
        - RedHat EL4 up4: 2.6.9-42.ELsmp
        - RedHat EL4 up5: 2.6.9-55.ELsmp
        - RedHat EL5: 2.6.18-8.el5
        - SLES10: 2.6.16.21-0.8-smp
        - kernel.org: 2.6.20.x
        - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)

Main changes from OFED 1.2.c-8:
=========================
1. Fixed a kernel oops in IPoIB on restart of the driver.
2. IPoIB CM is now the default.
3. MPI with SRQ is supported.
4. Itanium is now supported.

mlx4 Fixed Bugs and Enhancements
===========================
- Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428.
- Query QP and query SRQ are now supported.
- Internal error flow was added.
- Number of QPs that can be attached to the same multicast group was
increased to 56.
- SRQ is now supported.
- Fork is now supported.

ConnectX specific known issues and limitations
===================================
- The following commands and/or features are not supported:
  o Resize CQ
  o FMRs
  o APM
  o SQD
- ibstat does not present all entries. Use ibv_devinfo instead.
- To load the driver on machines with a 64KB default page size, the UAR BAR
  must be enlarged. A 64KB page size is the default on PPC with RHEL5 and on
  Itanium with 64KB page size enabled.
  Perform the following three steps:
  1. Add the following line in the firmware configuration (INI) file under
     the [HCA] section:
       log2_uar_bar_megabytes = 5
  2. Burn a modified firmware image with the changed INI file
  3. Reboot the system
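
  As a rough illustration of what the setting above asks for: reading the
  parameter name as "base-2 logarithm of the UAR BAR size in megabytes"
  (an interpretation of the name, not something stated in this note), a
  value of 5 means a 32 MB BAR. The sketch below does that arithmetic and,
  under the further assumption that each UAR occupies one system page,
  shows how many pages fit. The demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* BAR size in bytes for a given log2-of-megabytes value. */
static uint64_t demo_uar_bar_bytes(unsigned log2_megabytes)
{
	/* Keep the shift well inside 64 bits: 2^43 MB already equals 2^63. */
	assert(log2_megabytes < 44);
	return (UINT64_C(1) << log2_megabytes) << 20;
}

/*
 * Number of page-sized slots in the BAR. Treating one slot as one UAR is
 * an assumption made for illustration only.
 */
static uint64_t demo_uar_pages(unsigned log2_megabytes, uint64_t page_size)
{
	assert(page_size != 0);
	return demo_uar_bar_bytes(log2_megabytes) / page_size;
}
```

  With log2_megabytes = 5 the BAR is 33554432 bytes, which holds 512 pages
  of 64KB compared with 8192 pages of 4KB.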


Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380



From fenkes at de.ibm.com  Thu Jul 12 08:45:26 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:45:26 +0200
Subject: [ofa-general] [PATCH 00/10] IB/ehca: Multiple Event Queues,
	MR/MW rework, large page MRs, fixes
Message-ID: <200707121745.27592.fenkes@de.ibm.com>

Building on top of the last patch series, this set of patches adds multi-EQ
support, fixes a few nits (including formatting), refactors the MR/MW code
and adds support for large page MRs. Another patch set will follow.

Note that patch 7 will introduce a few lines over 80 chars that will be
unindented in patch 8 - I hope that's okay with you.

The patches, in detail, are:

[01/10] adds support for multiple event queues (i.e. interrupt sources)
[02/10] fixes a problem with HW autodetection
[03/10] \
[04/10] |
[05/10] | These refactor and clean up the MR/MW code. We split them into
[06/10] | bite-sized chunks for easier review of the changes.
[07/10] | 
[08/10] / 
[09/10] fixes a lot of checkpatch.pl warnings
[10/10] adds large page MR support for eHCA2

The patches should apply cleanly, in order, against Roland's git. Please
review the changes and apply the patches if they are okay.

Regards,
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com


From fenkes at de.ibm.com  Thu Jul 12 08:46:35 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:46:35 +0200
Subject: [ofa-general] [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121746.36763.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

The eHCA driver can now handle multiple event queues (read: interrupt
sources) instead of one. The number of available EQs is selected via the
nr_eqs module parameter.

CQs are either assigned to the EQs based on the comp_vector index or, if the
dist_eqs module parameter is supplied, using a round-robin scheme.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
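Note (illustration only, not part of the patch): the CQ-to-EQ assignment
described above can be sketched in user-space C roughly as follows. The
names struct eq_picker and pick_eq_index() are hypothetical, and C11
atomic_fetch_add() stands in for the kernel's atomic_inc_return(); the
driver itself works on struct ehca_shca and returns ERR_PTR(-EINVAL)
for an out-of-range comp_vector.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-adapter state; not the driver's struct. */
struct eq_picker {
	int nr_eqs;             /* number of completion EQs (nr_eqs parameter) */
	int dist_eqs;           /* non-zero: use round-robin (dist_eqs parameter) */
	atomic_uint cur_eq_idx; /* running counter for the round-robin scheme */
};

/*
 * Return the index of the EQ a new CQ should be attached to,
 * or -1 if comp_vector is outside [0, nr_eqs).
 */
int pick_eq_index(struct eq_picker *p, int comp_vector)
{
	unsigned int next;

	if (comp_vector < 0 || comp_vector >= p->nr_eqs)
		return -1;

	/* Default: the caller's comp_vector selects the EQ directly. */
	if (!p->dist_eqs)
		return comp_vector;

	/*
	 * Round-robin: atomic_fetch_add() returns the old value, so add 1
	 * to get the incremented value, as atomic_inc_return() would.
	 */
	next = atomic_fetch_add(&p->cur_eq_idx, 1) + 1;
	return (int)(next % (unsigned int)p->nr_eqs);
}
```

With nr_eqs=2 and dist_eqs=1, successive calls cycle through the two
EQs regardless of the comp_vector passed in, as long as it is in range.
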
 drivers/infiniband/hw/ehca/ehca_classes.h |   13 +++-
 drivers/infiniband/hw/ehca/ehca_cq.c      |   16 +++-
 drivers/infiniband/hw/ehca/ehca_eq.c      |  139 ++++++++++++++++++-----------
 drivers/infiniband/hw/ehca/ehca_irq.c     |   36 +++-----
 drivers/infiniband/hw/ehca/ehca_irq.h     |    8 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    9 +-
 drivers/infiniband/hw/ehca/ehca_main.c    |  118 ++++++++++++++++++++-----
 drivers/infiniband/hw/ehca/ehca_qp.c      |    2 +-
 8 files changed, 233 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index daf823e..b2d614a 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -72,7 +72,11 @@ struct ehca_eqe_cache_entry {
 	struct ehca_cq *cq;
 };
 
+struct ehca_shca;
+
 struct ehca_eq {
+	struct ehca_shca *shca;
+	char name[17];
 	u32 length;
 	struct ipz_queue ipz_queue;
 	struct ipz_eq_handle ipz_eq_handle;
@@ -100,6 +104,7 @@ struct ehca_sport {
 	struct ehca_sma_attr saved_attr;
 };
 
+#define EHCA_MAX_NR_EQS 512
 struct ehca_shca {
 	struct ib_device ib_device;
 	struct ibmebus_dev *ibmebus_dev;
@@ -108,14 +113,16 @@ struct ehca_shca {
 	struct list_head shca_list;
 	struct ipz_adapter_handle ipz_hca_handle;
 	struct ehca_sport sport[2];
-	struct ehca_eq eq;
-	struct ehca_eq neq;
+	struct ehca_eq **eqs;
+	struct ehca_eq *aeq; /* async event for qps */
+	struct ehca_eq *neq;
 	struct ehca_mr *maxmr;
 	struct ehca_pd *pd;
 	struct h_galpas galpas;
 	struct mutex modify_mutex;
 	u64 hca_cap;
 	int max_mtu;
+	atomic_t cur_eq_idx;
 };
 
 struct ehca_pd {
@@ -290,6 +297,8 @@ struct ehca_ucontext {
 
 int ehca_init_pd_cache(void);
 void ehca_cleanup_pd_cache(void);
+int ehca_init_eq_cache(void);
+void ehca_cleanup_eq_cache(void);
 int ehca_init_cq_cache(void);
 void ehca_cleanup_cq_cache(void);
 int ehca_init_qp_cache(void);
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 01d4a14..97da51e 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -117,6 +117,8 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
 			     struct ib_udata *udata)
 {
+	extern int ehca_nr_eqs;
+	extern int ehca_dist_eqs;
 	static const u32 additional_cqe = 20;
 	struct ib_cq *cq;
 	struct ehca_cq *my_cq;
@@ -134,6 +136,12 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	if (cqe >= 0xFFFFFFFF - 64 - additional_cqe)
 		return ERR_PTR(-EINVAL);
 
+	if (comp_vector < 0 || comp_vector >= ehca_nr_eqs) {
+		ehca_err(device, "Invalid comp_vector=%x ehca_nr_eqs=%x",
+			 comp_vector, ehca_nr_eqs);
+		return ERR_PTR(-EINVAL);
+	}
+
 	my_cq = kmem_cache_zalloc(cq_cache, GFP_KERNEL);
 	if (!my_cq) {
 		ehca_err(device, "Out of memory for ehca_cq struct device=%p",
@@ -153,7 +161,13 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	cq = &my_cq->ib_cq;
 
 	adapter_handle = shca->ipz_hca_handle;
-	param.eq_handle = shca->eq.ipz_eq_handle;
+	if (!ehca_dist_eqs)
+		param.eq_handle = shca->eqs[comp_vector]->ipz_eq_handle;
+	else {
+		u32 eq_idx = atomic_inc_return(&shca->cur_eq_idx) % ehca_nr_eqs;
+		param.eq_handle = shca->eqs[eq_idx]->ipz_eq_handle;
+		ehca_dbg(device, "assigned comp_vector=%x", eq_idx);
+	}
 
 	do {
 		if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) {
diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c
index 4961eb8..d443bcb 100644
--- a/drivers/infiniband/hw/ehca/ehca_eq.c
+++ b/drivers/infiniband/hw/ehca/ehca_eq.c
@@ -8,6 +8,7 @@
  *           Reinhard Ernst <rernst at de.ibm.com>
  *           Heiko J Schick <schickhj at de.ibm.com>
  *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
+ *           Joachim Fenkes <fenkes at de.ibm.com>
  *
  *
  *  Copyright (c) 2005 IBM Corporation
@@ -50,40 +51,54 @@
 #include "hcp_if.h"
 #include "ipz_pt_fn.h"
 
-int ehca_create_eq(struct ehca_shca *shca,
-		   struct ehca_eq *eq,
-		   const enum ehca_eq_type type, const u32 length)
+static struct kmem_cache *eq_cache;
+
+struct ehca_eq *ehca_create_eq(struct ehca_shca *shca,
+			       const enum ehca_eq_type type, const u32 length)
 {
-	u64 ret;
+	struct ehca_eq *eq = NULL;
+	int ret;
+	u64 h_ret;
 	u32 nr_pages;
 	u32 i;
 	void *vpage;
 	struct ib_device *ib_dev = &shca->ib_device;
 
-	spin_lock_init(&eq->spinlock);
-	spin_lock_init(&eq->irq_spinlock);
-	eq->is_initialized = 0;
+	if (!length) {
+		ehca_err(ib_dev, "EQ length must not be zero.");
+		return ERR_PTR(-EINVAL);
+	}
 
 	if (type != EHCA_EQ && type != EHCA_NEQ) {
-		ehca_err(ib_dev, "Invalid EQ type %x. eq=%p", type, eq);
-		return -EINVAL;
+		ehca_err(ib_dev, "Invalid EQ type %x", type);
+		return ERR_PTR(-EINVAL);
 	}
-	if (!length) {
-		ehca_err(ib_dev, "EQ length must not be zero. eq=%p", eq);
-		return -EINVAL;
+
+	eq = kmem_cache_zalloc(eq_cache, GFP_KERNEL);
+	if (!eq) {
+		ehca_err(ib_dev, "Out of memory for ehca_eq struct device=%p",
+			 ib_dev);
+		return ERR_PTR(-ENOMEM);
 	}
 
-	ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle,
-				       &eq->pf,
-				       type,
-				       length,
-				       &eq->ipz_eq_handle,
-				       &eq->length,
-				       &nr_pages, &eq->ist);
+	spin_lock_init(&eq->spinlock);
+	spin_lock_init(&eq->irq_spinlock);
+	eq->is_initialized = 0;
+	eq->shca = shca;
 
-	if (ret != H_SUCCESS) {
-		ehca_err(ib_dev, "Can't allocate EQ/NEQ. eq=%p", eq);
-		return -EINVAL;
+	h_ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle,
+					 &eq->pf,
+					 type,
+					 length,
+					 &eq->ipz_eq_handle,
+					 &eq->length,
+					 &nr_pages, &eq->ist);
+
+	if (h_ret != H_SUCCESS) {
+		ehca_err(ib_dev, "Can't allocate EQ/NEQ. eq=%p h_ret=%lx",
+			 eq, h_ret);
+		ret = -EINVAL;
+		goto create_eq_exit0;
 	}
 
 	ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages,
@@ -97,51 +112,51 @@ int ehca_create_eq(struct ehca_shca *shca,
 		u64 rpage;
 
 		if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) {
-			ret = H_RESOURCE;
+			ret = -ENOMEM;
 			goto create_eq_exit2;
 		}
 
 		rpage = virt_to_abs(vpage);
-		ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle,
-					       eq->ipz_eq_handle,
-					       &eq->pf,
-					       0, 0, rpage, 1);
+		h_ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle,
+						 eq->ipz_eq_handle,
+						 &eq->pf,
+						 0, 0, rpage, 1);
 
 		if (i == (nr_pages - 1)) {
 			/* last page */
 			vpage = ipz_qpageit_get_inc(&eq->ipz_queue);
-			if (ret != H_SUCCESS || vpage)
+			if (h_ret != H_SUCCESS || vpage) {
+				ret = -ENOMEM;
 				goto create_eq_exit2;
+			}
 		} else {
-			if (ret != H_PAGE_REGISTERED || !vpage)
+			if (h_ret != H_PAGE_REGISTERED || !vpage) {
+				ret = -ENOMEM;
 				goto create_eq_exit2;
+			}
 		}
 	}
 
 	ipz_qeit_reset(&eq->ipz_queue);
 
 	/* register interrupt handlers and initialize work queues */
-	if (type == EHCA_EQ) {
-		ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq,
-					  IRQF_DISABLED, "ehca_eq",
-					  (void *)shca);
-		if (ret < 0)
-			ehca_err(ib_dev, "Can't map interrupt handler.");
-
-		tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca);
-	} else if (type == EHCA_NEQ) {
-		ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq,
-					  IRQF_DISABLED, "ehca_neq",
-					  (void *)shca);
-		if (ret < 0)
-			ehca_err(ib_dev, "Can't map interrupt handler.");
-
-		tasklet_init(&eq->interrupt_task, ehca_tasklet_neq, (long)shca);
-	}
+	if (type == EHCA_EQ)
+		snprintf(eq->name, sizeof(eq->name), "ehca_eq_%x", eq->ist);
+	else
+		snprintf(eq->name, sizeof(eq->name), "ehca_neq");
+
+	ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt,
+				  IRQF_DISABLED, eq->name, (void *)eq);
+	if (ret < 0)
+		ehca_err(ib_dev, "Can't map interrupt handler.");
+
+	tasklet_init(&eq->interrupt_task,
+		     (type == EHCA_EQ) ? ehca_tasklet_eq : ehca_tasklet_neq,
+		     (long)eq);
 
 	eq->is_initialized = 1;
 
-	return 0;
+	return eq;
 
 create_eq_exit2:
 	ipz_queue_dtor(&eq->ipz_queue);
@@ -149,10 +164,13 @@ create_eq_exit2:
 create_eq_exit1:
 	hipz_h_destroy_eq(shca->ipz_hca_handle, eq);
 
-	return -EINVAL;
+create_eq_exit0:
+	kmem_cache_free(eq_cache, eq);
+
+	return ERR_PTR(ret);
 }
 
-void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq)
+void *ehca_poll_eq(struct ehca_eq *eq)
 {
 	unsigned long flags;
 	void *eqe;
@@ -164,13 +182,14 @@ void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq)
 	return eqe;
 }
 
-int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq)
+int ehca_destroy_eq(struct ehca_eq *eq)
 {
+	struct ehca_shca *shca = eq->shca;
 	unsigned long flags;
 	u64 h_ret;
 
 	spin_lock_irqsave(&eq->spinlock, flags);
-	ibmebus_free_irq(NULL, eq->ist, (void *)shca);
+	ibmebus_free_irq(NULL, eq->ist, (void *)eq);
 
 	h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq);
 
@@ -181,6 +200,24 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq)
 		return -EINVAL;
 	}
 	ipz_queue_dtor(&eq->ipz_queue);
+	kmem_cache_free(eq_cache, eq);
 
 	return 0;
 }
+
+int ehca_init_eq_cache(void)
+{
+	eq_cache = kmem_cache_create("ehca_cache_eq",
+				     sizeof(struct ehca_eq), 0,
+				     SLAB_HWCACHE_ALIGN,
+				     NULL, NULL);
+	if (!eq_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void ehca_cleanup_eq_cache(void)
+{
+	if (eq_cache)
+		kmem_cache_destroy(eq_cache);
+}
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 96eba38..7a4071a 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -389,32 +389,24 @@ static inline void reset_eq_pending(struct ehca_cq *cq)
 	return;
 }
 
-irqreturn_t ehca_interrupt_neq(int irq, void *dev_id)
-{
-	struct ehca_shca *shca = (struct ehca_shca*)dev_id;
-
-	tasklet_hi_schedule(&shca->neq.interrupt_task);
-
-	return IRQ_HANDLED;
-}
-
 void ehca_tasklet_neq(unsigned long data)
 {
-	struct ehca_shca *shca = (struct ehca_shca*)data;
+	struct ehca_eq *neq = (struct ehca_eq *)data;
+	struct ehca_shca *shca = neq->shca;
 	struct ehca_eqe *eqe;
 	u64 ret;
 
-	eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq);
+	eqe = (struct ehca_eqe *)ehca_poll_eq(neq);
 
 	while (eqe) {
 		if (!EHCA_BMASK_GET(NEQE_COMPLETION_EVENT, eqe->entry))
 			parse_ec(shca, eqe->entry);
 
-		eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq);
+		eqe = (struct ehca_eqe *)ehca_poll_eq(neq);
 	}
 
 	ret = hipz_h_reset_event(shca->ipz_hca_handle,
-				 shca->neq.ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL);
+				 neq->ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL);
 
 	if (ret != H_SUCCESS)
 		ehca_err(&shca->ib_device, "Can't clear notification events.");
@@ -422,11 +414,11 @@ void ehca_tasklet_neq(unsigned long data)
 	return;
 }
 
-irqreturn_t ehca_interrupt_eq(int irq, void *dev_id)
+irqreturn_t ehca_interrupt(int irq, void *dev_id)
 {
-	struct ehca_shca *shca = (struct ehca_shca*)dev_id;
+	struct ehca_eq *eq = (struct ehca_eq *)dev_id;
 
-	tasklet_hi_schedule(&shca->eq.interrupt_task);
+	tasklet_hi_schedule(&eq->interrupt_task);
 
 	return IRQ_HANDLED;
 }
@@ -468,9 +460,9 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe)
 	}
 }
 
-void ehca_process_eq(struct ehca_shca *shca, int is_irq)
+void ehca_process_eq(struct ehca_eq *eq, int is_irq)
 {
-	struct ehca_eq *eq = &shca->eq;
+	struct ehca_shca *shca = eq->shca;
 	struct ehca_eqe_cache_entry *eqe_cache = eq->eqe_cache;
 	u64 eqe_value;
 	unsigned long flags;
@@ -498,7 +490,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 	do {
 		u32 token;
 		eqe_cache[eqe_cnt].eqe =
-			(struct ehca_eqe *)ehca_poll_eq(shca, eq);
+			(struct ehca_eqe *)ehca_poll_eq(eq);
 		if (!eqe_cache[eqe_cnt].eqe)
 			break;
 		eqe_value = eqe_cache[eqe_cnt].eqe->entry;
@@ -535,7 +527,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 	}
 	/* check eq */
 	spin_lock(&eq->spinlock);
-	eq_empty = (!ipz_eqit_eq_peek_valid(&shca->eq.ipz_queue));
+	eq_empty = (!ipz_eqit_eq_peek_valid(&eq->ipz_queue));
 	spin_unlock(&eq->spinlock);
 	/* call completion handler for cached eqes */
 	for (i = 0; i < eqe_cnt; i++)
@@ -557,7 +549,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq)
 		goto unlock_irq_spinlock;
 	do {
 		struct ehca_eqe *eqe;
-		eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq);
+		eqe = (struct ehca_eqe *)ehca_poll_eq(eq);
 		if (!eqe)
 			break;
 		process_eqe(shca, eqe);
@@ -569,7 +561,7 @@ unlock_irq_spinlock:
 
 void ehca_tasklet_eq(unsigned long data)
 {
-	ehca_process_eq((struct ehca_shca*)data, 1);
+	ehca_process_eq((struct ehca_eq *)data, 1);
 }
 
 static inline int find_next_online_cpu(struct ehca_comp_pool* pool)
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h
index 3346cb0..18d5397 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.h
+++ b/drivers/infiniband/hw/ehca/ehca_irq.h
@@ -50,12 +50,10 @@ struct ehca_shca;
 
 int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource);
 
-irqreturn_t ehca_interrupt_neq(int irq, void *dev_id);
-void ehca_tasklet_neq(unsigned long data);
-
-irqreturn_t ehca_interrupt_eq(int irq, void *dev_id);
+irqreturn_t ehca_interrupt(int irq, void *dev_id);
 void ehca_tasklet_eq(unsigned long data);
-void ehca_process_eq(struct ehca_shca *shca, int is_irq);
+void ehca_tasklet_neq(unsigned long data);
+void ehca_process_eq(struct ehca_eq *eq, int is_irq);
 
 struct ehca_cpu_comp_task {
 	wait_queue_head_t wait_queue;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 77aeca6..bf8fbf7 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -117,13 +117,12 @@ enum ehca_eq_type {
 	EHCA_NEQ     /* Notification Event Queue */
 };
 
-int ehca_create_eq(struct ehca_shca *shca, struct ehca_eq *eq,
-		   enum ehca_eq_type type, const u32 length);
+struct ehca_eq *ehca_create_eq(struct ehca_shca *shca,
+			       const enum ehca_eq_type type, const u32 length);
 
-int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq);
-
-void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq);
+int ehca_destroy_eq(struct ehca_eq *eq);
 
+void *ehca_poll_eq(struct ehca_eq *eq);
 
 struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 			     struct ib_ucontext *context,
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 28ba2dd..d9a37dc 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -63,6 +63,8 @@ int ehca_port_act_time = 30;
 int ehca_poll_all_eqs  = 1;
 int ehca_static_rate   = -1;
 int ehca_scaling_code  = 0;
+int ehca_nr_eqs        = 2;
+int ehca_dist_eqs      = 0;
 
 module_param_named(open_aqp1,     ehca_open_aqp1,     int, 0);
 module_param_named(debug_level,   ehca_debug_level,   int, 0);
@@ -72,7 +74,9 @@ module_param_named(use_hp_mr,     ehca_use_hp_mr,     int, 0);
 module_param_named(port_act_time, ehca_port_act_time, int, 0);
 module_param_named(poll_all_eqs,  ehca_poll_all_eqs,  int, 0);
 module_param_named(static_rate,   ehca_static_rate,   int, 0);
-module_param_named(scaling_code,   ehca_scaling_code,   int, 0);
+module_param_named(scaling_code,  ehca_scaling_code,  int, 0);
+module_param_named(nr_eqs,        ehca_nr_eqs,        int, 0);
+module_param_named(dist_eqs,      ehca_dist_eqs,      int, 0);
 
 MODULE_PARM_DESC(open_aqp1,
 		 "AQP1 on startup (0: no (default), 1: yes)");
@@ -95,6 +99,11 @@ MODULE_PARM_DESC(static_rate,
 		 "set permanent static rate (default: disabled)");
 MODULE_PARM_DESC(scaling_code,
 		 "set scaling code (0: disabled/default, 1: enabled)");
+MODULE_PARM_DESC(nr_eqs,
+		 "set number of event queues (default : 2)");
+MODULE_PARM_DESC(dist_eqs,
+		 "enable distributing EQs across CQs "
+		 "(0: disabled/default, 1: enabled)");
 
 DEFINE_RWLOCK(ehca_qp_idr_lock);
 DEFINE_RWLOCK(ehca_cq_idr_lock);
@@ -135,6 +144,12 @@ static int ehca_create_slab_caches(void)
 		return ret;
 	}
 
+	ret = ehca_init_eq_cache();
+	if (ret) {
+		ehca_gen_err("Cannot create EQ SLAB cache.");
+		goto create_slab_caches1;
+	}
+
 	ret = ehca_init_cq_cache();
 	if (ret) {
 		ehca_gen_err("Cannot create CQ SLAB cache.");
@@ -182,6 +197,9 @@ create_slab_caches3:
 	ehca_cleanup_cq_cache();
 
 create_slab_caches2:
+	ehca_cleanup_eq_cache();
+
+create_slab_caches1:
 	ehca_cleanup_pd_cache();
 
 	return ret;
@@ -193,6 +211,7 @@ static void ehca_destroy_slab_caches(void)
 	ehca_cleanup_av_cache();
 	ehca_cleanup_qp_cache();
 	ehca_cleanup_cq_cache();
+	ehca_cleanup_eq_cache();
 	ehca_cleanup_pd_cache();
 #ifdef CONFIG_PPC_64K_PAGES
 	if (ctblk_cache)
@@ -362,7 +381,7 @@ int ehca_init_device(struct ehca_shca *shca)
 
 	shca->ib_device.node_type           = RDMA_NODE_IB_CA;
 	shca->ib_device.phys_port_cnt       = shca->num_ports;
-	shca->ib_device.num_comp_vectors    = 1;
+	shca->ib_device.num_comp_vectors    = ehca_nr_eqs;
 	shca->ib_device.dma_device          = &shca->ibmebus_dev->ofdev.dev;
 	shca->ib_device.query_device        = ehca_query_device;
 	shca->ib_device.query_port          = ehca_query_port;
@@ -585,6 +604,15 @@ static ssize_t ehca_show_adapter_handle(struct device *dev,
 }
 static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL);
 
+static ssize_t ehca_show_nr_eqs(struct device *dev,
+				struct device_attribute *attr,
+				char *buf)
+{
+	return sprintf(buf, "%d\n", ehca_nr_eqs);
+}
+
+static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL);
+
 static struct attribute *ehca_dev_attrs[] = {
 	&dev_attr_adapter_handle.attr,
 	&dev_attr_num_ports.attr,
@@ -601,6 +629,7 @@ static struct attribute *ehca_dev_attrs[] = {
 	&dev_attr_cur_mw.attr,
 	&dev_attr_max_pd.attr,
 	&dev_attr_max_ah.attr,
+	&dev_attr_nr_eqs.attr,
 	NULL
 };
 
@@ -608,13 +637,27 @@ static struct attribute_group ehca_dev_attr_grp = {
 	.attrs = ehca_dev_attrs
 };
 
+static void destroy_all_eqs(struct ehca_shca *shca)
+{
+	int ret, i;
+
+	for (i = 0; i < ehca_nr_eqs && shca->eqs[i]; i++) {
+		ret = ehca_destroy_eq(shca->eqs[i]);
+		if (ret)
+			ehca_err(&shca->ib_device, "Cannot destroy EQ "
+				 "ret=%x i=%x eq=%p", ret, i, shca->eqs[i]);
+	}
+
+	kfree(shca->eqs);
+}
+
 static int __devinit ehca_probe(struct ibmebus_dev *dev,
 				const struct of_device_id *id)
 {
 	struct ehca_shca *shca;
 	const u64 *handle;
 	struct ib_pd *ibpd;
-	int ret;
+	int ret, i;
 
 	handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL);
 	if (!handle) {
@@ -648,19 +691,35 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev,
 
 	ret = ehca_init_device(shca);
 	if (ret) {
-		ehca_gen_err("Cannot init ehca  device struct");
+		ehca_gen_err("Cannot init ehca device struct");
 		goto probe1;
 	}
 
 	/* create event queues */
-	ret = ehca_create_eq(shca, &shca->eq, EHCA_EQ, 2048);
-	if (ret) {
-		ehca_err(&shca->ib_device, "Cannot create EQ.");
+	shca->eqs = kzalloc(ehca_nr_eqs * sizeof(*shca->eqs), GFP_KERNEL);
+	if (!shca->eqs) {
+		ehca_gen_err("Cannot alloc eqs array");
 		goto probe1;
 	}
 
-	ret = ehca_create_eq(shca, &shca->neq, EHCA_NEQ, 513);
-	if (ret) {
+	for (i = 0; i < ehca_nr_eqs; i++) {
+		shca->eqs[i] = ehca_create_eq(shca, EHCA_EQ, 2048);
+		if (IS_ERR(shca->eqs[i])) {
+			ehca_err(&shca->ib_device, "Cannot create EQ.");
+			ret = PTR_ERR(shca->eqs[i]);
+			shca->eqs[i] = NULL;
+			goto probe2;
+		}
+	}
+
+	shca->aeq = ehca_create_eq(shca, EHCA_EQ, 2048);
+	if (IS_ERR(shca->aeq)) {
+		ehca_err(&shca->ib_device, "Cannot create AEQ.");
+		goto probe2;
+	}
+
+	shca->neq = ehca_create_eq(shca, EHCA_NEQ, 513);
+	if (IS_ERR(shca->neq)) {
 		ehca_err(&shca->ib_device, "Cannot create NEQ.");
 		goto probe3;
 	}
@@ -747,16 +806,20 @@ probe5:
 			 "Cannot destroy internal PD. ret=%x", ret);
 
 probe4:
-	ret = ehca_destroy_eq(shca, &shca->neq);
+	ret = ehca_destroy_eq(shca->neq);
 	if (ret)
 		ehca_err(&shca->ib_device,
 			 "Cannot destroy NEQ. ret=%x", ret);
 
 probe3:
-	ret = ehca_destroy_eq(shca, &shca->eq);
+	ret = ehca_destroy_eq(shca->aeq);
 	if (ret)
 		ehca_err(&shca->ib_device,
-			 "Cannot destroy EQ. ret=%x", ret);
+			 "Cannot destroy AEQ. ret=%x", ret);
+
+probe2:
+	if (shca->eqs)
+		destroy_all_eqs(shca);
 
 probe1:
 	ib_dealloc_device(&shca->ib_device);
@@ -767,12 +830,11 @@ probe1:
 static int __devexit ehca_remove(struct ibmebus_dev *dev)
 {
 	struct ehca_shca *shca = dev->ofdev.dev.driver_data;
-	int ret;
+	int ret, i;
 
 	sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp);
 
 	if (ehca_open_aqp1 == 1) {
-		int i;
 		for (i = 0; i < shca->num_ports; i++) {
 			ret = ehca_destroy_aqp1(&shca->sport[i]);
 			if (ret)
@@ -794,11 +856,14 @@ static int __devexit ehca_remove(struct ibmebus_dev *dev)
 		ehca_err(&shca->ib_device,
 			 "Cannot destroy internal PD. ret=%x", ret);
 
-	ret = ehca_destroy_eq(shca, &shca->eq);
+	if (shca->eqs)
+		destroy_all_eqs(shca);
+
+	ret = ehca_destroy_eq(shca->aeq);
 	if (ret)
-		ehca_err(&shca->ib_device, "Cannot destroy EQ. ret=%x", ret);
+		ehca_err(&shca->ib_device, "Cannot destroy AEQ. ret=%x", ret);
 
-	ret = ehca_destroy_eq(shca, &shca->neq);
+	ret = ehca_destroy_eq(shca->neq);
 	if (ret)
 		ehca_err(&shca->ib_device, "Canot destroy NEQ. ret=%x", ret);
 
@@ -829,16 +894,20 @@ static struct ibmebus_driver ehca_driver = {
 
 void ehca_poll_eqs(unsigned long data)
 {
+	extern int ehca_nr_eqs;
 	struct ehca_shca *shca;
 
 	spin_lock(&shca_list_lock);
 	list_for_each_entry(shca, &shca_list, shca_list) {
-		if (shca->eq.is_initialized) {
-			/* call deadman proc only if eq ptr does not change */
-			struct ehca_eq *eq = &shca->eq;
+		int i;
+		for (i = 0; i < ehca_nr_eqs; i++) {
+			struct ehca_eq *eq = shca->eqs[i];
 			int max = 3;
 			volatile u64 q_ofs, q_ofs2;
 			u64 flags;
+			if (!eq || !eq->is_initialized)
+				continue;
+			/* call deadman proc only if eq ptr does not change */
 			spin_lock_irqsave(&eq->spinlock, flags);
 			q_ofs = eq->ipz_queue.current_q_offset;
 			spin_unlock_irqrestore(&eq->spinlock, flags);
@@ -849,7 +918,7 @@ void ehca_poll_eqs(unsigned long data)
 				max--;
 			} while (q_ofs == q_ofs2 && max > 0);
 			if (q_ofs == q_ofs2)
-				ehca_process_eq(shca, 0);
+				ehca_process_eq(eq, 0);
 		}
 	}
 	mod_timer(&poll_eqs_timer, jiffies + HZ);
@@ -863,6 +932,13 @@ int __init ehca_module_init(void)
 	printk(KERN_INFO "eHCA Infiniband Device Driver "
 	       "(Rel.: SVNEHCA_0023)\n");
 
+	if (ehca_nr_eqs < 1 || ehca_nr_eqs > EHCA_MAX_NR_EQS) {
+		ehca_gen_err("Invalid option nr_eqs=%x. "
+			     "Specify a number in range [1-%d].",
+			     ehca_nr_eqs, EHCA_MAX_NR_EQS);
+		return -EINVAL;
+	}
+
 	if ((ret = ehca_create_comp_pool())) {
 		ehca_gen_err("Cannot create comp pool.");
 		return ret;
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 7467125..f6f4ef6 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -545,7 +545,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 	}
 
 	parms.token = my_qp->token;
-	parms.eq_handle = shca->eq.ipz_eq_handle;
+	parms.eq_handle = shca->aeq->ipz_eq_handle;
 	parms.pd = my_pd->fw_pd;
 	if (my_qp->send_cq)
 		parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle;
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:48:22 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:48:22 +0200
Subject: [ofa-general] [PATCH 03/10] IB/ehca: fix memory leak in error path
	of ehca_get_dma_mr()
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121748.23065.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_mrmw.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index add79bd..98f2531 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -111,6 +111,7 @@ struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
 				     &e_maxmr->ib.ib_mr.lkey,
 				     &e_maxmr->ib.ib_mr.rkey);
 		if (ret) {
+			ehca_mr_delete(e_maxmr);
 			ib_mr = ERR_PTR(ret);
 			goto get_dma_mr_exit0;
 		}
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:47:45 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:47:45 +0200
Subject: [ofa-general] [PATCH 02/10] IB/ehca: Fix HW level autodetection
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121747.46618.fenkes@de.ibm.com>

Autodetection was missing a few HW revisions, causing certain eHCA1
revisions to be treated like eHCA2. Fixed.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
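Note (illustration only, not part of the patch): the new detection
logic amounts to the mapping below. detect_hw_level() and
hw_level_or_default() are hypothetical helper names for this sketch;
the driver does this inline in ehca_sense_attributes() and additionally
prints a warning before falling back to the default level.

```c
/*
 * Map (hcaaver, revid) to a hardware level.
 * Returns 0 if the combination is not recognized.
 */
unsigned int detect_hw_level(unsigned int hcaaver, unsigned int revid)
{
	if (hcaaver == 1) {
		/* eHCA1: revisions 0..3 map to 0x11..0x14, later ones to 0x14 */
		if (revid <= 3)
			return 0x10 | (revid + 1);
		return 0x14;
	}

	if (hcaaver == 2) {
		/* eHCA2: only a few known revision IDs */
		if (revid == 0)
			return 0x21;
		if (revid == 0x10)
			return 0x22;
		if (revid == 0x20 || revid == 0x21)
			return 0x23;
	}

	return 0;
}

/* Fall back to level 0x22 when detection fails, as the driver does. */
unsigned int hw_level_or_default(unsigned int hcaaver, unsigned int revid)
{
	unsigned int level = detect_hw_level(hcaaver, revid);

	return level ? level : 0x22;
}
```

The point of the fix is visible in the first branch: any hcaaver == 1
adapter now gets a 0x1x level, so it can no longer fall through to the
eHCA2 default of 0x22.
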
 drivers/infiniband/hw/ehca/ehca_main.c |   29 +++++++++++++++++------------
 1 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index d9a37dc..57c551e 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -282,22 +282,27 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 
 		ehca_gen_dbg(" ... hardware version=%x:%x", hcaaver, revid);
 
-		if ((hcaaver == 1) && (revid == 0))
-			shca->hw_level = 0x11;
-		else if ((hcaaver == 1) && (revid == 1))
-			shca->hw_level = 0x12;
-		else if ((hcaaver == 1) && (revid == 2))
-			shca->hw_level = 0x13;
-		else if ((hcaaver == 2) && (revid == 0))
-			shca->hw_level = 0x21;
-		else if ((hcaaver == 2) && (revid == 0x10))
-			shca->hw_level = 0x22;
-		else {
+		if (hcaaver == 1) {
+			if (revid <= 3)
+				shca->hw_level = 0x10 | (revid + 1);
+			else
+				shca->hw_level = 0x14;
+		} else if (hcaaver == 2) {
+			if (revid == 0)
+				shca->hw_level = 0x21;
+			else if (revid == 0x10)
+				shca->hw_level = 0x22;
+			else if (revid == 0x20 || revid == 0x21)
+				shca->hw_level = 0x23;
+		}
+
+		if (!shca->hw_level) {
 			ehca_gen_warn("unknown hardware version"
 				      " - assuming default level");
 			shca->hw_level = 0x22;
 		}
-	}
+	} else
+		shca->hw_level = ehca_hw_level;
 	ehca_gen_dbg(" ... hardware level=%x", shca->hw_level);
 
 	shca->sport[0].rate = IB_RATE_30_GBPS;
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:49:02 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:49:02 +0200
Subject: [ofa-general] [PATCH 04/10] IB/ehca: use common error code mapping
	instead of specific ones
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121749.03556.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Instead of one error mapping function for each potential error source in
ehca_mrmw.c, use a single centralized function that handles all cases. This
removes well over a hundred lines.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
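Note (illustration only, not part of the patch): a centralized mapping
of this kind has roughly the shape below. The enum hcall_rc values and
map_hcall_rc() are hypothetical stand-ins modeled on the removed
ehca_mrmw_map_hrc_alloc(); the actual ehca2ib_return_code() works on
the hypervisor's H_* codes and may map more cases than shown here.

```c
#include <errno.h>

/* Hypothetical stand-ins for hypervisor return codes (values are made up). */
enum hcall_rc {
	HC_SUCCESS = 0,
	HC_BUSY,
	HC_NO_MEM,
	HC_NOT_ENOUGH_RESOURCES,
	HC_CONSTRAINED,
	HC_PARAMETER
};

/* One mapping for all callers instead of one function per call site. */
int map_hcall_rc(enum hcall_rc rc)
{
	switch (rc) {
	case HC_SUCCESS:              /* successful completion */
		return 0;
	case HC_NOT_ENOUGH_RESOURCES: /* insufficient resources */
	case HC_CONSTRAINED:          /* resource constraint */
	case HC_NO_MEM:
		return -ENOMEM;
	case HC_BUSY:                 /* long busy */
		return -EBUSY;
	default:                      /* anything else is reported as invalid */
		return -EINVAL;
	}
}
```

Every call site then reduces to a single line of the form
ret = map_hcall_rc(h_ret), which is what the hunks below do with
ehca2ib_return_code().
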
 drivers/infiniband/hw/ehca/ehca_mrmw.c  |  195 ++-----------------------------
 drivers/infiniband/hw/ehca/ehca_mrmw.h  |   14 ---
 drivers/infiniband/hw/ehca/ehca_tools.h |    3 +
 3 files changed, 15 insertions(+), 197 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 98f2531..7c1656a 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -537,7 +537,7 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr)
 			 "hca_hndl=%lx mr_hndl=%lx lkey=%x",
 			 h_ret, mr, shca->ipz_hca_handle.handle,
 			 e_mr->ipz_mr_handle.handle, mr->lkey);
-		ret = ehca_mrmw_map_hrc_query_mr(h_ret);
+		ret = ehca2ib_return_code(h_ret);
 		goto query_mr_exit1;
 	}
 	mr_attr->pd               = mr->pd;
@@ -597,7 +597,7 @@ int ehca_dereg_mr(struct ib_mr *mr)
 			 "e_mr=%p hca_hndl=%lx mr_hndl=%lx mr->lkey=%x",
 			 h_ret, shca, e_mr, shca->ipz_hca_handle.handle,
 			 e_mr->ipz_mr_handle.handle, mr->lkey);
-		ret = ehca_mrmw_map_hrc_free_mr(h_ret);
+		ret = ehca2ib_return_code(h_ret);
 		goto dereg_mr_exit0;
 	}
 
@@ -637,7 +637,7 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd)
 		ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%lx "
 			 "shca=%p hca_hndl=%lx mw=%p",
 			 h_ret, shca, shca->ipz_hca_handle.handle, e_mw);
-		ib_mw = ERR_PTR(ehca_mrmw_map_hrc_alloc(h_ret));
+		ib_mw = ERR_PTR(ehca2ib_return_code(h_ret));
 		goto alloc_mw_exit1;
 	}
 	/* successful MW allocation */
@@ -680,7 +680,7 @@ int ehca_dealloc_mw(struct ib_mw *mw)
 			 "mw=%p rkey=%x hca_hndl=%lx mw_hndl=%lx",
 			 h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle,
 			 e_mw->ipz_mw_handle.handle);
-		return ehca_mrmw_map_hrc_free_mw(h_ret);
+		return ehca2ib_return_code(h_ret);
 	}
 	/* successful deallocation */
 	ehca_mw_delete(e_mw);
@@ -923,7 +923,7 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr)
 			 "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x",
 			 h_ret, e_fmr, shca->ipz_hca_handle.handle,
 			 e_fmr->ipz_mr_handle.handle, fmr->lkey);
-		ret = ehca_mrmw_map_hrc_free_mr(h_ret);
+		ret = ehca2ib_return_code(h_ret);
 		goto free_fmr_exit0;
 	}
 	/* successful deregistration */
@@ -964,7 +964,7 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%lx "
 			 "hca_hndl=%lx", h_ret, shca->ipz_hca_handle.handle);
-		ret = ehca_mrmw_map_hrc_alloc(h_ret);
+		ret = ehca2ib_return_code(h_ret);
 		goto ehca_reg_mr_exit0;
 	}
 
@@ -1079,7 +1079,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 					 shca->ipz_hca_handle.handle,
 					 e_mr->ipz_mr_handle.handle,
 					 e_mr->ib.ib_mr.lkey);
-				ret = ehca_mrmw_map_hrc_rrpg_last(h_ret);
+				ret = ehca2ib_return_code(h_ret);
 				break;
 			} else
 				ret = 0;
@@ -1090,7 +1090,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 				 e_mr->ib.ib_mr.lkey,
 				 shca->ipz_hca_handle.handle,
 				 e_mr->ipz_mr_handle.handle);
-			ret = ehca_mrmw_map_hrc_rrpg_notlast(h_ret);
+			ret = ehca2ib_return_code(h_ret);
 			break;
 		} else
 			ret = 0;
@@ -1254,7 +1254,7 @@ int ehca_rereg_mr(struct ehca_shca *shca,
 				 h_ret, e_mr, shca->ipz_hca_handle.handle,
 				 e_mr->ipz_mr_handle.handle,
 				 e_mr->ib.ib_mr.lkey);
-			ret = ehca_mrmw_map_hrc_free_mr(h_ret);
+			ret = ehca2ib_return_code(h_ret);
 			goto ehca_rereg_mr_exit0;
 		}
 		/* clean ehca_mr_t, without changing struct ib_mr and lock */
@@ -1351,7 +1351,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 				 h_ret, e_fmr, shca->ipz_hca_handle.handle,
 				 e_fmr->ipz_mr_handle.handle,
 				 e_fmr->ib.ib_fmr.lkey);
-			ret = ehca_mrmw_map_hrc_free_mr(h_ret);
+			ret = ehca2ib_return_code(h_ret);
 			goto ehca_unmap_one_fmr_exit0;
 		}
 		/* clean ehca_mr_t, without changing lock */
@@ -1420,7 +1420,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 			 shca->ipz_hca_handle.handle,
 			 e_origmr->ipz_mr_handle.handle,
 			 e_origmr->ib.ib_mr.lkey);
-		ret = ehca_mrmw_map_hrc_reg_smr(h_ret);
+		ret = ehca2ib_return_code(h_ret);
 		goto ehca_reg_smr_exit0;
 	}
 	/* successful registration */
@@ -1539,7 +1539,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 			 h_ret, e_origmr, shca->ipz_hca_handle.handle,
 			 e_origmr->ipz_mr_handle.handle,
 			 e_origmr->ib.ib_mr.lkey);
-		return ehca_mrmw_map_hrc_reg_smr(h_ret);
+		return ehca2ib_return_code(h_ret);
 	}
 	/* successful registration */
 	e_newmr->num_pages     = e_origmr->num_pages;
@@ -2043,177 +2043,6 @@ void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl,
 /*----------------------------------------------------------------------*/
 
 /*
- * map HIPZ rc to IB retcodes for MR/MW allocations
- * Used for hipz_mr_reg_alloc and hipz_mw_alloc.
- */
-int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:	             /* successful completion */
-		return 0;
-	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
-	case H_CONSTRAINED:          /* resource constraint */
-	case H_NO_MEM:
-		return -ENOMEM;
-	case H_BUSY:                 /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_alloc() */
-
-/*----------------------------------------------------------------------*/
-
-/*
- * map HIPZ rc to IB retcodes for MR register rpage
- * Used for hipz_h_register_rpage_mr at registering last page
- */
-int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:         /* registration complete */
-		return 0;
-	case H_PAGE_REGISTERED:	/* page registered */
-	case H_ADAPTER_PARM:    /* invalid adapter handle */
-	case H_RH_PARM:         /* invalid resource handle */
-/*	case H_QT_PARM:            invalid queue type */
-	case H_PARAMETER:       /*
-				 * invalid logical address,
-				 * or count zero or greater 512
-				 */
-	case H_TABLE_FULL:      /* page table full */
-	case H_HARDWARE:        /* HCA not operational */
-		return -EINVAL;
-	case H_BUSY:            /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_rrpg_last() */
-
-/*----------------------------------------------------------------------*/
-
-/*
- * map HIPZ rc to IB retcodes for MR register rpage
- * Used for hipz_h_register_rpage_mr at registering one page, but not last page
- */
-int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_PAGE_REGISTERED:	/* page registered */
-		return 0;
-	case H_SUCCESS:         /* registration complete */
-	case H_ADAPTER_PARM:    /* invalid adapter handle */
-	case H_RH_PARM:         /* invalid resource handle */
-/*	case H_QT_PARM:            invalid queue type */
-	case H_PARAMETER:       /*
-				 * invalid logical address,
-				 * or count zero or greater 512
-				 */
-	case H_TABLE_FULL:      /* page table full */
-	case H_HARDWARE:        /* HCA not operational */
-		return -EINVAL;
-	case H_BUSY:            /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_rrpg_notlast() */
-
-/*----------------------------------------------------------------------*/
-
-/* map HIPZ rc to IB retcodes for MR query. Used for hipz_mr_query. */
-int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:	             /* successful completion */
-		return 0;
-	case H_ADAPTER_PARM:         /* invalid adapter handle */
-	case H_RH_PARM:              /* invalid resource handle */
-		return -EINVAL;
-	case H_BUSY:                 /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_query_mr() */
-
-/*----------------------------------------------------------------------*/
-/*----------------------------------------------------------------------*/
-
-/*
- * map HIPZ rc to IB retcodes for freeing MR resource
- * Used for hipz_h_free_resource_mr
- */
-int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:      /* resource freed */
-		return 0;
-	case H_ADAPTER_PARM: /* invalid adapter handle */
-	case H_RH_PARM:      /* invalid resource handle */
-	case H_R_STATE:      /* invalid resource state */
-	case H_HARDWARE:     /* HCA not operational */
-		return -EINVAL;
-	case H_RESOURCE:     /* Resource in use */
-	case H_BUSY:         /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_free_mr() */
-
-/*----------------------------------------------------------------------*/
-
-/*
- * map HIPZ rc to IB retcodes for freeing MW resource
- * Used for hipz_h_free_resource_mw
- */
-int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:	     /* resource freed */
-		return 0;
-	case H_ADAPTER_PARM: /* invalid adapter handle */
-	case H_RH_PARM:      /* invalid resource handle */
-	case H_R_STATE:      /* invalid resource state */
-	case H_HARDWARE:     /* HCA not operational */
-		return -EINVAL;
-	case H_RESOURCE:     /* Resource in use */
-	case H_BUSY:         /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_free_mw() */
-
-/*----------------------------------------------------------------------*/
-
-/*
- * map HIPZ rc to IB retcodes for SMR registrations
- * Used for hipz_h_register_smr.
- */
-int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc)
-{
-	switch (hipz_rc) {
-	case H_SUCCESS:	             /* successful completion */
-		return 0;
-	case H_ADAPTER_PARM:         /* invalid adapter handle */
-	case H_RH_PARM:              /* invalid resource handle */
-	case H_MEM_PARM:             /* invalid MR virtual address */
-	case H_MEM_ACCESS_PARM:      /* invalid access controls */
-	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
-		return -EINVAL;
-	case H_BUSY:                 /* long busy */
-		return -EBUSY;
-	default:
-		return -EINVAL;
-	}
-} /* end ehca_mrmw_map_hrc_reg_smr() */
-
-/*----------------------------------------------------------------------*/
-
-/*
  * MR destructor and constructor
  * used in Reregister MR verb, sets all fields in ehca_mr_t to 0,
  * except struct ib_mr and spinlock
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h
index d936e40..fb69ede 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.h
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h
@@ -121,20 +121,6 @@ void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl);
 void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl,
 			       int *ib_acl);
 
-int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc);
-
-int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc);
-
 void ehca_mr_deletenew(struct ehca_mr *mr);
 
 #endif  /*_EHCA_MRMW_H_*/
diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h
index 03b185f..fd8238b 100644
--- a/drivers/infiniband/hw/ehca/ehca_tools.h
+++ b/drivers/infiniband/hw/ehca/ehca_tools.h
@@ -161,8 +161,11 @@ static inline int ehca2ib_return_code(u64 ehca_rc)
 	switch (ehca_rc) {
 	case H_SUCCESS:
 		return 0;
+	case H_RESOURCE:             /* Resource in use */
 	case H_BUSY:
 		return -EBUSY;
+	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
+	case H_CONSTRAINED:          /* resource constraint */
 	case H_NO_MEM:
 		return -ENOMEM;
 	default:
-- 
1.5.2
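
Editor's sketch (not part of the patch): the change above replaces seven
per-call mapping helpers with the single ehca2ib_return_code(). The
standalone C below mirrors only the cases visible in the ehca_tools.h hunk.
The DEMO_* names and their numeric values are placeholders and are not the
real H_* hcall codes. The default branch is an assumption, because the hunk
ends before the default body; the removed helpers all fell back to -EINVAL.

```c
#include <assert.h>
#include <errno.h>

/*
 * Stand-in hypervisor return codes. The values are placeholders chosen
 * only to be distinct; they are NOT the real H_* values.
 */
enum demo_hrc {
	DEMO_H_SUCCESS = 0,
	DEMO_H_BUSY,
	DEMO_H_RESOURCE,
	DEMO_H_NO_MEM,
	DEMO_H_CONSTRAINED,
	DEMO_H_NOT_ENOUGH_RESOURCES,
	DEMO_H_PARAMETER,
};

/* One mapping shared by every caller, following the cases in the hunk. */
static int demo_hrc_to_errno(enum demo_hrc rc)
{
	switch (rc) {
	case DEMO_H_SUCCESS:
		return 0;
	case DEMO_H_RESOURCE:             /* resource in use */
	case DEMO_H_BUSY:
		return -EBUSY;
	case DEMO_H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
	case DEMO_H_CONSTRAINED:          /* resource constraint */
	case DEMO_H_NO_MEM:
		return -ENOMEM;
	default:
		/* assumption: the default body is not shown in the hunk */
		return -EINVAL;
	}
}
```

One visible consequence of the consolidation: the removed
ehca_mrmw_map_hrc_reg_smr() returned -EINVAL for H_NOT_ENOUGH_RESOURCES,
while the shared mapping returns -ENOMEM for that code.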


From fenkes at de.ibm.com  Thu Jul 12 08:51:04 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:51:04 +0200
Subject: [ofa-general] [PATCH 05/10] IB/ehca: use #define for "pages per
	register_rpage" instead of hardcoded value
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121751.05587.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_mrmw.c |   19 +++++++++++--------
 1 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 7c1656a..1fe4f72 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -48,6 +48,9 @@
 #include "hcp_if.h"
 #include "hipz_hw.h"
 
+/* max number of rpages (per hcall register_rpages) */
+#define MAX_RPAGES 512
+
 static struct kmem_cache *mr_cache;
 static struct kmem_cache *mw_cache;
 
@@ -1027,14 +1030,14 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 	}
 
 	/* max 512 pages per shot */
-	for (i = 0; i < ((pginfo->num_4k + 512 - 1) / 512); i++) {
+	for (i = 0; i < ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES); i++) {
 
-		if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) {
-			rnum = pginfo->num_4k % 512; /* last shot */
+		if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) {
+			rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */
 			if (rnum == 0)
-				rnum = 512;      /* last shot is full */
+				rnum = MAX_RPAGES;      /* last shot is full */
 		} else
-			rnum = 512;
+			rnum = MAX_RPAGES;
 
 		if (rnum > 1) {
 			ret = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage);
@@ -1066,7 +1069,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 						 0, /* pagesize 4k */
 						 0, rpage, rnum);
 
-		if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) {
+		if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) {
 			/*
 			 * check for 'registration complete'==H_SUCCESS
 			 * and for 'page registered'==H_PAGE_REGISTERED
@@ -1215,7 +1218,7 @@ int ehca_rereg_mr(struct ehca_shca *shca,
 	int rereg_3_hcall = 0; /* 1: use 3 hipz calls for reregistration */
 
 	/* first determine reregistration hCall(s) */
-	if ((pginfo->num_4k > 512) || (e_mr->num_4k > 512) ||
+	if ((pginfo->num_4k > MAX_RPAGES) || (e_mr->num_4k > MAX_RPAGES) ||
 	    (pginfo->num_4k > e_mr->num_4k)) {
 		ehca_dbg(&shca->ib_device, "Rereg3 case, pginfo->num_4k=%lx "
 			 "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k);
@@ -1306,7 +1309,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
 
 	/* first check if reregistration hCall can be used for unmap */
-	if (e_fmr->fmr_max_pages > 512) {
+	if (e_fmr->fmr_max_pages > MAX_RPAGES) {
 		rereg_1_hcall = 0;
 		rereg_3_hcall = 1;
 	}
-- 
1.5.2
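
Editor's sketch (not part of the patch): the loop in ehca_reg_mr_rpages()
registers pages in "shots" of at most MAX_RPAGES each, and only the last
shot may be shorter. The standalone C below reproduces just that arithmetic.
The demo_* function names are hypothetical and do not exist in the driver.

```c
#include <assert.h>

/* max number of rpages per register_rpage call, as defined in the patch */
#define MAX_RPAGES 512

/* Number of shots needed to register num_4k pages (rounds up). */
static unsigned long demo_num_shots(unsigned long num_4k)
{
	return (num_4k + MAX_RPAGES - 1) / MAX_RPAGES;
}

/* Number of pages handed to shot i (0-based), following the driver loop. */
static unsigned long demo_pages_in_shot(unsigned long num_4k, unsigned long i)
{
	unsigned long rnum;

	assert(i < demo_num_shots(num_4k));
	if (i == demo_num_shots(num_4k) - 1) {
		rnum = num_4k % MAX_RPAGES;	/* last shot */
		if (rnum == 0)
			rnum = MAX_RPAGES;	/* last shot is full */
	} else {
		rnum = MAX_RPAGES;
	}
	return rnum;
}
```

For example, 1300 pages need three shots of 512, 512 and 276 pages, and
1024 pages need two full shots.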


From fenkes at de.ibm.com  Thu Jul 12 08:51:43 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:51:43 +0200
Subject: [ofa-general] [PATCH 06/10] IB/ehca: use macro to calculate number
	of chunks in a mem block
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121751.44394.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_mrmw.c |   47 ++++++++++++++++---------------
 1 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 1fe4f72..58e8b33 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -48,6 +48,8 @@
 #include "hcp_if.h"
 #include "hipz_hw.h"
 
+#define NUM_CHUNKS(length, chunk_size) \
+	(((length) + ((chunk_size) - 1)) / (chunk_size))
 /* max number of rpages (per hcall register_rpages) */
 #define MAX_RPAGES 512
 
@@ -195,10 +197,10 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 	}
 
 	/* determine number of MR pages */
-	num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size +
-			 PAGE_SIZE - 1) / PAGE_SIZE);
-	num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size +
-			 EHCA_PAGESIZE - 1) / EHCA_PAGESIZE);
+	num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size,
+				  PAGE_SIZE);
+	num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size,
+				  EHCA_PAGESIZE);
 
 	/* register MR on HCA */
 	if (ehca_mr_is_maxmr(size, iova_start)) {
@@ -305,10 +307,9 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt
 	}
 
 	/* determine number of MR pages */
-	num_pages_mr = (((virt % PAGE_SIZE) + length + PAGE_SIZE - 1) /
-			PAGE_SIZE);
-	num_pages_4k = (((virt % EHCA_PAGESIZE) + length + EHCA_PAGESIZE - 1) /
-			EHCA_PAGESIZE);
+	num_pages_mr = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
+	num_pages_4k = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length,
+				  EHCA_PAGESIZE);
 
 	/* register MR on HCA */
 	pginfo.type       = EHCA_MR_PGI_USER;
@@ -462,10 +463,10 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 			ret = -EINVAL;
 			goto rereg_phys_mr_exit1;
 		}
-		num_pages_mr = ((((u64)new_start % PAGE_SIZE) + new_size +
-				 PAGE_SIZE - 1) / PAGE_SIZE);
-		num_pages_4k = ((((u64)new_start % EHCA_PAGESIZE) + new_size +
-				 EHCA_PAGESIZE - 1) / EHCA_PAGESIZE);
+		num_pages_mr = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) +
+					  new_size, PAGE_SIZE);
+		num_pages_4k = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) +
+					  new_size, EHCA_PAGESIZE);
 		pginfo.type           = EHCA_MR_PGI_PHYS;
 		pginfo.num_pages      = num_pages_mr;
 		pginfo.num_4k         = num_pages_4k;
@@ -1030,9 +1031,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 	}
 
 	/* max 512 pages per shot */
-	for (i = 0; i < ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES); i++) {
+	for (i = 0; i < NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES); i++) {
 
-		if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) {
+		if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) {
 			rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */
 			if (rnum == 0)
 				rnum = MAX_RPAGES;      /* last shot is full */
@@ -1069,7 +1070,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 						 0, /* pagesize 4k */
 						 0, rpage, rnum);
 
-		if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) {
+		if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) {
 			/*
 			 * check for 'registration complete'==H_SUCCESS
 			 * and for 'page registered'==H_PAGE_REGISTERED
@@ -1475,10 +1476,10 @@ int ehca_reg_internal_maxmr(
 	iova_start = (u64*)KERNELBASE;
 	ib_pbuf.addr = 0;
 	ib_pbuf.size = size_maxmr;
-	num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size_maxmr +
-			 PAGE_SIZE - 1) / PAGE_SIZE);
-	num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size_maxmr +
-			 EHCA_PAGESIZE - 1) / EHCA_PAGESIZE);
+	num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
+				  PAGE_SIZE);
+	num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE)
+				  + size_maxmr, EHCA_PAGESIZE);
 
 	pginfo.type           = EHCA_MR_PGI_PHYS;
 	pginfo.num_pages      = num_pages_mr;
@@ -1700,8 +1701,8 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr,
 		/* loop over desired phys_buf_array entries */
 		while (i < number) {
 			pbuf   = pginfo->phys_buf_array + pginfo->next_buf;
-			num4k  = ((pbuf->addr % EHCA_PAGESIZE) + pbuf->size +
-				  EHCA_PAGESIZE - 1) / EHCA_PAGESIZE;
+			num4k  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE)
+					    + pbuf->size, EHCA_PAGESIZE);
 			offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
 			while (pginfo->next_4k < offs4k + num4k) {
 				/* sanity check */
@@ -1873,8 +1874,8 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr,
 			goto ehca_set_pagebuf_1_exit0;
 		}
 		tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf;
-		num4k  = ((tmp_pbuf->addr % EHCA_PAGESIZE) + tmp_pbuf->size +
-			  EHCA_PAGESIZE - 1) / EHCA_PAGESIZE;
+		num4k  = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) +
+				    tmp_pbuf->size, EHCA_PAGESIZE);
 		offs4k = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
 		*rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) +
 				     (pginfo->next_4k * EHCA_PAGESIZE));
-- 
1.5.2
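
Editor's sketch (not part of the patch): NUM_CHUNKS() is a round-up
division, and the call sites add the start offset within the first page to
the length before dividing, so a buffer that straddles a page boundary is
counted correctly. The standalone C below illustrates this. DEMO_PAGESIZE
is a stand-in that assumes the 4k hardware page size mentioned in the driver
comments, and demo_pages_spanned() is a hypothetical helper, not a driver
function.

```c
#include <assert.h>
#include <stdint.h>

/* Round-up division, with every macro argument parenthesized. */
#define DEMO_NUM_CHUNKS(length, chunk_size) \
	(((length) + ((chunk_size) - 1)) / (chunk_size))

/* assumption: 4k hardware pages, standing in for EHCA_PAGESIZE */
#define DEMO_PAGESIZE 4096ULL

/* Number of pages touched by a buffer of `size` bytes starting at `addr`. */
static uint64_t demo_pages_spanned(uint64_t addr, uint64_t size)
{
	assert(size > 0);
	return DEMO_NUM_CHUNKS((addr % DEMO_PAGESIZE) + size, DEMO_PAGESIZE);
}
```

A 32-byte buffer starting 16 bytes before a page boundary spans two pages,
although its size alone would round up to one.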


From fenkes at de.ibm.com  Thu Jul 12 08:52:29 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:52:29 +0200
Subject: [ofa-general] [PATCH 07/10] IB/ehca: MR/MW structure refactoring
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121752.30129.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

- Rename struct ehca_mr fields to clearly distinguish between kernel and HW
  page size
- Sort struct ehca_mr_pginfo into a common part and a union containing
  specific fields for physical, user and fast MR
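
Editor's sketch (not part of the original message): the second point
describes a tagged union, where the common counters stay at the top level
and the type field selects which union member is valid. The standalone C
below is a reduced illustration. The demo_* names are hypothetical and the
per-type members are trimmed to a few plain fields, so this is not the
driver's actual layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum demo_pgi_type { DEMO_PGI_PHYS, DEMO_PGI_USER, DEMO_PGI_FMR };

struct demo_pginfo {
	enum demo_pgi_type type;	/* selects the valid union member */
	uint64_t num_kpages;		/* number of kernel pages */
	uint64_t num_hwpages;		/* number of hw pages */
	uint64_t next_hwpage;		/* next hw page to process */
	union {
		struct { int num_phys_buf; uint64_t next_buf; } phy;
		struct { uint64_t next_nmap; } usr;
		struct { uint64_t *page_list; uint64_t next_listelem; } fmr;
	} u;
};

/* Zero everything first, then fill the common part and the FMR member. */
static void demo_pginfo_init_fmr(struct demo_pginfo *pgi, uint64_t *page_list,
				 uint64_t list_len, uint64_t hwpages_per_entry)
{
	memset(pgi, 0, sizeof(*pgi));
	pgi->type = DEMO_PGI_FMR;
	pgi->num_kpages = list_len;
	pgi->num_hwpages = list_len * hwpages_per_entry;
	pgi->u.fmr.page_list = page_list;
}

/* Accessor that checks the tag before touching the union member. */
static uint64_t *demo_pginfo_fmr_list(struct demo_pginfo *pgi)
{
	assert(pgi->type == DEMO_PGI_FMR);
	return pgi->u.fmr.page_list;
}
```

Clearing the structure with memset() before filling it avoids a positional
initializer that would have to track the field order of the structure.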

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |   50 ++--
 drivers/infiniband/hw/ehca/ehca_mrmw.c    |  511 +++++++++++++++--------------
 2 files changed, 284 insertions(+), 277 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index b2d614a..92103df 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -211,8 +211,8 @@ struct ehca_mr {
 	spinlock_t mrlock;
 
 	enum ehca_mr_flag flags;
-	u32 num_pages;		/* number of MR pages */
-	u32 num_4k;		/* number of 4k "page" portions to form MR */
+	u32 num_kpages;		/* number of kernel pages */
+	u32 num_hwpages;	/* number of hw pages to form MR */
 	int acl;		/* ACL (stored here for usage in reregister) */
 	u64 *start;		/* virtual start address (stored here for */
 	                        /* usage in reregister) */
@@ -224,9 +224,6 @@ struct ehca_mr {
 	/* fw specific data */
 	struct ipz_mrmw_handle ipz_mr_handle;	/* MR handle for h-calls */
 	struct h_galpas galpas;
-	/* data for userspace bridge */
-	u32 nr_of_pages;
-	void *pagearray;
 };
 
 struct ehca_mw {
@@ -248,26 +245,29 @@ enum ehca_mr_pgi_type {
 
 struct ehca_mr_pginfo {
 	enum ehca_mr_pgi_type type;
-	u64 num_pages;
-	u64 page_cnt;
-	u64 num_4k;       /* number of 4k "page" portions */
-	u64 page_4k_cnt;  /* counter for 4k "page" portions */
-	u64 next_4k;      /* next 4k "page" portion in buffer/chunk/listelem */
-
-	/* type EHCA_MR_PGI_PHYS section */
-	int num_phys_buf;
-	struct ib_phys_buf *phys_buf_array;
-	u64 next_buf;
-
-	/* type EHCA_MR_PGI_USER section */
-	struct ib_umem *region;
-	struct ib_umem_chunk *next_chunk;
-	u64 next_nmap;
-
-	/* type EHCA_MR_PGI_FMR section */
-	u64 *page_list;
-	u64 next_listelem;
-	/* next_4k also used within EHCA_MR_PGI_FMR */
+	u64 num_kpages;
+	u64 kpage_cnt;
+	u64 num_hwpages;     /* number of hw pages */
+	u64 hwpage_cnt;      /* counter for hw pages */
+	u64 next_hwpage;     /* next hw page in buffer/chunk/listelem */
+
+	union {
+		struct { /* type EHCA_MR_PGI_PHYS section */
+			int num_phys_buf;
+			struct ib_phys_buf *phys_buf_array;
+			u64 next_buf;
+		} phy;
+		struct { /* type EHCA_MR_PGI_USER section */
+			struct ib_umem *region;
+			struct ib_umem_chunk *next_chunk;
+			u64 next_nmap;
+		} usr;
+		struct { /* type EHCA_MR_PGI_FMR section */
+			u64 fmr_pgsize;
+			u64 *page_list;
+			u64 next_listelem;
+		} fmr;
+	} u;
 };
 
 /* output parameters for MR/FMR hipz calls */
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 58e8b33..53b334b 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -150,9 +150,6 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 	struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd);
 
 	u64 size;
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
-	u32 num_pages_mr;
-	u32 num_pages_4k; /* 4k portion "pages" */
 
 	if ((num_phys_buf <= 0) || !phys_buf_array) {
 		ehca_err(pd->device, "bad input values: num_phys_buf=%x "
@@ -196,12 +193,6 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 		goto reg_phys_mr_exit0;
 	}
 
-	/* determine number of MR pages */
-	num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size,
-				  PAGE_SIZE);
-	num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size,
-				  EHCA_PAGESIZE);
-
 	/* register MR on HCA */
 	if (ehca_mr_is_maxmr(size, iova_start)) {
 		e_mr->flags |= EHCA_MR_FLAG_MAXMR;
@@ -213,13 +204,22 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 			goto reg_phys_mr_exit1;
 		}
 	} else {
-		pginfo.type           = EHCA_MR_PGI_PHYS;
-		pginfo.num_pages      = num_pages_mr;
-		pginfo.num_4k         = num_pages_4k;
-		pginfo.num_phys_buf   = num_phys_buf;
-		pginfo.phys_buf_array = phys_buf_array;
-		pginfo.next_4k        = (((u64)iova_start & ~PAGE_MASK) /
-					 EHCA_PAGESIZE);
+		struct ehca_mr_pginfo pginfo;
+		u32 num_kpages;
+		u32 num_hwpages;
+
+		num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size,
+					PAGE_SIZE);
+		num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) +
+					 size, EHCA_PAGESIZE);
+		memset(&pginfo, 0, sizeof(pginfo));
+		pginfo.type = EHCA_MR_PGI_PHYS;
+		pginfo.num_kpages = num_kpages;
+		pginfo.num_hwpages = num_hwpages;
+		pginfo.u.phy.num_phys_buf = num_phys_buf;
+		pginfo.u.phy.phys_buf_array = phys_buf_array;
+		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
+				      EHCA_PAGESIZE);
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags,
 				  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
@@ -254,10 +254,10 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt
 	struct ehca_shca *shca =
 		container_of(pd->device, struct ehca_shca, ib_device);
 	struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd);
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	struct ehca_mr_pginfo pginfo;
 	int ret;
-	u32 num_pages_mr;
-	u32 num_pages_4k; /* 4k portion "pages" */
+	u32 num_kpages;
+	u32 num_hwpages;
 
 	if (!pd) {
 		ehca_gen_err("bad pd=%p", pd);
@@ -307,19 +307,20 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt
 	}
 
 	/* determine number of MR pages */
-	num_pages_mr = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
-	num_pages_4k = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length,
-				  EHCA_PAGESIZE);
+	num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
+	num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length,
+				 EHCA_PAGESIZE);
 
 	/* register MR on HCA */
-	pginfo.type       = EHCA_MR_PGI_USER;
-	pginfo.num_pages  = num_pages_mr;
-	pginfo.num_4k     = num_pages_4k;
-	pginfo.region     = e_mr->umem;
-	pginfo.next_4k	  = e_mr->umem->offset / EHCA_PAGESIZE;
-	pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk,
-					       (&e_mr->umem->chunk_list),
-					       list);
+	memset(&pginfo, 0, sizeof(pginfo));
+	pginfo.type = EHCA_MR_PGI_USER;
+	pginfo.num_kpages = num_kpages;
+	pginfo.num_hwpages = num_hwpages;
+	pginfo.u.usr.region = e_mr->umem;
+	pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE;
+	pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk,
+						     (&e_mr->umem->chunk_list),
+						     list);
 
 	ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd,
 			  &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey);
@@ -365,9 +366,9 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 	struct ehca_pd *new_pd;
 	u32 tmp_lkey, tmp_rkey;
 	unsigned long sl_flags;
-	u32 num_pages_mr = 0;
-	u32 num_pages_4k = 0; /* 4k portion "pages" */
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	u32 num_kpages = 0;
+	u32 num_hwpages = 0;
+	struct ehca_mr_pginfo pginfo;
 	u32 cur_pid = current->tgid;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
@@ -463,17 +464,18 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 			ret = -EINVAL;
 			goto rereg_phys_mr_exit1;
 		}
-		num_pages_mr = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) +
-					  new_size, PAGE_SIZE);
-		num_pages_4k = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) +
-					  new_size, EHCA_PAGESIZE);
-		pginfo.type           = EHCA_MR_PGI_PHYS;
-		pginfo.num_pages      = num_pages_mr;
-		pginfo.num_4k         = num_pages_4k;
-		pginfo.num_phys_buf   = num_phys_buf;
-		pginfo.phys_buf_array = phys_buf_array;
-		pginfo.next_4k        = (((u64)iova_start & ~PAGE_MASK) /
-					 EHCA_PAGESIZE);
+		num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) +
+					new_size, PAGE_SIZE);
+		num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) +
+					 new_size, EHCA_PAGESIZE);
+		memset(&pginfo, 0, sizeof(pginfo));
+		pginfo.type = EHCA_MR_PGI_PHYS;
+		pginfo.num_kpages = num_kpages;
+		pginfo.num_hwpages = num_hwpages;
+		pginfo.u.phy.num_phys_buf = num_phys_buf;
+		pginfo.u.phy.phys_buf_array = phys_buf_array;
+		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
+				      EHCA_PAGESIZE);
 	}
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
 		new_acl = mr_access_flags;
@@ -544,11 +546,11 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr)
 		ret = ehca2ib_return_code(h_ret);
 		goto query_mr_exit1;
 	}
-	mr_attr->pd               = mr->pd;
+	mr_attr->pd = mr->pd;
 	mr_attr->device_virt_addr = hipzout.vaddr;
-	mr_attr->size             = hipzout.len;
-	mr_attr->lkey             = hipzout.lkey;
-	mr_attr->rkey             = hipzout.rkey;
+	mr_attr->size = hipzout.len;
+	mr_attr->lkey = hipzout.lkey;
+	mr_attr->rkey = hipzout.rkey;
 	ehca_mrmw_reverse_map_acl(&hipzout.acl, &mr_attr->mr_access_flags);
 
 query_mr_exit1:
@@ -704,7 +706,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	struct ehca_mr *e_fmr;
 	int ret;
 	u32 tmp_lkey, tmp_rkey;
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	struct ehca_mr_pginfo pginfo;
 
 	/* check other parameters */
 	if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) &&
@@ -750,6 +752,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	e_fmr->flags |= EHCA_MR_FLAG_FMR;
 
 	/* register MR on HCA */
+	memset(&pginfo, 0, sizeof(pginfo));
 	ret = ehca_reg_mr(shca, e_fmr, NULL,
 			  fmr_attr->max_pages * (1 << fmr_attr->page_shift),
 			  mr_access_flags, e_pd, &pginfo,
@@ -788,7 +791,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 		container_of(fmr->device, struct ehca_shca, ib_device);
 	struct ehca_mr *e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr);
 	struct ehca_pd *e_pd = container_of(fmr->pd, struct ehca_pd, ib_pd);
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	struct ehca_mr_pginfo pginfo;
 	u32 tmp_lkey, tmp_rkey;
 
 	if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) {
@@ -814,12 +817,13 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 			  fmr, e_fmr->fmr_map_cnt, e_fmr->fmr_max_maps);
 	}
 
-	pginfo.type      = EHCA_MR_PGI_FMR;
-	pginfo.num_pages = list_len;
-	pginfo.num_4k    = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE);
-	pginfo.page_list = page_list;
-	pginfo.next_4k   = ((iova & (e_fmr->fmr_page_size-1)) /
-			    EHCA_PAGESIZE);
+	memset(&pginfo, 0, sizeof(pginfo));
+	pginfo.type = EHCA_MR_PGI_FMR;
+	pginfo.num_kpages = list_len;
+	pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE);
+	pginfo.u.fmr.page_list = page_list;
+	pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) /
+			      EHCA_PAGESIZE);
 
 	ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova,
 			    list_len * e_fmr->fmr_page_size,
@@ -979,11 +983,11 @@ int ehca_reg_mr(struct ehca_shca *shca,
 		goto ehca_reg_mr_exit1;
 
 	/* successful registration */
-	e_mr->num_pages = pginfo->num_pages;
-	e_mr->num_4k    = pginfo->num_4k;
-	e_mr->start     = iova_start;
-	e_mr->size      = size;
-	e_mr->acl       = acl;
+	e_mr->num_kpages = pginfo->num_kpages;
+	e_mr->num_hwpages = pginfo->num_hwpages;
+	e_mr->start = iova_start;
+	e_mr->size = size;
+	e_mr->acl = acl;
 	*lkey = hipzout.lkey;
 	*rkey = hipzout.rkey;
 	return 0;
@@ -993,10 +997,10 @@ ehca_reg_mr_exit1:
 	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "h_ret=%lx shca=%p e_mr=%p "
 			 "iova_start=%p size=%lx acl=%x e_pd=%p lkey=%x "
-			 "pginfo=%p num_pages=%lx num_4k=%lx ret=%x",
+			 "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%x",
 			 h_ret, shca, e_mr, iova_start, size, acl, e_pd,
-			 hipzout.lkey, pginfo, pginfo->num_pages,
-			 pginfo->num_4k, ret);
+			 hipzout.lkey, pginfo, pginfo->num_kpages,
+			 pginfo->num_hwpages, ret);
 		ehca_err(&shca->ib_device, "internal error in ehca_reg_mr, "
 			 "not recoverable");
 	}
@@ -1004,9 +1008,9 @@ ehca_reg_mr_exit0:
 	if (ret)
 		ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p "
 			 "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p "
-			 "num_pages=%lx num_4k=%lx",
+			 "num_kpages=%lx num_hwpages=%lx",
 			 ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo,
-			 pginfo->num_pages, pginfo->num_4k);
+			 pginfo->num_kpages, pginfo->num_hwpages);
 	return ret;
 } /* end ehca_reg_mr() */
 
@@ -1031,10 +1035,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 	}
 
 	/* max 512 pages per shot */
-	for (i = 0; i < NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES); i++) {
+	for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) {
 
-		if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) {
-			rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */
+		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
+			rnum = pginfo->num_hwpages % MAX_RPAGES; /* last shot */
 			if (rnum == 0)
 				rnum = MAX_RPAGES;      /* last shot is full */
 		} else
@@ -1070,7 +1074,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 						 0, /* pagesize 4k */
 						 0, rpage, rnum);
 
-		if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) {
+		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
 			/*
 			 * check for 'registration complete'==H_SUCCESS
 			 * and for 'page registered'==H_PAGE_REGISTERED
@@ -1106,8 +1110,8 @@ ehca_reg_mr_rpages_exit1:
 ehca_reg_mr_rpages_exit0:
 	if (ret)
 		ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p "
-			 "num_pages=%lx num_4k=%lx", ret, shca, e_mr, pginfo,
-			 pginfo->num_pages, pginfo->num_4k);
+			 "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr,
+			 pginfo, pginfo->num_kpages, pginfo->num_hwpages);
 	return ret;
 } /* end ehca_reg_mr_rpages() */
 
@@ -1142,12 +1146,12 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 	}
 
 	pginfo_save = *pginfo;
-	ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_4k, kpage);
+	ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_hwpages, kpage);
 	if (ret) {
 		ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p "
-			 "pginfo=%p type=%x num_pages=%lx num_4k=%lx kpage=%p",
-			 e_mr, pginfo, pginfo->type, pginfo->num_pages,
-			 pginfo->num_4k,kpage);
+			 "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx "
+			 "kpage=%p", e_mr, pginfo, pginfo->type,
+			 pginfo->num_kpages, pginfo->num_hwpages, kpage);
 		goto ehca_rereg_mr_rereg1_exit1;
 	}
 	rpage = virt_to_abs(kpage);
@@ -1181,11 +1185,11 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 		 * successful reregistration
 		 * note: start and start_out are identical for eServer HCAs
 		 */
-		e_mr->num_pages = pginfo->num_pages;
-		e_mr->num_4k    = pginfo->num_4k;
-		e_mr->start     = iova_start;
-		e_mr->size      = size;
-		e_mr->acl       = acl;
+		e_mr->num_kpages = pginfo->num_kpages;
+		e_mr->num_hwpages = pginfo->num_hwpages;
+		e_mr->start = iova_start;
+		e_mr->size = size;
+		e_mr->acl = acl;
 		*lkey = hipzout.lkey;
 		*rkey = hipzout.rkey;
 	}
@@ -1195,9 +1199,9 @@ ehca_rereg_mr_rereg1_exit1:
 ehca_rereg_mr_rereg1_exit0:
 	if ( ret && (ret != -EAGAIN) )
 		ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x "
-			 "pginfo=%p num_pages=%lx num_4k=%lx",
-			 ret, *lkey, *rkey, pginfo, pginfo->num_pages,
-			 pginfo->num_4k);
+			 "pginfo=%p num_kpages=%lx num_hwpages=%lx",
+			 ret, *lkey, *rkey, pginfo, pginfo->num_kpages,
+			 pginfo->num_hwpages);
 	return ret;
 } /* end ehca_rereg_mr_rereg1() */
 
@@ -1219,10 +1223,12 @@ int ehca_rereg_mr(struct ehca_shca *shca,
 	int rereg_3_hcall = 0; /* 1: use 3 hipz calls for reregistration */
 
 	/* first determine reregistration hCall(s) */
-	if ((pginfo->num_4k > MAX_RPAGES) || (e_mr->num_4k > MAX_RPAGES) ||
-	    (pginfo->num_4k > e_mr->num_4k)) {
-		ehca_dbg(&shca->ib_device, "Rereg3 case, pginfo->num_4k=%lx "
-			 "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k);
+	if ((pginfo->num_hwpages > MAX_RPAGES) ||
+	    (e_mr->num_hwpages > MAX_RPAGES) ||
+	    (pginfo->num_hwpages > e_mr->num_hwpages)) {
+		ehca_dbg(&shca->ib_device, "Rereg3 case, "
+			 "pginfo->num_hwpages=%lx e_mr->num_hwpages=%x",
+			 pginfo->num_hwpages, e_mr->num_hwpages);
 		rereg_1_hcall = 0;
 		rereg_3_hcall = 1;
 	}
@@ -1286,9 +1292,9 @@ ehca_rereg_mr_exit0:
 	if (ret)
 		ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p "
 			 "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p "
-			 "num_pages=%lx lkey=%x rkey=%x rereg_1_hcall=%x "
+			 "num_kpages=%lx lkey=%x rkey=%x rereg_1_hcall=%x "
 			 "rereg_3_hcall=%x", ret, shca, e_mr, iova_start, size,
-			 acl, e_pd, pginfo, pginfo->num_pages, *lkey, *rkey,
+			 acl, e_pd, pginfo, pginfo->num_kpages, *lkey, *rkey,
 			 rereg_1_hcall, rereg_3_hcall);
 	return ret;
 } /* end ehca_rereg_mr() */
@@ -1306,7 +1312,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 		container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd);
 	struct ehca_mr save_fmr;
 	u32 tmp_lkey, tmp_rkey;
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	struct ehca_mr_pginfo pginfo;
 	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
 
 	/* first check if reregistration hCall can be used for unmap */
@@ -1370,9 +1376,10 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 		e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt;
 		e_fmr->acl = save_fmr.acl;
 
-		pginfo.type      = EHCA_MR_PGI_FMR;
-		pginfo.num_pages = 0;
-		pginfo.num_4k    = 0;
+		memset(&pginfo, 0, sizeof(pginfo));
+		pginfo.type = EHCA_MR_PGI_FMR;
+		pginfo.num_kpages = 0;
+		pginfo.num_hwpages = 0;
 		ret = ehca_reg_mr(shca, e_fmr, NULL,
 				  (e_fmr->fmr_max_pages * e_fmr->fmr_page_size),
 				  e_fmr->acl, e_pd, &pginfo, &tmp_lkey,
@@ -1428,11 +1435,11 @@ int ehca_reg_smr(struct ehca_shca *shca,
 		goto ehca_reg_smr_exit0;
 	}
 	/* successful registration */
-	e_newmr->num_pages     = e_origmr->num_pages;
-	e_newmr->num_4k        = e_origmr->num_4k;
-	e_newmr->start         = iova_start;
-	e_newmr->size          = e_origmr->size;
-	e_newmr->acl           = acl;
+	e_newmr->num_kpages = e_origmr->num_kpages;
+	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->start = iova_start;
+	e_newmr->size = e_origmr->size;
+	e_newmr->acl = acl;
 	e_newmr->ipz_mr_handle = hipzout.handle;
 	*lkey = hipzout.lkey;
 	*rkey = hipzout.rkey;
@@ -1458,10 +1465,10 @@ int ehca_reg_internal_maxmr(
 	struct ehca_mr *e_mr;
 	u64 *iova_start;
 	u64 size_maxmr;
-	struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0};
+	struct ehca_mr_pginfo pginfo;
 	struct ib_phys_buf ib_pbuf;
-	u32 num_pages_mr;
-	u32 num_pages_4k; /* 4k portion "pages" */
+	u32 num_kpages;
+	u32 num_hwpages;
 
 	e_mr = ehca_mr_new();
 	if (!e_mr) {
@@ -1476,25 +1483,26 @@ int ehca_reg_internal_maxmr(
 	iova_start = (u64*)KERNELBASE;
 	ib_pbuf.addr = 0;
 	ib_pbuf.size = size_maxmr;
-	num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
-				  PAGE_SIZE);
-	num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE)
-				  + size_maxmr, EHCA_PAGESIZE);
-
-	pginfo.type           = EHCA_MR_PGI_PHYS;
-	pginfo.num_pages      = num_pages_mr;
-	pginfo.num_4k         = num_pages_4k;
-	pginfo.num_phys_buf   = 1;
-	pginfo.phys_buf_array = &ib_pbuf;
+	num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
+				PAGE_SIZE);
+	num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr,
+				 EHCA_PAGESIZE);
+
+	memset(&pginfo, 0, sizeof(pginfo));
+	pginfo.type = EHCA_MR_PGI_PHYS;
+	pginfo.num_kpages = num_kpages;
+	pginfo.num_hwpages = num_hwpages;
+	pginfo.u.phy.num_phys_buf = 1;
+	pginfo.u.phy.phys_buf_array = &ib_pbuf;
 
 	ret = ehca_reg_mr(shca, e_mr, iova_start, size_maxmr, 0, e_pd,
 			  &pginfo, &e_mr->ib.ib_mr.lkey,
 			  &e_mr->ib.ib_mr.rkey);
 	if (ret) {
 		ehca_err(&shca->ib_device, "reg of internal max MR failed, "
-			 "e_mr=%p iova_start=%p size_maxmr=%lx num_pages_mr=%x "
-			 "num_pages_4k=%x", e_mr, iova_start, size_maxmr,
-			 num_pages_mr, num_pages_4k);
+			 "e_mr=%p iova_start=%p size_maxmr=%lx num_kpages=%x "
+			 "num_hwpages=%x", e_mr, iova_start, size_maxmr,
+			 num_kpages, num_hwpages);
 		goto ehca_reg_internal_maxmr_exit1;
 	}
 
@@ -1546,11 +1554,11 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 		return ehca2ib_return_code(h_ret);
 	}
 	/* successful registration */
-	e_newmr->num_pages     = e_origmr->num_pages;
-	e_newmr->num_4k        = e_origmr->num_4k;
-	e_newmr->start         = iova_start;
-	e_newmr->size          = e_origmr->size;
-	e_newmr->acl           = acl;
+	e_newmr->num_kpages = e_origmr->num_kpages;
+	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->start = iova_start;
+	e_newmr->size = e_origmr->size;
+	e_newmr->acl = acl;
 	e_newmr->ipz_mr_handle = hipzout.handle;
 	*lkey = hipzout.lkey;
 	*rkey = hipzout.rkey;
@@ -1693,138 +1701,139 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr,
 	struct ib_umem_chunk *chunk;
 	struct ib_phys_buf *pbuf;
 	u64 *fmrlist;
-	u64 num4k, pgaddr, offs4k;
+	u64 num_hw, pgaddr, offs_hw;
 	u32 i = 0;
 	u32 j = 0;
 
 	if (pginfo->type == EHCA_MR_PGI_PHYS) {
 		/* loop over desired phys_buf_array entries */
 		while (i < number) {
-			pbuf   = pginfo->phys_buf_array + pginfo->next_buf;
-			num4k  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE)
-					    + pbuf->size, EHCA_PAGESIZE);
-			offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
-			while (pginfo->next_4k < offs4k + num4k) {
+			pbuf   = pginfo->u.phy.phys_buf_array
+				+ pginfo->u.phy.next_buf;
+			num_hw  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) +
+					     pbuf->size, EHCA_PAGESIZE);
+			offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
+			while (pginfo->next_hwpage < offs_hw + num_hw) {
 				/* sanity check */
-				if ((pginfo->page_cnt >= pginfo->num_pages) ||
-				    (pginfo->page_4k_cnt >= pginfo->num_4k)) {
-					ehca_gen_err("page_cnt >= num_pages, "
-						     "page_cnt=%lx "
-						     "num_pages=%lx "
-						     "page_4k_cnt=%lx "
-						     "num_4k=%lx i=%x",
-						     pginfo->page_cnt,
-						     pginfo->num_pages,
-						     pginfo->page_4k_cnt,
-						     pginfo->num_4k, i);
+				if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
+				    (pginfo->hwpage_cnt >= pginfo->num_hwpages)) {
+					ehca_gen_err("kpage_cnt >= num_kpages, "
+						     "kpage_cnt=%lx "
+						     "num_kpages=%lx "
+						     "hwpage_cnt=%lx "
+						     "num_hwpages=%lx i=%x",
+						     pginfo->kpage_cnt,
+						     pginfo->num_kpages,
+						     pginfo->hwpage_cnt,
+						     pginfo->num_hwpages, i);
 					ret = -EFAULT;
 					goto ehca_set_pagebuf_exit0;
 				}
 				*kpage = phys_to_abs(
 					(pbuf->addr & EHCA_PAGEMASK)
-					+ (pginfo->next_4k * EHCA_PAGESIZE));
+					+ (pginfo->next_hwpage * EHCA_PAGESIZE));
 				if ( !(*kpage) && pbuf->addr ) {
 					ehca_gen_err("pbuf->addr=%lx "
 						     "pbuf->size=%lx "
-						     "next_4k=%lx", pbuf->addr,
+						     "next_hwpage=%lx", pbuf->addr,
 						     pbuf->size,
-						     pginfo->next_4k);
+						     pginfo->next_hwpage);
 					ret = -EFAULT;
 					goto ehca_set_pagebuf_exit0;
 				}
-				(pginfo->page_4k_cnt)++;
-				(pginfo->next_4k)++;
-				if (pginfo->next_4k %
+				(pginfo->hwpage_cnt)++;
+				(pginfo->next_hwpage)++;
+				if (pginfo->next_hwpage %
 				    (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-					(pginfo->page_cnt)++;
+					(pginfo->kpage_cnt)++;
 				kpage++;
 				i++;
 				if (i >= number) break;
 			}
-			if (pginfo->next_4k >= offs4k + num4k) {
-				(pginfo->next_buf)++;
-				pginfo->next_4k = 0;
+			if (pginfo->next_hwpage >= offs_hw + num_hw) {
+				(pginfo->u.phy.next_buf)++;
+				pginfo->next_hwpage = 0;
 			}
 		}
 	} else if (pginfo->type == EHCA_MR_PGI_USER) {
 		/* loop over desired chunk entries */
-		chunk      = pginfo->next_chunk;
-		prev_chunk = pginfo->next_chunk;
+		chunk      = pginfo->u.usr.next_chunk;
+		prev_chunk = pginfo->u.usr.next_chunk;
 		list_for_each_entry_continue(chunk,
-					     (&(pginfo->region->chunk_list)),
+					     (&(pginfo->u.usr.region->chunk_list)),
 					     list) {
-			for (i = pginfo->next_nmap; i < chunk->nmap; ) {
+			for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) {
 				pgaddr = ( page_to_pfn(chunk->page_list[i].page)
 					   << PAGE_SHIFT );
 				*kpage = phys_to_abs(pgaddr +
-						     (pginfo->next_4k *
+						     (pginfo->next_hwpage *
 						      EHCA_PAGESIZE));
 				if ( !(*kpage) ) {
 					ehca_gen_err("pgaddr=%lx "
 						     "chunk->page_list[i]=%lx "
-						     "i=%x next_4k=%lx mr=%p",
+						     "i=%x next_hwpage=%lx mr=%p",
 						     pgaddr,
 						     (u64)sg_dma_address(
 							     &chunk->
 							     page_list[i]),
-						     i, pginfo->next_4k, e_mr);
+						     i, pginfo->next_hwpage, e_mr);
 					ret = -EFAULT;
 					goto ehca_set_pagebuf_exit0;
 				}
-				(pginfo->page_4k_cnt)++;
-				(pginfo->next_4k)++;
+				(pginfo->hwpage_cnt)++;
+				(pginfo->next_hwpage)++;
 				kpage++;
-				if (pginfo->next_4k %
+				if (pginfo->next_hwpage %
 				    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
-					(pginfo->page_cnt)++;
-					(pginfo->next_nmap)++;
-					pginfo->next_4k = 0;
+					(pginfo->kpage_cnt)++;
+					(pginfo->u.usr.next_nmap)++;
+					pginfo->next_hwpage = 0;
 					i++;
 				}
 				j++;
 				if (j >= number) break;
 			}
-			if ((pginfo->next_nmap >= chunk->nmap) &&
+			if ((pginfo->u.usr.next_nmap >= chunk->nmap) &&
 			    (j >= number)) {
-				pginfo->next_nmap = 0;
+				pginfo->u.usr.next_nmap = 0;
 				prev_chunk = chunk;
 				break;
-			} else if (pginfo->next_nmap >= chunk->nmap) {
-				pginfo->next_nmap = 0;
+			} else if (pginfo->u.usr.next_nmap >= chunk->nmap) {
+				pginfo->u.usr.next_nmap = 0;
 				prev_chunk = chunk;
 			} else if (j >= number)
 				break;
 			else
 				prev_chunk = chunk;
 		}
-		pginfo->next_chunk =
+		pginfo->u.usr.next_chunk =
 			list_prepare_entry(prev_chunk,
-					   (&(pginfo->region->chunk_list)),
+					   (&(pginfo->u.usr.region->chunk_list)),
 					   list);
 	} else if (pginfo->type == EHCA_MR_PGI_FMR) {
 		/* loop over desired page_list entries */
-		fmrlist = pginfo->page_list + pginfo->next_listelem;
+		fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
 		for (i = 0; i < number; i++) {
 			*kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
-					     pginfo->next_4k * EHCA_PAGESIZE);
+					     pginfo->next_hwpage * EHCA_PAGESIZE);
 			if ( !(*kpage) ) {
 				ehca_gen_err("*fmrlist=%lx fmrlist=%p "
-					     "next_listelem=%lx next_4k=%lx",
+					     "next_listelem=%lx next_hwpage=%lx",
 					     *fmrlist, fmrlist,
-					     pginfo->next_listelem,
-					     pginfo->next_4k);
+					     pginfo->u.fmr.next_listelem,
+					     pginfo->next_hwpage);
 				ret = -EFAULT;
 				goto ehca_set_pagebuf_exit0;
 			}
-			(pginfo->page_4k_cnt)++;
-			(pginfo->next_4k)++;
+			(pginfo->hwpage_cnt)++;
+			(pginfo->next_hwpage)++;
 			kpage++;
-			if (pginfo->next_4k %
+			if (pginfo->next_hwpage %
 			    (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) {
-				(pginfo->page_cnt)++;
-				(pginfo->next_listelem)++;
+				(pginfo->kpage_cnt)++;
+				(pginfo->u.fmr.next_listelem)++;
 				fmrlist++;
-				pginfo->next_4k = 0;
+				pginfo->next_hwpage = 0;
 			}
 		}
 	} else {
@@ -1835,16 +1844,16 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr,
 
 ehca_set_pagebuf_exit0:
 	if (ret)
-		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx "
-			     "num_4k=%lx next_buf=%lx next_4k=%lx number=%x "
-			     "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x "
+		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx "
+			     "num_hwpages=%lx next_buf=%lx next_hwpage=%lx number=%x "
+			     "kpage=%p kpage_cnt=%lx hwpage_cnt=%lx i=%x "
 			     "next_listelem=%lx region=%p next_chunk=%p "
 			     "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type,
-			     pginfo->num_pages, pginfo->num_4k,
-			     pginfo->next_buf, pginfo->next_4k, number, kpage,
-			     pginfo->page_cnt, pginfo->page_4k_cnt, i,
-			     pginfo->next_listelem, pginfo->region,
-			     pginfo->next_chunk, pginfo->next_nmap);
+			     pginfo->num_kpages, pginfo->num_hwpages,
+			     pginfo->u.phy.next_buf, pginfo->next_hwpage, number, kpage,
+			     pginfo->kpage_cnt, pginfo->hwpage_cnt, i,
+			     pginfo->u.fmr.next_listelem, pginfo->u.usr.region,
+			     pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap);
 	return ret;
 } /* end ehca_set_pagebuf() */
 
@@ -1860,101 +1869,101 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr,
 	u64 *fmrlist;
 	struct ib_umem_chunk *chunk;
 	struct ib_umem_chunk *prev_chunk;
-	u64 pgaddr, num4k, offs4k;
+	u64 pgaddr, num_hw, offs_hw;
 
 	if (pginfo->type == EHCA_MR_PGI_PHYS) {
 		/* sanity check */
-		if ((pginfo->page_cnt >= pginfo->num_pages) ||
-		    (pginfo->page_4k_cnt >= pginfo->num_4k)) {
-			ehca_gen_err("page_cnt >= num_pages, page_cnt=%lx "
-				     "num_pages=%lx page_4k_cnt=%lx num_4k=%lx",
-				     pginfo->page_cnt, pginfo->num_pages,
-				     pginfo->page_4k_cnt, pginfo->num_4k);
+		if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
+		    (pginfo->hwpage_cnt >= pginfo->num_hwpages)) {
+			ehca_gen_err("kpage_cnt >= num_kpages, kpage_cnt=%lx "
+				     "num_kpages=%lx hwpage_cnt=%lx num_hwpages=%lx",
+				     pginfo->kpage_cnt, pginfo->num_kpages,
+				     pginfo->hwpage_cnt, pginfo->num_hwpages);
 			ret = -EFAULT;
 			goto ehca_set_pagebuf_1_exit0;
 		}
-		tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf;
-		num4k  = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) +
-				    tmp_pbuf->size, EHCA_PAGESIZE);
-		offs4k = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
+		tmp_pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf;
+		num_hw  = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) +
+				     tmp_pbuf->size, EHCA_PAGESIZE);
+		offs_hw = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
 		*rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) +
-				     (pginfo->next_4k * EHCA_PAGESIZE));
+				     (pginfo->next_hwpage * EHCA_PAGESIZE));
 		if ( !(*rpage) && tmp_pbuf->addr ) {
 			ehca_gen_err("tmp_pbuf->addr=%lx"
-				     " tmp_pbuf->size=%lx next_4k=%lx",
+				     " tmp_pbuf->size=%lx next_hwpage=%lx",
 				     tmp_pbuf->addr, tmp_pbuf->size,
-				     pginfo->next_4k);
+				     pginfo->next_hwpage);
 			ret = -EFAULT;
 			goto ehca_set_pagebuf_1_exit0;
 		}
-		(pginfo->page_4k_cnt)++;
-		(pginfo->next_4k)++;
-		if (pginfo->next_4k % (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-			(pginfo->page_cnt)++;
-		if (pginfo->next_4k >= offs4k + num4k) {
-			(pginfo->next_buf)++;
-			pginfo->next_4k = 0;
+		(pginfo->hwpage_cnt)++;
+		(pginfo->next_hwpage)++;
+		if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0)
+			(pginfo->kpage_cnt)++;
+		if (pginfo->next_hwpage >= offs_hw + num_hw) {
+			(pginfo->u.phy.next_buf)++;
+			pginfo->next_hwpage = 0;
 		}
 	} else if (pginfo->type == EHCA_MR_PGI_USER) {
-		chunk      = pginfo->next_chunk;
-		prev_chunk = pginfo->next_chunk;
+		chunk = pginfo->u.usr.next_chunk;
+		prev_chunk = pginfo->u.usr.next_chunk;
 		list_for_each_entry_continue(chunk,
-					     (&(pginfo->region->chunk_list)),
+					     (&(pginfo->u.usr.region->chunk_list)),
 					     list) {
 			pgaddr = ( page_to_pfn(chunk->page_list[
-						       pginfo->next_nmap].page)
+						       pginfo->u.usr.next_nmap].page)
 				   << PAGE_SHIFT);
 			*rpage = phys_to_abs(pgaddr +
-					     (pginfo->next_4k * EHCA_PAGESIZE));
+					     (pginfo->next_hwpage * EHCA_PAGESIZE));
 			if ( !(*rpage) ) {
 				ehca_gen_err("pgaddr=%lx chunk->page_list[]=%lx"
-					     " next_nmap=%lx next_4k=%lx mr=%p",
+					     " next_nmap=%lx next_hwpage=%lx mr=%p",
 					     pgaddr, (u64)sg_dma_address(
 						     &chunk->page_list[
-							     pginfo->
+							     pginfo->u.usr.
 							     next_nmap]),
-					     pginfo->next_nmap, pginfo->next_4k,
+					     pginfo->u.usr.next_nmap, pginfo->next_hwpage,
 					     e_mr);
 				ret = -EFAULT;
 				goto ehca_set_pagebuf_1_exit0;
 			}
-			(pginfo->page_4k_cnt)++;
-			(pginfo->next_4k)++;
-			if (pginfo->next_4k %
+			(pginfo->hwpage_cnt)++;
+			(pginfo->next_hwpage)++;
+			if (pginfo->next_hwpage %
 			    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
-				(pginfo->page_cnt)++;
-				(pginfo->next_nmap)++;
-				pginfo->next_4k = 0;
+				(pginfo->kpage_cnt)++;
+				(pginfo->u.usr.next_nmap)++;
+				pginfo->next_hwpage = 0;
 			}
-			if (pginfo->next_nmap >= chunk->nmap) {
-				pginfo->next_nmap = 0;
+			if (pginfo->u.usr.next_nmap >= chunk->nmap) {
+				pginfo->u.usr.next_nmap = 0;
 				prev_chunk = chunk;
 			}
 			break;
 		}
-		pginfo->next_chunk =
+		pginfo->u.usr.next_chunk =
 			list_prepare_entry(prev_chunk,
-					   (&(pginfo->region->chunk_list)),
+					   (&(pginfo->u.usr.region->chunk_list)),
 					   list);
 	} else if (pginfo->type == EHCA_MR_PGI_FMR) {
-		fmrlist = pginfo->page_list + pginfo->next_listelem;
+		fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
 		*rpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
-				     pginfo->next_4k * EHCA_PAGESIZE);
+				     pginfo->next_hwpage * EHCA_PAGESIZE);
 		if ( !(*rpage) ) {
 			ehca_gen_err("*fmrlist=%lx fmrlist=%p "
-				     "next_listelem=%lx next_4k=%lx",
-				     *fmrlist, fmrlist, pginfo->next_listelem,
-				     pginfo->next_4k);
+				     "next_listelem=%lx next_hwpage=%lx",
+				     *fmrlist, fmrlist, pginfo->u.fmr.next_listelem,
+				     pginfo->next_hwpage);
 			ret = -EFAULT;
 			goto ehca_set_pagebuf_1_exit0;
 		}
-		(pginfo->page_4k_cnt)++;
-		(pginfo->next_4k)++;
-		if (pginfo->next_4k %
+		(pginfo->hwpage_cnt)++;
+		(pginfo->next_hwpage)++;
+		if (pginfo->next_hwpage %
 		    (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) {
-			(pginfo->page_cnt)++;
-			(pginfo->next_listelem)++;
-			pginfo->next_4k = 0;
+			(pginfo->kpage_cnt)++;
+			(pginfo->u.fmr.next_listelem)++;
+			pginfo->next_hwpage = 0;
 		}
 	} else {
 		ehca_gen_err("bad pginfo->type=%x", pginfo->type);
@@ -1964,15 +1973,15 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr,
 
 ehca_set_pagebuf_1_exit0:
 	if (ret)
-		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx "
-			     "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p "
-			     "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx "
+		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx "
+			     "num_hwpages=%lx next_buf=%lx next_hwpage=%lx rpage=%p "
+			     "kpage_cnt=%lx hwpage_cnt=%lx next_listelem=%lx "
 			     "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr,
-			     pginfo, pginfo->type, pginfo->num_pages,
-			     pginfo->num_4k, pginfo->next_buf, pginfo->next_4k,
-			     rpage, pginfo->page_cnt, pginfo->page_4k_cnt,
-			     pginfo->next_listelem, pginfo->region,
-			     pginfo->next_chunk, pginfo->next_nmap);
+			     pginfo, pginfo->type, pginfo->num_kpages,
+			     pginfo->num_hwpages, pginfo->u.phy.next_buf, pginfo->next_hwpage,
+			     rpage, pginfo->kpage_cnt, pginfo->hwpage_cnt,
+			     pginfo->u.fmr.next_listelem, pginfo->u.usr.region,
+			     pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap);
 	return ret;
 } /* end ehca_set_pagebuf_1() */
 
@@ -2053,19 +2062,17 @@ void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl,
  */
 void ehca_mr_deletenew(struct ehca_mr *mr)
 {
-	mr->flags         = 0;
-	mr->num_pages     = 0;
-	mr->num_4k        = 0;
-	mr->acl           = 0;
-	mr->start         = NULL;
+	mr->flags = 0;
+	mr->num_kpages = 0;
+	mr->num_hwpages = 0;
+	mr->acl = 0;
+	mr->start = NULL;
 	mr->fmr_page_size = 0;
 	mr->fmr_max_pages = 0;
-	mr->fmr_max_maps  = 0;
-	mr->fmr_map_cnt   = 0;
+	mr->fmr_max_maps = 0;
+	mr->fmr_map_cnt = 0;
 	memset(&mr->ipz_mr_handle, 0, sizeof(mr->ipz_mr_handle));
 	memset(&mr->galpas, 0, sizeof(mr->galpas));
-	mr->nr_of_pages   = 0;
-	mr->pagearray     = NULL;
 } /* end ehca_mr_deletenew() */
 
 int ehca_init_mrmw_cache(void)
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:53:11 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:53:11 +0200
Subject: [ofa-general] [PATCH 08/10] IB/ehca: Restructure ehca_set_pagebuf()
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121753.12404.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Split ehca_set_pagebuf() into three functions, one per MR type
(phys, user, fast), and remove the now-superfluous ehca_set_pagebuf_1().

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_mrmw.c |  531 ++++++++++++--------------------
 1 files changed, 200 insertions(+), 331 deletions(-)
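
For readers skimming the series, the restructuring described above can be
pictured with the small user-space model below. It is only a sketch under
stated assumptions, not the driver code: struct pginfo, the PGI_* constants,
the set_pagebuf_*() names, HW_PAGESIZE and KPAGE_SIZE are all hypothetical
stand-ins chosen for illustration, and the user and fast cases are reduced to
a plain page list. The idea shown is a thin dispatcher that selects one helper
per MR type, with each helper emitting one hardware-page address per requested
entry.

```c
#include <stddef.h>
#include <stdint.h>

#define HW_PAGESIZE 4096ULL              /* assumed; stands in for EHCA_PAGESIZE */
#define HW_PAGEMASK (~(HW_PAGESIZE - 1))
#define KPAGE_SIZE  65536ULL             /* assumed kernel page size */

enum pgi_type { PGI_PHYS, PGI_USER, PGI_FMR };   /* hypothetical names */

struct pginfo {                          /* simplified model, not ehca_mr_pginfo */
    enum pgi_type type;
    uint64_t hwpage_cnt;                 /* hardware pages emitted so far */
    uint64_t next_hwpage;                /* hw page index inside current element */
    union {
        struct { uint64_t addr; } phy;   /* one contiguous buffer */
        struct {
            const uint64_t *pages;       /* page start addresses */
            size_t count;                /* entries in pages[] */
            size_t next;                 /* next entry to consume */
            uint64_t pgsize;             /* page size of an entry (fast MRs) */
        } lst;
    } u;
};

/* phys: walk a contiguous buffer one hardware page at a time */
static int set_pagebuf_phys(struct pginfo *pi, uint32_t number, uint64_t *out)
{
    for (uint32_t i = 0; i < number; i++) {
        out[i] = (pi->u.phy.addr & HW_PAGEMASK) +
                 pi->next_hwpage * HW_PAGESIZE;
        pi->next_hwpage++;
        pi->hwpage_cnt++;
    }
    return 0;
}

/* shared walk over a page list whose entries are pgsize bytes each */
static int fill_from_list(struct pginfo *pi, uint64_t pgsize,
                          uint32_t number, uint64_t *out)
{
    for (uint32_t i = 0; i < number; i++) {
        if (pi->u.lst.next >= pi->u.lst.count)
            return -1;                   /* ran past the end of the list */
        out[i] = (pi->u.lst.pages[pi->u.lst.next] & HW_PAGEMASK) +
                 pi->next_hwpage * HW_PAGESIZE;
        pi->hwpage_cnt++;
        pi->next_hwpage++;
        /* entry fully consumed: advance to the next list element */
        if (pi->next_hwpage % (pgsize / HW_PAGESIZE) == 0) {
            pi->u.lst.next++;
            pi->next_hwpage = 0;
        }
    }
    return 0;
}

/* user: list entries are kernel pages */
static int set_pagebuf_user(struct pginfo *pi, uint32_t number, uint64_t *out)
{
    return fill_from_list(pi, KPAGE_SIZE, number, out);
}

/* fast: list entries use the page size stored in the page info */
static int set_pagebuf_fmr(struct pginfo *pi, uint32_t number, uint64_t *out)
{
    return fill_from_list(pi, pi->u.lst.pgsize, number, out);
}

/* dispatcher: callers no longer pass the MR, only the page info */
int set_pagebuf(struct pginfo *pi, uint32_t number, uint64_t *out)
{
    switch (pi->type) {
    case PGI_PHYS: return set_pagebuf_phys(pi, number, out);
    case PGI_USER: return set_pagebuf_user(pi, number, out);
    case PGI_FMR:  return set_pagebuf_fmr(pi, number, out);
    default:       return -1;            /* unknown page info type */
    }
}
```

In this model a request for a single page goes through the same dispatcher as
a request for many, which is why a separate one-page variant is not needed.
Keeping the fast-MR page size inside the page info is also what lets the
dispatcher drop the MR argument.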

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 53b334b..93c26cc 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -824,6 +824,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 	pginfo.u.fmr.page_list = page_list;
 	pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) /
 			      EHCA_PAGESIZE);
+	pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size;
 
 	ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova,
 			    list_len * e_fmr->fmr_page_size,
@@ -1044,15 +1045,15 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		} else
 			rnum = MAX_RPAGES;
 
-		if (rnum > 1) {
-			ret = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage);
-			if (ret) {
-				ehca_err(&shca->ib_device, "ehca_set_pagebuf "
+		ret = ehca_set_pagebuf(pginfo, rnum, kpage);
+		if (ret) {
+			ehca_err(&shca->ib_device, "ehca_set_pagebuf "
 					 "bad rc, ret=%x rnum=%x kpage=%p",
 					 ret, rnum, kpage);
-				ret = -EFAULT;
-				goto ehca_reg_mr_rpages_exit1;
-			}
+			goto ehca_reg_mr_rpages_exit1;
+		}
+
+		if (rnum > 1) {
 			rpage = virt_to_abs(kpage);
 			if (!rpage) {
 				ehca_err(&shca->ib_device, "kpage=%p i=%x",
@@ -1060,15 +1061,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 				ret = -EFAULT;
 				goto ehca_reg_mr_rpages_exit1;
 			}
-		} else {  /* rnum==1 */
-			ret = ehca_set_pagebuf_1(e_mr, pginfo, &rpage);
-			if (ret) {
-				ehca_err(&shca->ib_device, "ehca_set_pagebuf_1 "
-					 "bad rc, ret=%x i=%x", ret, i);
-				ret = -EFAULT;
-				goto ehca_reg_mr_rpages_exit1;
-			}
-		}
+		} else
+			rpage = *kpage;
 
 		h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr,
 						 0, /* pagesize 4k */
@@ -1146,7 +1140,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 	}
 
 	pginfo_save = *pginfo;
-	ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_hwpages, kpage);
+	ret = ehca_set_pagebuf(pginfo, pginfo->num_hwpages, kpage);
 	if (ret) {
 		ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p "
 			 "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx "
@@ -1306,98 +1300,86 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 {
 	int ret = 0;
 	u64 h_ret;
-	int rereg_1_hcall = 1; /* 1: use hipz_mr_reregister directly */
-	int rereg_3_hcall = 0; /* 1: use 3 hipz calls for unmapping */
 	struct ehca_pd *e_pd =
 		container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd);
 	struct ehca_mr save_fmr;
 	u32 tmp_lkey, tmp_rkey;
 	struct ehca_mr_pginfo pginfo;
 	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr save_mr;
 
-	/* first check if reregistration hCall can be used for unmap */
-	if (e_fmr->fmr_max_pages > MAX_RPAGES) {
-		rereg_1_hcall = 0;
-		rereg_3_hcall = 1;
-	}
-
-	if (rereg_1_hcall) {
+	if (e_fmr->fmr_max_pages <= MAX_RPAGES) {
 		/*
 		 * note: after using rereg hcall with len=0,
 		 * rereg hcall must be used again for registering pages
 		 */
 		h_ret = hipz_h_reregister_pmr(shca->ipz_hca_handle, e_fmr, 0,
 					      0, 0, e_pd->fw_pd, 0, &hipzout);
-		if (h_ret != H_SUCCESS) {
-			/*
-			 * should not happen, because length checked above,
-			 * FMRs are not shared and no MW bound to FMRs
-			 */
-			ehca_err(&shca->ib_device, "hipz_reregister_pmr failed "
-				 "(Rereg1), h_ret=%lx e_fmr=%p hca_hndl=%lx "
-				 "mr_hndl=%lx lkey=%x lkey_out=%x",
-				 h_ret, e_fmr, shca->ipz_hca_handle.handle,
-				 e_fmr->ipz_mr_handle.handle,
-				 e_fmr->ib.ib_fmr.lkey, hipzout.lkey);
-			rereg_3_hcall = 1;
-		} else {
+		if (h_ret == H_SUCCESS) {
 			/* successful reregistration */
 			e_fmr->start = NULL;
 			e_fmr->size = 0;
 			tmp_lkey = hipzout.lkey;
 			tmp_rkey = hipzout.rkey;
+			return 0;
 		}
+		/*
+		 * should not happen, because length checked above,
+		 * FMRs are not shared and no MW bound to FMRs
+		 */
+		ehca_err(&shca->ib_device, "hipz_reregister_pmr failed "
+			 "(Rereg1), h_ret=%lx e_fmr=%p hca_hndl=%lx "
+			 "mr_hndl=%lx lkey=%x lkey_out=%x",
+			 h_ret, e_fmr, shca->ipz_hca_handle.handle,
+			 e_fmr->ipz_mr_handle.handle,
+			 e_fmr->ib.ib_fmr.lkey, hipzout.lkey);
+		/* try free and rereg */
 	}
 
-	if (rereg_3_hcall) {
-		struct ehca_mr save_mr;
-
-		/* first free old FMR */
-		h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr);
-		if (h_ret != H_SUCCESS) {
-			ehca_err(&shca->ib_device, "hipz_free_mr failed, "
-				 "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx "
-				 "lkey=%x",
-				 h_ret, e_fmr, shca->ipz_hca_handle.handle,
-				 e_fmr->ipz_mr_handle.handle,
-				 e_fmr->ib.ib_fmr.lkey);
-			ret = ehca2ib_return_code(h_ret);
-			goto ehca_unmap_one_fmr_exit0;
-		}
-		/* clean ehca_mr_t, without changing lock */
-		save_fmr = *e_fmr;
-		ehca_mr_deletenew(e_fmr);
+	/* first free old FMR */
+	h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr);
+	if (h_ret != H_SUCCESS) {
+		ehca_err(&shca->ib_device, "hipz_free_mr failed, "
+			 "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx "
+			 "lkey=%x",
+			 h_ret, e_fmr, shca->ipz_hca_handle.handle,
+			 e_fmr->ipz_mr_handle.handle,
+			 e_fmr->ib.ib_fmr.lkey);
+		ret = ehca2ib_return_code(h_ret);
+		goto ehca_unmap_one_fmr_exit0;
+	}
+	/* clean ehca_mr_t, without changing lock */
+	save_fmr = *e_fmr;
+	ehca_mr_deletenew(e_fmr);
 
-		/* set some MR values */
-		e_fmr->flags = save_fmr.flags;
-		e_fmr->fmr_page_size = save_fmr.fmr_page_size;
-		e_fmr->fmr_max_pages = save_fmr.fmr_max_pages;
-		e_fmr->fmr_max_maps = save_fmr.fmr_max_maps;
-		e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt;
-		e_fmr->acl = save_fmr.acl;
+	/* set some MR values */
+	e_fmr->flags = save_fmr.flags;
+	e_fmr->fmr_page_size = save_fmr.fmr_page_size;
+	e_fmr->fmr_max_pages = save_fmr.fmr_max_pages;
+	e_fmr->fmr_max_maps = save_fmr.fmr_max_maps;
+	e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt;
+	e_fmr->acl = save_fmr.acl;
 
-		memset(&pginfo, 0, sizeof(pginfo));
-		pginfo.type = EHCA_MR_PGI_FMR;
-		pginfo.num_kpages = 0;
-		pginfo.num_hwpages = 0;
-		ret = ehca_reg_mr(shca, e_fmr, NULL,
-				  (e_fmr->fmr_max_pages * e_fmr->fmr_page_size),
-				  e_fmr->acl, e_pd, &pginfo, &tmp_lkey,
-				  &tmp_rkey);
-		if (ret) {
-			u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr;
-			memcpy(&e_fmr->flags, &(save_mr.flags),
-			       sizeof(struct ehca_mr) - offset);
-			goto ehca_unmap_one_fmr_exit0;
-		}
+	memset(&pginfo, 0, sizeof(pginfo));
+	pginfo.type = EHCA_MR_PGI_FMR;
+	pginfo.num_kpages = 0;
+	pginfo.num_hwpages = 0;
+	ret = ehca_reg_mr(shca, e_fmr, NULL,
+			  (e_fmr->fmr_max_pages * e_fmr->fmr_page_size),
+			  e_fmr->acl, e_pd, &pginfo, &tmp_lkey,
+			  &tmp_rkey);
+	if (ret) {
+		u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr;
+		memcpy(&e_fmr->flags, &(save_mr.flags),
+		       sizeof(struct ehca_mr) - offset);
+		goto ehca_unmap_one_fmr_exit0;
 	}
 
 ehca_unmap_one_fmr_exit0:
 	if (ret)
 		ehca_err(&shca->ib_device, "ret=%x tmp_lkey=%x tmp_rkey=%x "
-			 "fmr_max_pages=%x rereg_1_hcall=%x rereg_3_hcall=%x",
-			 ret, tmp_lkey, tmp_rkey, e_fmr->fmr_max_pages,
-			 rereg_1_hcall, rereg_3_hcall);
+			 "fmr_max_pages=%x",
+			 ret, tmp_lkey, tmp_rkey, e_fmr->fmr_max_pages);
 	return ret;
 } /* end ehca_unmap_one_fmr() */
 
@@ -1690,300 +1672,187 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr,
 
 /*----------------------------------------------------------------------*/
 
-/* setup page buffer from page info */
-int ehca_set_pagebuf(struct ehca_mr *e_mr,
-		     struct ehca_mr_pginfo *pginfo,
-		     u32 number,
-		     u64 *kpage)
+/* PAGE_SIZE >= pginfo->hwpage_size */
+static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
+				  u32 number,
+				  u64 *kpage)
 {
 	int ret = 0;
 	struct ib_umem_chunk *prev_chunk;
 	struct ib_umem_chunk *chunk;
-	struct ib_phys_buf *pbuf;
-	u64 *fmrlist;
-	u64 num_hw, pgaddr, offs_hw;
+	u64 pgaddr;
 	u32 i = 0;
 	u32 j = 0;
 
-	if (pginfo->type == EHCA_MR_PGI_PHYS) {
-		/* loop over desired phys_buf_array entries */
-		while (i < number) {
-			pbuf   = pginfo->u.phy.phys_buf_array
-				+ pginfo->u.phy.next_buf;
-			num_hw  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) +
-					     pbuf->size, EHCA_PAGESIZE);
-			offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
-			while (pginfo->next_hwpage < offs_hw + num_hw) {
-				/* sanity check */
-				if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
-				    (pginfo->hwpage_cnt >= pginfo->num_hwpages)) {
-					ehca_gen_err("kpage_cnt >= num_kpages, "
-						     "kpage_cnt=%lx "
-						     "num_kpages=%lx "
-						     "hwpage_cnt=%lx "
-						     "num_hwpages=%lx i=%x",
-						     pginfo->kpage_cnt,
-						     pginfo->num_kpages,
-						     pginfo->hwpage_cnt,
-						     pginfo->num_hwpages, i);
-					ret = -EFAULT;
-					goto ehca_set_pagebuf_exit0;
-				}
-				*kpage = phys_to_abs(
-					(pbuf->addr & EHCA_PAGEMASK)
-					+ (pginfo->next_hwpage * EHCA_PAGESIZE));
-				if ( !(*kpage) && pbuf->addr ) {
-					ehca_gen_err("pbuf->addr=%lx "
-						     "pbuf->size=%lx "
-						     "next_hwpage=%lx", pbuf->addr,
-						     pbuf->size,
-						     pginfo->next_hwpage);
-					ret = -EFAULT;
-					goto ehca_set_pagebuf_exit0;
-				}
-				(pginfo->hwpage_cnt)++;
-				(pginfo->next_hwpage)++;
-				if (pginfo->next_hwpage %
-				    (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-					(pginfo->kpage_cnt)++;
-				kpage++;
-				i++;
-				if (i >= number) break;
-			}
-			if (pginfo->next_hwpage >= offs_hw + num_hw) {
-				(pginfo->u.phy.next_buf)++;
-				pginfo->next_hwpage = 0;
-			}
-		}
-	} else if (pginfo->type == EHCA_MR_PGI_USER) {
-		/* loop over desired chunk entries */
-		chunk      = pginfo->u.usr.next_chunk;
-		prev_chunk = pginfo->u.usr.next_chunk;
-		list_for_each_entry_continue(chunk,
-					     (&(pginfo->u.usr.region->chunk_list)),
-					     list) {
-			for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) {
-				pgaddr = ( page_to_pfn(chunk->page_list[i].page)
-					   << PAGE_SHIFT );
-				*kpage = phys_to_abs(pgaddr +
-						     (pginfo->next_hwpage *
-						      EHCA_PAGESIZE));
-				if ( !(*kpage) ) {
-					ehca_gen_err("pgaddr=%lx "
-						     "chunk->page_list[i]=%lx "
-						     "i=%x next_hwpage=%lx mr=%p",
-						     pgaddr,
-						     (u64)sg_dma_address(
-							     &chunk->
-							     page_list[i]),
-						     i, pginfo->next_hwpage, e_mr);
-					ret = -EFAULT;
-					goto ehca_set_pagebuf_exit0;
-				}
-				(pginfo->hwpage_cnt)++;
-				(pginfo->next_hwpage)++;
-				kpage++;
-				if (pginfo->next_hwpage %
-				    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
-					(pginfo->kpage_cnt)++;
-					(pginfo->u.usr.next_nmap)++;
-					pginfo->next_hwpage = 0;
-					i++;
-				}
-				j++;
-				if (j >= number) break;
-			}
-			if ((pginfo->u.usr.next_nmap >= chunk->nmap) &&
-			    (j >= number)) {
-				pginfo->u.usr.next_nmap = 0;
-				prev_chunk = chunk;
-				break;
-			} else if (pginfo->u.usr.next_nmap >= chunk->nmap) {
-				pginfo->u.usr.next_nmap = 0;
-				prev_chunk = chunk;
-			} else if (j >= number)
-				break;
-			else
-				prev_chunk = chunk;
-		}
-		pginfo->u.usr.next_chunk =
-			list_prepare_entry(prev_chunk,
-					   (&(pginfo->u.usr.region->chunk_list)),
-					   list);
-	} else if (pginfo->type == EHCA_MR_PGI_FMR) {
-		/* loop over desired page_list entries */
-		fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
-		for (i = 0; i < number; i++) {
-			*kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
-					     pginfo->next_hwpage * EHCA_PAGESIZE);
+	/* loop over desired chunk entries */
+	chunk      = pginfo->u.usr.next_chunk;
+	prev_chunk = pginfo->u.usr.next_chunk;
+	list_for_each_entry_continue(
+		chunk, (&(pginfo->u.usr.region->chunk_list)), list) {
+		for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) {
+			pgaddr = page_to_pfn(chunk->page_list[i].page)
+				<< PAGE_SHIFT ;
+			*kpage = phys_to_abs(pgaddr +
+					     (pginfo->next_hwpage *
+					      EHCA_PAGESIZE));
 			if ( !(*kpage) ) {
-				ehca_gen_err("*fmrlist=%lx fmrlist=%p "
-					     "next_listelem=%lx next_hwpage=%lx",
-					     *fmrlist, fmrlist,
-					     pginfo->u.fmr.next_listelem,
-					     pginfo->next_hwpage);
-				ret = -EFAULT;
-				goto ehca_set_pagebuf_exit0;
+				ehca_gen_err("pgaddr=%lx "
+					     "chunk->page_list[i]=%lx "
+					     "i=%x next_hwpage=%lx",
+					     pgaddr, (u64)sg_dma_address(
+						     &chunk->page_list[i]),
+					     i, pginfo->next_hwpage);
+				return -EFAULT;
 			}
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
 			kpage++;
 			if (pginfo->next_hwpage %
-			    (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) {
+			    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
 				(pginfo->kpage_cnt)++;
-				(pginfo->u.fmr.next_listelem)++;
-				fmrlist++;
+				(pginfo->u.usr.next_nmap)++;
 				pginfo->next_hwpage = 0;
+				i++;
 			}
+			j++;
+			if (j >= number) break;
 		}
-	} else {
-		ehca_gen_err("bad pginfo->type=%x", pginfo->type);
-		ret = -EFAULT;
-		goto ehca_set_pagebuf_exit0;
+		if ((pginfo->u.usr.next_nmap >= chunk->nmap) &&
+		    (j >= number)) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+			break;
+		} else if (pginfo->u.usr.next_nmap >= chunk->nmap) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+		} else if (j >= number)
+			break;
+		else
+			prev_chunk = chunk;
 	}
-
-ehca_set_pagebuf_exit0:
-	if (ret)
-		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx "
-			     "num_hwpages=%lx next_buf=%lx next_hwpage=%lx number=%x "
-			     "kpage=%p kpage_cnt=%lx hwpage_cnt=%lx i=%x "
-			     "next_listelem=%lx region=%p next_chunk=%p "
-			     "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type,
-			     pginfo->num_kpages, pginfo->num_hwpages,
-			     pginfo->u.phy.next_buf, pginfo->next_hwpage, number, kpage,
-			     pginfo->kpage_cnt, pginfo->hwpage_cnt, i,
-			     pginfo->u.fmr.next_listelem, pginfo->u.usr.region,
-			     pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap);
+	pginfo->u.usr.next_chunk =
+		list_prepare_entry(prev_chunk,
+				   (&(pginfo->u.usr.region->chunk_list)),
+				   list);
 	return ret;
-} /* end ehca_set_pagebuf() */
-
-/*----------------------------------------------------------------------*/
+}
 
-/* setup 1 page from page info page buffer */
-int ehca_set_pagebuf_1(struct ehca_mr *e_mr,
-		       struct ehca_mr_pginfo *pginfo,
-		       u64 *rpage)
+int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
+			  u32 number,
+			  u64 *kpage)
 {
 	int ret = 0;
-	struct ib_phys_buf *tmp_pbuf;
-	u64 *fmrlist;
-	struct ib_umem_chunk *chunk;
-	struct ib_umem_chunk *prev_chunk;
-	u64 pgaddr, num_hw, offs_hw;
-
-	if (pginfo->type == EHCA_MR_PGI_PHYS) {
-		/* sanity check */
-		if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
-		    (pginfo->hwpage_cnt >= pginfo->num_hwpages)) {
-			ehca_gen_err("kpage_cnt >= num_hwpages, kpage_cnt=%lx "
-				     "num_hwpages=%lx hwpage_cnt=%lx num_hwpages=%lx",
-				     pginfo->kpage_cnt, pginfo->num_kpages,
-				     pginfo->hwpage_cnt, pginfo->num_hwpages);
-			ret = -EFAULT;
-			goto ehca_set_pagebuf_1_exit0;
-		}
-		tmp_pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf;
-		num_hw  = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) +
-				     tmp_pbuf->size, EHCA_PAGESIZE);
-		offs_hw = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
-		*rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) +
-				     (pginfo->next_hwpage * EHCA_PAGESIZE));
-		if ( !(*rpage) && tmp_pbuf->addr ) {
-			ehca_gen_err("tmp_pbuf->addr=%lx"
-				     " tmp_pbuf->size=%lx next_hwpage=%lx",
-				     tmp_pbuf->addr, tmp_pbuf->size,
-				     pginfo->next_hwpage);
-			ret = -EFAULT;
-			goto ehca_set_pagebuf_1_exit0;
-		}
-		(pginfo->hwpage_cnt)++;
-		(pginfo->next_hwpage)++;
-		if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-			(pginfo->kpage_cnt)++;
-		if (pginfo->next_hwpage >= offs_hw + num_hw) {
-			(pginfo->u.phy.next_buf)++;
-			pginfo->next_hwpage = 0;
-		}
-	} else if (pginfo->type == EHCA_MR_PGI_USER) {
-		chunk = pginfo->u.usr.next_chunk;
-		prev_chunk = pginfo->u.usr.next_chunk;
-		list_for_each_entry_continue(chunk,
-					     (&(pginfo->u.usr.region->chunk_list)),
-					     list) {
-			pgaddr = ( page_to_pfn(chunk->page_list[
-						       pginfo->u.usr.next_nmap].page)
-				   << PAGE_SHIFT);
-			*rpage = phys_to_abs(pgaddr +
-					     (pginfo->next_hwpage * EHCA_PAGESIZE));
-			if ( !(*rpage) ) {
-				ehca_gen_err("pgaddr=%lx chunk->page_list[]=%lx"
-					     " next_nmap=%lx next_hwpage=%lx mr=%p",
-					     pgaddr, (u64)sg_dma_address(
-						     &chunk->page_list[
-							     pginfo->u.usr.
-							     next_nmap]),
-					     pginfo->u.usr.next_nmap, pginfo->next_hwpage,
-					     e_mr);
-				ret = -EFAULT;
-				goto ehca_set_pagebuf_1_exit0;
+	struct ib_phys_buf *pbuf;
+	u64 num_hw, offs_hw;
+	u32 i = 0;
+
+	/* loop over desired phys_buf_array entries */
+	while (i < number) {
+		pbuf   = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf;
+		num_hw  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) +
+				     pbuf->size, EHCA_PAGESIZE);
+		offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
+		while (pginfo->next_hwpage < offs_hw + num_hw) {
+			/* sanity check */
+			if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
+			    (pginfo->hwpage_cnt >= pginfo->num_hwpages)) {
+				ehca_gen_err("kpage_cnt >= num_kpages, "
+					     "kpage_cnt=%lx num_kpages=%lx "
+					     "hwpage_cnt=%lx "
+					     "num_hwpages=%lx i=%x",
+					     pginfo->kpage_cnt,
+					     pginfo->num_kpages,
+					     pginfo->hwpage_cnt,
+					     pginfo->num_hwpages, i);
+				return -EFAULT;
+			}
+			*kpage = phys_to_abs(
+				(pbuf->addr & EHCA_PAGEMASK)
+				+ (pginfo->next_hwpage * EHCA_PAGESIZE));
+			if ( !(*kpage) && pbuf->addr ) {
+				ehca_gen_err("pbuf->addr=%lx "
+					     "pbuf->size=%lx "
+					     "next_hwpage=%lx", pbuf->addr,
+					     pbuf->size,
+					     pginfo->next_hwpage);
+				return -EFAULT;
 			}
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
 			if (pginfo->next_hwpage %
-			    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
+			    (PAGE_SIZE / EHCA_PAGESIZE) == 0)
 				(pginfo->kpage_cnt)++;
-				(pginfo->u.usr.next_nmap)++;
-				pginfo->next_hwpage = 0;
-			}
-			if (pginfo->u.usr.next_nmap >= chunk->nmap) {
-				pginfo->u.usr.next_nmap = 0;
-				prev_chunk = chunk;
-			}
-			break;
+			kpage++;
+			i++;
+			if (i >= number) break;
 		}
-		pginfo->u.usr.next_chunk =
-			list_prepare_entry(prev_chunk,
-					   (&(pginfo->u.usr.region->chunk_list)),
-					   list);
-	} else if (pginfo->type == EHCA_MR_PGI_FMR) {
-		fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
-		*rpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
+		if (pginfo->next_hwpage >= offs_hw + num_hw) {
+			(pginfo->u.phy.next_buf)++;
+			pginfo->next_hwpage = 0;
+		}
+	}
+	return ret;
+}
+
+int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo,
+			 u32 number,
+			 u64 *kpage)
+{
+	int ret = 0;
+	u64 *fmrlist;
+	u32 i;
+
+	/* loop over desired page_list entries */
+	fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
+	for (i = 0; i < number; i++) {
+		*kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
 				     pginfo->next_hwpage * EHCA_PAGESIZE);
-		if ( !(*rpage) ) {
+		if ( !(*kpage) ) {
 			ehca_gen_err("*fmrlist=%lx fmrlist=%p "
 				     "next_listelem=%lx next_hwpage=%lx",
-				     *fmrlist, fmrlist, pginfo->u.fmr.next_listelem,
+				     *fmrlist, fmrlist,
+				     pginfo->u.fmr.next_listelem,
 				     pginfo->next_hwpage);
-			ret = -EFAULT;
-			goto ehca_set_pagebuf_1_exit0;
+			return -EFAULT;
 		}
 		(pginfo->hwpage_cnt)++;
 		(pginfo->next_hwpage)++;
+		kpage++;
 		if (pginfo->next_hwpage %
-		    (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) {
+		    (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) {
 			(pginfo->kpage_cnt)++;
 			(pginfo->u.fmr.next_listelem)++;
+			fmrlist++;
 			pginfo->next_hwpage = 0;
 		}
-	} else {
+	}
+	return ret;
+}
+
+/* setup page buffer from page info */
+int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo,
+		     u32 number,
+		     u64 *kpage)
+{
+	int ret;
+
+	switch (pginfo->type) {
+	case EHCA_MR_PGI_PHYS:
+		ret = ehca_set_pagebuf_phys(pginfo, number, kpage);
+		break;
+	case EHCA_MR_PGI_USER:
+		ret = ehca_set_pagebuf_user1(pginfo, number, kpage);
+		break;
+	case EHCA_MR_PGI_FMR:
+		ret = ehca_set_pagebuf_fmr(pginfo, number, kpage);
+		break;
+	default:
 		ehca_gen_err("bad pginfo->type=%x", pginfo->type);
 		ret = -EFAULT;
-		goto ehca_set_pagebuf_1_exit0;
+		break;
 	}
-
-ehca_set_pagebuf_1_exit0:
-	if (ret)
-		ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx "
-			     "num_hwpages=%lx next_buf=%lx next_hwpage=%lx rpage=%p "
-			     "kpage_cnt=%lx hwpage_cnt=%lx next_listelem=%lx "
-			     "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr,
-			     pginfo, pginfo->type, pginfo->num_kpages,
-			     pginfo->num_hwpages, pginfo->u.phy.next_buf, pginfo->next_hwpage,
-			     rpage, pginfo->kpage_cnt, pginfo->hwpage_cnt,
-			     pginfo->u.fmr.next_listelem, pginfo->u.usr.region,
-			     pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap);
 	return ret;
-} /* end ehca_set_pagebuf_1() */
+} /* end ehca_set_pagebuf() */
 
 /*----------------------------------------------------------------------*/
 
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:53:47 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:53:47 +0200
Subject: [ofa-general] [PATCH 09/10] IB/ehca: Fix warnings issued by
	checkpatch.pl
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121753.48434.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_av.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_classes.h         |    4 +-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |  156 ++++++++++----------
 drivers/infiniband/hw/ehca/ehca_cq.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_eq.c              |    3 +-
 drivers/infiniband/hw/ehca/ehca_hca.c             |   28 +++-
 drivers/infiniband/hw/ehca/ehca_irq.c             |   56 ++++----
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |    7 +-
 drivers/infiniband/hw/ehca/ehca_main.c            |   21 ++--
 drivers/infiniband/hw/ehca/ehca_mrmw.c            |   59 ++++----
 drivers/infiniband/hw/ehca/ehca_mrmw.h            |    7 +-
 drivers/infiniband/hw/ehca/ehca_qes.h             |   22 ++--
 drivers/infiniband/hw/ehca/ehca_qp.c              |   39 +++---
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   15 ++-
 drivers/infiniband/hw/ehca/ehca_tools.h           |   28 ++--
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |   10 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |    8 +-
 drivers/infiniband/hw/ehca/hcp_phyp.c             |    2 +-
 drivers/infiniband/hw/ehca/hipz_fns_core.h        |    4 +-
 drivers/infiniband/hw/ehca/hipz_hw.h              |   24 ++--
 drivers/infiniband/hw/ehca/ipz_pt_fn.c            |    2 +-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h            |    4 +-
 22 files changed, 261 insertions(+), 242 deletions(-)
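
Note for readers of the archive: the hunks below are style-only. As a rough
illustration of two patterns that recur in them (the '*' of a pointer
declarator attached to the name rather than the type, and an assignment moved
out of an if condition onto its own line), here is a minimal standalone
sketch. The names demo_queue, demo_get_page and demo_count_pages are
hypothetical stand-ins invented for this note and are not part of the ehca
driver.

```c
#include <stddef.h>

/* Hypothetical stand-in type, not taken from the ehca driver. */
struct demo_queue {
	int pages_left;
	int page;
};

/* Returns a pointer to the next page, or NULL once the queue is exhausted. */
static int *demo_get_page(struct demo_queue *q)
{
	if (q->pages_left <= 0)
		return NULL;
	q->pages_left--;
	q->page++;
	return &q->page;
}

/*
 * Counts the pages still available in the queue.
 * Style points shown: "int *vpage" rather than "int* vpage", and
 * "vpage = ...; if (!vpage)" rather than "if (!(vpage = ...))".
 */
static int demo_count_pages(struct demo_queue *q)
{
	int *vpage;
	int count = 0;

	for (;;) {
		vpage = demo_get_page(q);
		if (!vpage)
			break;
		count++;
	}
	return count;
}
```

Behaviour is the same in either spelling; splitting the assignment from the
test only makes the side effect easier to spot when reading the condition.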

diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
index 3cd6bf3..e53a97a 100644
--- a/drivers/infiniband/hw/ehca/ehca_av.c
+++ b/drivers/infiniband/hw/ehca/ehca_av.c
@@ -79,7 +79,7 @@ struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 			av->av.ipd = (ah_mult > 0) ?
 				((ehca_mult - 1) / ah_mult) : 0;
 	} else
-	        av->av.ipd = ehca_static_rate;
+		av->av.ipd = ehca_static_rate;
 
 	av->av.lnh = ah_attr->ah_flags;
 	av->av.grh.word_0 = EHCA_BMASK_SET(GRH_IPVERSION_MASK, 6);
diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 92103df..1752821 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -215,7 +215,7 @@ struct ehca_mr {
 	u32 num_hwpages;	/* number of hw pages to form MR */
 	int acl;		/* ACL (stored here for usage in reregister) */
 	u64 *start;		/* virtual start address (stored here for */
-	                        /* usage in reregister) */
+				/* usage in reregister) */
 	u64 size;		/* size (stored here for usage in reregister) */
 	u32 fmr_page_size;	/* page size for FMR */
 	u32 fmr_max_pages;	/* max pages for FMR */
@@ -400,6 +400,6 @@ struct ehca_alloc_qp_parms {
 
 int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp);
 int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int qp_num);
-struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int qp_num);
+struct ehca_qp *ehca_cq_get_qp(struct ehca_cq *cq, int qp_num);
 
 #endif
diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
index fb3df5c..1798e64 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
@@ -154,83 +154,83 @@ struct hcp_modify_qp_control_block {
 	u32 reserved_70_127[58];       /* 70 */
 };
 
-#define MQPCB_MASK_QKEY                         EHCA_BMASK_IBM(0,0)
-#define MQPCB_MASK_SEND_PSN                     EHCA_BMASK_IBM(2,2)
-#define MQPCB_MASK_RECEIVE_PSN                  EHCA_BMASK_IBM(3,3)
-#define MQPCB_MASK_PRIM_PHYS_PORT               EHCA_BMASK_IBM(4,4)
-#define MQPCB_PRIM_PHYS_PORT                    EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_ALT_PHYS_PORT                EHCA_BMASK_IBM(5,5)
-#define MQPCB_MASK_PRIM_P_KEY_IDX               EHCA_BMASK_IBM(6,6)
-#define MQPCB_PRIM_P_KEY_IDX                    EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_ALT_P_KEY_IDX                EHCA_BMASK_IBM(7,7)
-#define MQPCB_MASK_RDMA_ATOMIC_CTRL             EHCA_BMASK_IBM(8,8)
-#define MQPCB_MASK_QP_STATE                     EHCA_BMASK_IBM(9,9)
-#define MQPCB_QP_STATE                          EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES      EHCA_BMASK_IBM(11,11)
-#define MQPCB_MASK_PATH_MIGRATION_STATE         EHCA_BMASK_IBM(12,12)
-#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP    EHCA_BMASK_IBM(13,13)
-#define MQPCB_MASK_DEST_QP_NR                   EHCA_BMASK_IBM(14,14)
-#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD      EHCA_BMASK_IBM(15,15)
-#define MQPCB_MASK_SERVICE_LEVEL                EHCA_BMASK_IBM(16,16)
-#define MQPCB_MASK_SEND_GRH_FLAG                EHCA_BMASK_IBM(17,17)
-#define MQPCB_MASK_RETRY_COUNT                  EHCA_BMASK_IBM(18,18)
-#define MQPCB_MASK_TIMEOUT                      EHCA_BMASK_IBM(19,19)
-#define MQPCB_MASK_PATH_MTU                     EHCA_BMASK_IBM(20,20)
-#define MQPCB_PATH_MTU                          EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_MAX_STATIC_RATE              EHCA_BMASK_IBM(21,21)
-#define MQPCB_MAX_STATIC_RATE                   EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_DLID                         EHCA_BMASK_IBM(22,22)
-#define MQPCB_DLID                              EHCA_BMASK_IBM(16,31)
-#define MQPCB_MASK_RNR_RETRY_COUNT              EHCA_BMASK_IBM(23,23)
-#define MQPCB_RNR_RETRY_COUNT                   EHCA_BMASK_IBM(29,31)
-#define MQPCB_MASK_SOURCE_PATH_BITS             EHCA_BMASK_IBM(24,24)
-#define MQPCB_SOURCE_PATH_BITS                  EHCA_BMASK_IBM(25,31)
-#define MQPCB_MASK_TRAFFIC_CLASS                EHCA_BMASK_IBM(25,25)
-#define MQPCB_TRAFFIC_CLASS                     EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_HOP_LIMIT                    EHCA_BMASK_IBM(26,26)
-#define MQPCB_HOP_LIMIT                         EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_SOURCE_GID_IDX               EHCA_BMASK_IBM(27,27)
-#define MQPCB_SOURCE_GID_IDX                    EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_FLOW_LABEL                   EHCA_BMASK_IBM(28,28)
-#define MQPCB_FLOW_LABEL                        EHCA_BMASK_IBM(12,31)
-#define MQPCB_MASK_DEST_GID                     EHCA_BMASK_IBM(30,30)
-#define MQPCB_MASK_SERVICE_LEVEL_AL             EHCA_BMASK_IBM(31,31)
-#define MQPCB_SERVICE_LEVEL_AL                  EHCA_BMASK_IBM(28,31)
-#define MQPCB_MASK_SEND_GRH_FLAG_AL             EHCA_BMASK_IBM(32,32)
-#define MQPCB_SEND_GRH_FLAG_AL                  EHCA_BMASK_IBM(31,31)
-#define MQPCB_MASK_RETRY_COUNT_AL               EHCA_BMASK_IBM(33,33)
-#define MQPCB_RETRY_COUNT_AL                    EHCA_BMASK_IBM(29,31)
-#define MQPCB_MASK_TIMEOUT_AL                   EHCA_BMASK_IBM(34,34)
-#define MQPCB_TIMEOUT_AL                        EHCA_BMASK_IBM(27,31)
-#define MQPCB_MASK_MAX_STATIC_RATE_AL           EHCA_BMASK_IBM(35,35)
-#define MQPCB_MAX_STATIC_RATE_AL                EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_DLID_AL                      EHCA_BMASK_IBM(36,36)
-#define MQPCB_DLID_AL                           EHCA_BMASK_IBM(16,31)
-#define MQPCB_MASK_RNR_RETRY_COUNT_AL           EHCA_BMASK_IBM(37,37)
-#define MQPCB_RNR_RETRY_COUNT_AL                EHCA_BMASK_IBM(29,31)
-#define MQPCB_MASK_SOURCE_PATH_BITS_AL          EHCA_BMASK_IBM(38,38)
-#define MQPCB_SOURCE_PATH_BITS_AL               EHCA_BMASK_IBM(25,31)
-#define MQPCB_MASK_TRAFFIC_CLASS_AL             EHCA_BMASK_IBM(39,39)
-#define MQPCB_TRAFFIC_CLASS_AL                  EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_HOP_LIMIT_AL                 EHCA_BMASK_IBM(40,40)
-#define MQPCB_HOP_LIMIT_AL                      EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_SOURCE_GID_IDX_AL            EHCA_BMASK_IBM(41,41)
-#define MQPCB_SOURCE_GID_IDX_AL                 EHCA_BMASK_IBM(24,31)
-#define MQPCB_MASK_FLOW_LABEL_AL                EHCA_BMASK_IBM(42,42)
-#define MQPCB_FLOW_LABEL_AL                     EHCA_BMASK_IBM(12,31)
-#define MQPCB_MASK_DEST_GID_AL                  EHCA_BMASK_IBM(44,44)
-#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR         EHCA_BMASK_IBM(45,45)
-#define MQPCB_MAX_NR_OUTST_SEND_WR              EHCA_BMASK_IBM(16,31)
-#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR         EHCA_BMASK_IBM(46,46)
-#define MQPCB_MAX_NR_OUTST_RECV_WR              EHCA_BMASK_IBM(16,31)
-#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK     EHCA_BMASK_IBM(47,47)
-#define MQPCB_DISABLE_ETE_CREDIT_CHECK          EHCA_BMASK_IBM(31,31)
-#define MQPCB_QP_NUMBER                         EHCA_BMASK_IBM(8,31)
-#define MQPCB_MASK_QP_ENABLE                    EHCA_BMASK_IBM(48,48)
-#define MQPCB_QP_ENABLE                         EHCA_BMASK_IBM(31,31)
-#define MQPCB_MASK_CURR_SRQ_LIMIT               EHCA_BMASK_IBM(49,49)
-#define MQPCB_CURR_SRQ_LIMIT                    EHCA_BMASK_IBM(16,31)
-#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG       EHCA_BMASK_IBM(50,50)
-#define MQPCB_MASK_SHARED_RQ_HNDL               EHCA_BMASK_IBM(51,51)
+#define MQPCB_MASK_QKEY                         EHCA_BMASK_IBM( 0,  0)
+#define MQPCB_MASK_SEND_PSN                     EHCA_BMASK_IBM( 2,  2)
+#define MQPCB_MASK_RECEIVE_PSN                  EHCA_BMASK_IBM( 3,  3)
+#define MQPCB_MASK_PRIM_PHYS_PORT               EHCA_BMASK_IBM( 4,  4)
+#define MQPCB_PRIM_PHYS_PORT                    EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_ALT_PHYS_PORT                EHCA_BMASK_IBM( 5,  5)
+#define MQPCB_MASK_PRIM_P_KEY_IDX               EHCA_BMASK_IBM( 6,  6)
+#define MQPCB_PRIM_P_KEY_IDX                    EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_ALT_P_KEY_IDX                EHCA_BMASK_IBM( 7,  7)
+#define MQPCB_MASK_RDMA_ATOMIC_CTRL             EHCA_BMASK_IBM( 8,  8)
+#define MQPCB_MASK_QP_STATE                     EHCA_BMASK_IBM( 9,  9)
+#define MQPCB_QP_STATE                          EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES      EHCA_BMASK_IBM(11, 11)
+#define MQPCB_MASK_PATH_MIGRATION_STATE         EHCA_BMASK_IBM(12, 12)
+#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP    EHCA_BMASK_IBM(13, 13)
+#define MQPCB_MASK_DEST_QP_NR                   EHCA_BMASK_IBM(14, 14)
+#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD      EHCA_BMASK_IBM(15, 15)
+#define MQPCB_MASK_SERVICE_LEVEL                EHCA_BMASK_IBM(16, 16)
+#define MQPCB_MASK_SEND_GRH_FLAG                EHCA_BMASK_IBM(17, 17)
+#define MQPCB_MASK_RETRY_COUNT                  EHCA_BMASK_IBM(18, 18)
+#define MQPCB_MASK_TIMEOUT                      EHCA_BMASK_IBM(19, 19)
+#define MQPCB_MASK_PATH_MTU                     EHCA_BMASK_IBM(20, 20)
+#define MQPCB_PATH_MTU                          EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_MAX_STATIC_RATE              EHCA_BMASK_IBM(21, 21)
+#define MQPCB_MAX_STATIC_RATE                   EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_DLID                         EHCA_BMASK_IBM(22, 22)
+#define MQPCB_DLID                              EHCA_BMASK_IBM(16, 31)
+#define MQPCB_MASK_RNR_RETRY_COUNT              EHCA_BMASK_IBM(23, 23)
+#define MQPCB_RNR_RETRY_COUNT                   EHCA_BMASK_IBM(29, 31)
+#define MQPCB_MASK_SOURCE_PATH_BITS             EHCA_BMASK_IBM(24, 24)
+#define MQPCB_SOURCE_PATH_BITS                  EHCA_BMASK_IBM(25, 31)
+#define MQPCB_MASK_TRAFFIC_CLASS                EHCA_BMASK_IBM(25, 25)
+#define MQPCB_TRAFFIC_CLASS                     EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_HOP_LIMIT                    EHCA_BMASK_IBM(26, 26)
+#define MQPCB_HOP_LIMIT                         EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_SOURCE_GID_IDX               EHCA_BMASK_IBM(27, 27)
+#define MQPCB_SOURCE_GID_IDX                    EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_FLOW_LABEL                   EHCA_BMASK_IBM(28, 28)
+#define MQPCB_FLOW_LABEL                        EHCA_BMASK_IBM(12, 31)
+#define MQPCB_MASK_DEST_GID                     EHCA_BMASK_IBM(30, 30)
+#define MQPCB_MASK_SERVICE_LEVEL_AL             EHCA_BMASK_IBM(31, 31)
+#define MQPCB_SERVICE_LEVEL_AL                  EHCA_BMASK_IBM(28, 31)
+#define MQPCB_MASK_SEND_GRH_FLAG_AL             EHCA_BMASK_IBM(32, 32)
+#define MQPCB_SEND_GRH_FLAG_AL                  EHCA_BMASK_IBM(31, 31)
+#define MQPCB_MASK_RETRY_COUNT_AL               EHCA_BMASK_IBM(33, 33)
+#define MQPCB_RETRY_COUNT_AL                    EHCA_BMASK_IBM(29, 31)
+#define MQPCB_MASK_TIMEOUT_AL                   EHCA_BMASK_IBM(34, 34)
+#define MQPCB_TIMEOUT_AL                        EHCA_BMASK_IBM(27, 31)
+#define MQPCB_MASK_MAX_STATIC_RATE_AL           EHCA_BMASK_IBM(35, 35)
+#define MQPCB_MAX_STATIC_RATE_AL                EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_DLID_AL                      EHCA_BMASK_IBM(36, 36)
+#define MQPCB_DLID_AL                           EHCA_BMASK_IBM(16, 31)
+#define MQPCB_MASK_RNR_RETRY_COUNT_AL           EHCA_BMASK_IBM(37, 37)
+#define MQPCB_RNR_RETRY_COUNT_AL                EHCA_BMASK_IBM(29, 31)
+#define MQPCB_MASK_SOURCE_PATH_BITS_AL          EHCA_BMASK_IBM(38, 38)
+#define MQPCB_SOURCE_PATH_BITS_AL               EHCA_BMASK_IBM(25, 31)
+#define MQPCB_MASK_TRAFFIC_CLASS_AL             EHCA_BMASK_IBM(39, 39)
+#define MQPCB_TRAFFIC_CLASS_AL                  EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_HOP_LIMIT_AL                 EHCA_BMASK_IBM(40, 40)
+#define MQPCB_HOP_LIMIT_AL                      EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_SOURCE_GID_IDX_AL            EHCA_BMASK_IBM(41, 41)
+#define MQPCB_SOURCE_GID_IDX_AL                 EHCA_BMASK_IBM(24, 31)
+#define MQPCB_MASK_FLOW_LABEL_AL                EHCA_BMASK_IBM(42, 42)
+#define MQPCB_FLOW_LABEL_AL                     EHCA_BMASK_IBM(12, 31)
+#define MQPCB_MASK_DEST_GID_AL                  EHCA_BMASK_IBM(44, 44)
+#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR         EHCA_BMASK_IBM(45, 45)
+#define MQPCB_MAX_NR_OUTST_SEND_WR              EHCA_BMASK_IBM(16, 31)
+#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR         EHCA_BMASK_IBM(46, 46)
+#define MQPCB_MAX_NR_OUTST_RECV_WR              EHCA_BMASK_IBM(16, 31)
+#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK     EHCA_BMASK_IBM(47, 47)
+#define MQPCB_DISABLE_ETE_CREDIT_CHECK          EHCA_BMASK_IBM(31, 31)
+#define MQPCB_QP_NUMBER                         EHCA_BMASK_IBM( 8, 31)
+#define MQPCB_MASK_QP_ENABLE                    EHCA_BMASK_IBM(48, 48)
+#define MQPCB_QP_ENABLE                         EHCA_BMASK_IBM(31, 31)
+#define MQPCB_MASK_CURR_SRQ_LIMIT               EHCA_BMASK_IBM(49, 49)
+#define MQPCB_CURR_SRQ_LIMIT                    EHCA_BMASK_IBM(16, 31)
+#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG       EHCA_BMASK_IBM(50, 50)
+#define MQPCB_MASK_SHARED_RQ_HNDL               EHCA_BMASK_IBM(51, 51)
 
 #endif /* __EHCA_CLASSES_PSERIES_H__ */
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 97da51e..ba1bcb9 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -97,7 +97,7 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num)
 	return ret;
 }
 
-struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num)
+struct ehca_qp *ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num)
 {
 	struct ehca_qp *ret = NULL;
 	unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1);
diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c
index d443bcb..78a2e5a 100644
--- a/drivers/infiniband/hw/ehca/ehca_eq.c
+++ b/drivers/infiniband/hw/ehca/ehca_eq.c
@@ -111,7 +111,8 @@ struct ehca_eq *ehca_create_eq(struct ehca_shca *shca,
 	for (i = 0; i < nr_pages; i++) {
 		u64 rpage;
 
-		if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) {
+		vpage = ipz_qpageit_get_inc(&eq->ipz_queue);
+		if (!vpage) {
 			ret = -ENOMEM;
 			goto create_eq_exit2;
 		}
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index bbd3c6a..fc19ef9 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -127,6 +127,7 @@ int ehca_query_port(struct ib_device *ibdev,
 		    u8 port, struct ib_port_attr *props)
 {
 	int ret = 0;
+	u64 h_ret;
 	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca,
 					      ib_device);
 	struct hipz_query_port *rblock;
@@ -137,7 +138,8 @@ int ehca_query_port(struct ib_device *ibdev,
 		return -ENOMEM;
 	}
 
-	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock);
+	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "Can't query port properties");
 		ret = -EINVAL;
 		goto query_port1;
@@ -197,6 +199,7 @@ int ehca_query_sma_attr(struct ehca_shca *shca,
 			u8 port, struct ehca_sma_attr *attr)
 {
 	int ret = 0;
+	u64 h_ret;
 	struct hipz_query_port *rblock;
 
 	rblock = ehca_alloc_fw_ctrlblock(GFP_ATOMIC);
@@ -205,7 +208,8 @@ int ehca_query_sma_attr(struct ehca_shca *shca,
 		return -ENOMEM;
 	}
 
-	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock);
+	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "Can't query port properties");
 		ret = -EINVAL;
 		goto query_sma_attr1;
@@ -230,9 +234,11 @@ query_sma_attr1:
 int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey)
 {
 	int ret = 0;
-	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device);
+	u64 h_ret;
+	struct ehca_shca *shca;
 	struct hipz_query_port *rblock;
 
+	shca = container_of(ibdev, struct ehca_shca, ib_device);
 	if (index > 16) {
 		ehca_err(&shca->ib_device, "Invalid index: %x.", index);
 		return -EINVAL;
@@ -244,7 +250,8 @@ int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey)
 		return -ENOMEM;
 	}
 
-	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock);
+	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "Can't query port properties");
 		ret = -EINVAL;
 		goto query_pkey1;
@@ -262,6 +269,7 @@ int ehca_query_gid(struct ib_device *ibdev, u8 port,
 		   int index, union ib_gid *gid)
 {
 	int ret = 0;
+	u64 h_ret;
 	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca,
 					      ib_device);
 	struct hipz_query_port *rblock;
@@ -277,7 +285,8 @@ int ehca_query_gid(struct ib_device *ibdev, u8 port,
 		return -ENOMEM;
 	}
 
-	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock);
+	if (h_ret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "Can't query port properties");
 		ret = -EINVAL;
 		goto query_gid1;
@@ -302,11 +311,12 @@ int ehca_modify_port(struct ib_device *ibdev,
 		     struct ib_port_modify *props)
 {
 	int ret = 0;
-	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device);
+	struct ehca_shca *shca;
 	struct hipz_query_port *rblock;
 	u32 cap;
 	u64 hret;
 
+	shca = container_of(ibdev, struct ehca_shca, ib_device);
 	if ((props->set_port_cap_mask | props->clr_port_cap_mask)
 	    & ~allowed_port_caps) {
 		ehca_err(&shca->ib_device, "Non-changeable bits set in masks  "
@@ -325,7 +335,8 @@ int ehca_modify_port(struct ib_device *ibdev,
 		goto modify_port1;
 	}
 
-	if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) {
+	hret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock);
+	if (hret != H_SUCCESS) {
 		ehca_err(&shca->ib_device, "Can't query port properties");
 		ret = -EINVAL;
 		goto modify_port2;
@@ -337,7 +348,8 @@ int ehca_modify_port(struct ib_device *ibdev,
 	hret = hipz_h_modify_port(shca->ipz_hca_handle, port,
 				  cap, props->init_type, port_modify_mask);
 	if (hret != H_SUCCESS) {
-		ehca_err(&shca->ib_device, "Modify port failed  hret=%lx", hret);
+		ehca_err(&shca->ib_device, "Modify port failed  hret=%lx",
+			 hret);
 		ret = -EINVAL;
 	}
 
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 7a4071a..1f043d0 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -49,26 +49,26 @@
 #include "hipz_fns.h"
 #include "ipz_pt_fn.h"
 
-#define EQE_COMPLETION_EVENT   EHCA_BMASK_IBM(1,1)
-#define EQE_CQ_QP_NUMBER       EHCA_BMASK_IBM(8,31)
-#define EQE_EE_IDENTIFIER      EHCA_BMASK_IBM(2,7)
-#define EQE_CQ_NUMBER          EHCA_BMASK_IBM(8,31)
-#define EQE_QP_NUMBER          EHCA_BMASK_IBM(8,31)
-#define EQE_QP_TOKEN           EHCA_BMASK_IBM(32,63)
-#define EQE_CQ_TOKEN           EHCA_BMASK_IBM(32,63)
-
-#define NEQE_COMPLETION_EVENT  EHCA_BMASK_IBM(1,1)
-#define NEQE_EVENT_CODE        EHCA_BMASK_IBM(2,7)
-#define NEQE_PORT_NUMBER       EHCA_BMASK_IBM(8,15)
-#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16)
-#define NEQE_DISRUPTIVE        EHCA_BMASK_IBM(16,16)
-
-#define ERROR_DATA_LENGTH      EHCA_BMASK_IBM(52,63)
-#define ERROR_DATA_TYPE        EHCA_BMASK_IBM(0,7)
+#define EQE_COMPLETION_EVENT   EHCA_BMASK_IBM( 1,  1)
+#define EQE_CQ_QP_NUMBER       EHCA_BMASK_IBM( 8, 31)
+#define EQE_EE_IDENTIFIER      EHCA_BMASK_IBM( 2,  7)
+#define EQE_CQ_NUMBER          EHCA_BMASK_IBM( 8, 31)
+#define EQE_QP_NUMBER          EHCA_BMASK_IBM( 8, 31)
+#define EQE_QP_TOKEN           EHCA_BMASK_IBM(32, 63)
+#define EQE_CQ_TOKEN           EHCA_BMASK_IBM(32, 63)
+
+#define NEQE_COMPLETION_EVENT  EHCA_BMASK_IBM( 1,  1)
+#define NEQE_EVENT_CODE        EHCA_BMASK_IBM( 2,  7)
+#define NEQE_PORT_NUMBER       EHCA_BMASK_IBM( 8, 15)
+#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16, 16)
+#define NEQE_DISRUPTIVE        EHCA_BMASK_IBM(16, 16)
+
+#define ERROR_DATA_LENGTH      EHCA_BMASK_IBM(52, 63)
+#define ERROR_DATA_TYPE        EHCA_BMASK_IBM( 0,  7)
 
 static void queue_comp_task(struct ehca_cq *__cq);
 
-static struct ehca_comp_pool* pool;
+static struct ehca_comp_pool *pool;
 #ifdef CONFIG_HOTPLUG_CPU
 static struct notifier_block comp_pool_callback_nb;
 #endif
@@ -85,8 +85,8 @@ static inline void comp_event_callback(struct ehca_cq *cq)
 	return;
 }
 
-static void print_error_data(struct ehca_shca * shca, void* data,
-			     u64* rblock, int length)
+static void print_error_data(struct ehca_shca *shca, void *data,
+			     u64 *rblock, int length)
 {
 	u64 type = EHCA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]);
 	u64 resource = rblock[1];
@@ -94,7 +94,7 @@ static void print_error_data(struct ehca_shca * shca, void* data,
 	switch (type) {
 	case 0x1: /* Queue Pair */
 	{
-		struct ehca_qp *qp = (struct ehca_qp*)data;
+		struct ehca_qp *qp = (struct ehca_qp *)data;
 
 		/* only print error data if AER is set */
 		if (rblock[6] == 0)
@@ -107,7 +107,7 @@ static void print_error_data(struct ehca_shca * shca, void* data,
 	}
 	case 0x4: /* Completion Queue */
 	{
-		struct ehca_cq *cq = (struct ehca_cq*)data;
+		struct ehca_cq *cq = (struct ehca_cq *)data;
 
 		ehca_err(&shca->ib_device,
 			 "CQ 0x%x (resource=%lx) has errors.",
@@ -564,7 +564,7 @@ void ehca_tasklet_eq(unsigned long data)
 	ehca_process_eq((struct ehca_eq *)data, 1);
 }
 
-static inline int find_next_online_cpu(struct ehca_comp_pool* pool)
+static inline int find_next_online_cpu(struct ehca_comp_pool *pool)
 {
 	int cpu;
 	unsigned long flags;
@@ -628,7 +628,7 @@ static void queue_comp_task(struct ehca_cq *__cq)
 	__queue_comp_task(__cq, cct);
 }
 
-static void run_comp_task(struct ehca_cpu_comp_task* cct)
+static void run_comp_task(struct ehca_cpu_comp_task *cct)
 {
 	struct ehca_cq *cq;
 	unsigned long flags;
@@ -658,12 +658,12 @@ static void run_comp_task(struct ehca_cpu_comp_task* cct)
 
 static int comp_task(void *__cct)
 {
-	struct ehca_cpu_comp_task* cct = __cct;
+	struct ehca_cpu_comp_task *cct = __cct;
 	int cql_empty;
 	DECLARE_WAITQUEUE(wait, current);
 
 	set_current_state(TASK_INTERRUPTIBLE);
-	while(!kthread_should_stop()) {
+	while (!kthread_should_stop()) {
 		add_wait_queue(&cct->wait_queue, &wait);
 
 		spin_lock_irq(&cct->task_lock);
@@ -737,7 +737,7 @@ static void take_over_work(struct ehca_comp_pool *pool,
 
 	list_splice_init(&cct->cq_list, &list);
 
-	while(!list_empty(&list)) {
+	while (!list_empty(&list)) {
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 
 		list_del(&cq->entry);
@@ -760,7 +760,7 @@ static int comp_pool_callback(struct notifier_block *nfb,
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
 		ehca_gen_dbg("CPU: %x (CPU_PREPARE)", cpu);
-		if(!create_comp_task(pool, cpu)) {
+		if (!create_comp_task(pool, cpu)) {
 			ehca_gen_err("Can't create comp_task for cpu: %x", cpu);
 			return NOTIFY_BAD;
 		}
@@ -830,7 +830,7 @@ int ehca_create_comp_pool(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 	comp_pool_callback_nb.notifier_call = comp_pool_callback;
-	comp_pool_callback_nb.priority =0;
+	comp_pool_callback_nb.priority = 0;
 	register_cpu_notifier(&comp_pool_callback_nb);
 #endif
 
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index bf8fbf7..99881e3 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -81,8 +81,9 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 			       int num_phys_buf,
 			       int mr_access_flags, u64 *iova_start);
 
-struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt,
-			       int mr_access_flags, struct ib_udata *udata);
+struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+			       u64 virt, int mr_access_flags,
+			       struct ib_udata *udata);
 
 int ehca_rereg_phys_mr(struct ib_mr *mr,
 		       int mr_rereg_mask,
@@ -191,7 +192,7 @@ void ehca_poll_eqs(unsigned long data);
 void *ehca_alloc_fw_ctrlblock(gfp_t flags);
 void ehca_free_fw_ctrlblock(void *ptr);
 #else
-#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags))
+#define ehca_alloc_fw_ctrlblock(flags) ((void *)get_zeroed_page(flags))
 #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr))
 #endif
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 57c551e..ecf4ef4 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -116,7 +116,7 @@ static DEFINE_SPINLOCK(shca_list_lock);
 static struct timer_list poll_eqs_timer;
 
 #ifdef CONFIG_PPC_64K_PAGES
-static struct kmem_cache *ctblk_cache = NULL;
+static struct kmem_cache *ctblk_cache;
 
 void *ehca_alloc_fw_ctrlblock(gfp_t flags)
 {
@@ -219,8 +219,8 @@ static void ehca_destroy_slab_caches(void)
 #endif
 }
 
-#define EHCA_HCAAVER  EHCA_BMASK_IBM(32,39)
-#define EHCA_REVID    EHCA_BMASK_IBM(40,63)
+#define EHCA_HCAAVER  EHCA_BMASK_IBM(32, 39)
+#define EHCA_REVID    EHCA_BMASK_IBM(40, 63)
 
 static struct cap_descr {
 	u64 mask;
@@ -314,7 +314,7 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
 			ehca_gen_dbg("   %s", hca_cap_descr[i].descr);
 
-	port = (struct hipz_query_port *) rblock;
+	port = (struct hipz_query_port *)rblock;
 	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
 	if (h_ret != H_SUCCESS) {
 		ehca_gen_err("Cannot query port properties. h_ret=%lx",
@@ -463,7 +463,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port)
 		return -EPERM;
 	}
 
-	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0);
+	ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void *)(-1), 10, 0);
 	if (IS_ERR(ibcq)) {
 		ehca_err(&shca->ib_device, "Cannot create AQP1 CQ.");
 		return PTR_ERR(ibcq);
@@ -730,7 +730,7 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev,
 	}
 
 	/* create internal protection domain */
-	ibpd = ehca_alloc_pd(&shca->ib_device, (void*)(-1), NULL);
+	ibpd = ehca_alloc_pd(&shca->ib_device, (void *)(-1), NULL);
 	if (IS_ERR(ibpd)) {
 		ehca_err(&shca->ib_device, "Cannot create internal PD.");
 		ret = PTR_ERR(ibpd);
@@ -944,18 +944,21 @@ int __init ehca_module_init(void)
 		return -EINVAL;
 	}
 
-	if ((ret = ehca_create_comp_pool())) {
+	ret = ehca_create_comp_pool();
+	if (ret) {
 		ehca_gen_err("Cannot create comp pool.");
 		return ret;
 	}
 
-	if ((ret = ehca_create_slab_caches())) {
+	ret = ehca_create_slab_caches();
+	if (ret) {
 		ehca_gen_err("Cannot create SLAB caches");
 		ret = -ENOMEM;
 		goto module_init1;
 	}
 
-	if ((ret = ibmebus_register_driver(&ehca_driver))) {
+	ret = ibmebus_register_driver(&ehca_driver);
+	if (ret) {
 		ehca_gen_err("Cannot register eHCA device driver");
 		ret = -EINVAL;
 		goto module_init2;
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 93c26cc..6262c54 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -61,9 +61,9 @@ static struct ehca_mr *ehca_mr_new(void)
 	struct ehca_mr *me;
 
 	me = kmem_cache_zalloc(mr_cache, GFP_KERNEL);
-	if (me) {
+	if (me)
 		spin_lock_init(&me->mrlock);
-	} else
+	else
 		ehca_gen_err("alloc failed");
 
 	return me;
@@ -79,9 +79,9 @@ static struct ehca_mw *ehca_mw_new(void)
 	struct ehca_mw *me;
 
 	me = kmem_cache_zalloc(mw_cache, GFP_KERNEL);
-	if (me) {
+	if (me)
 		spin_lock_init(&me->mwlock);
-	} else
+	else
 		ehca_gen_err("alloc failed");
 
 	return me;
@@ -111,7 +111,7 @@ struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags)
 			goto get_dma_mr_exit0;
 		}
 
-		ret = ehca_reg_maxmr(shca, e_maxmr, (u64*)KERNELBASE,
+		ret = ehca_reg_maxmr(shca, e_maxmr, (u64 *)KERNELBASE,
 				     mr_access_flags, e_pd,
 				     &e_maxmr->ib.ib_mr.lkey,
 				     &e_maxmr->ib.ib_mr.rkey);
@@ -246,8 +246,9 @@ reg_phys_mr_exit0:
 
 /*----------------------------------------------------------------------*/
 
-struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt,
-			       int mr_access_flags, struct ib_udata *udata)
+struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+			       u64 virt, int mr_access_flags,
+			       struct ib_udata *udata)
 {
 	struct ib_mr *ib_mr;
 	struct ehca_mr *e_mr;
@@ -295,7 +296,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt
 	e_mr->umem = ib_umem_get(pd->uobject->context, start, length,
 				 mr_access_flags);
 	if (IS_ERR(e_mr->umem)) {
-		ib_mr = (void *) e_mr->umem;
+		ib_mr = (void *)e_mr->umem;
 		goto reg_user_mr_exit1;
 	}
 
@@ -322,8 +323,9 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt
 						     (&e_mr->umem->chunk_list),
 						     list);
 
-	ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd,
-			  &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey);
+	ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
+			  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
+			  &e_mr->ib.ib_mr.rkey);
 	if (ret) {
 		ib_mr = ERR_PTR(ret);
 		goto reg_user_mr_exit2;
@@ -420,7 +422,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 			goto rereg_phys_mr_exit0;
 		}
 		if (!phys_buf_array || num_phys_buf <= 0) {
-			ehca_err(mr->device, "bad input values: mr_rereg_mask=%x"
+			ehca_err(mr->device, "bad input values mr_rereg_mask=%x"
 				 " phys_buf_array=%p num_phys_buf=%x",
 				 mr_rereg_mask, phys_buf_array, num_phys_buf);
 			ret = -EINVAL;
@@ -444,10 +446,10 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 
 	/* set requested values dependent on rereg request */
 	spin_lock_irqsave(&e_mr->mrlock, sl_flags);
-	new_start = e_mr->start;  /* new == old address */
-	new_size  = e_mr->size;	  /* new == old length */
-	new_acl   = e_mr->acl;	  /* new == old access control */
-	new_pd    = container_of(mr->pd,struct ehca_pd,ib_pd); /*new == old PD*/
+	new_start = e_mr->start;
+	new_size = e_mr->size;
+	new_acl = e_mr->acl;
+	new_pd = container_of(mr->pd, struct ehca_pd, ib_pd);
 
 	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
 		new_start = iova_start;	/* change address */
@@ -517,7 +519,7 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr)
 	struct ehca_pd *my_pd = container_of(mr->pd, struct ehca_pd, ib_pd);
 	u32 cur_pid = current->tgid;
 	unsigned long sl_flags;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
 	    (my_pd->ownpid != cur_pid)) {
@@ -629,7 +631,7 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd)
 	struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd);
 	struct ehca_shca *shca =
 		container_of(pd->device, struct ehca_shca, ib_device);
-	struct ehca_mw_hipzout_parms hipzout = {{0},0};
+	struct ehca_mw_hipzout_parms hipzout;
 
 	e_mw = ehca_mw_new();
 	if (!e_mw) {
@@ -826,7 +828,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 			      EHCA_PAGESIZE);
 	pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size;
 
-	ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova,
+	ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova,
 			    list_len * e_fmr->fmr_page_size,
 			    e_fmr->acl, e_pd, &pginfo, &tmp_lkey, &tmp_rkey);
 	if (ret)
@@ -841,8 +843,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 map_phys_fmr_exit0:
 	if (ret)
 		ehca_err(fmr->device, "ret=%x fmr=%p page_list=%p list_len=%x "
-			 "iova=%lx",
-			 ret, fmr, page_list, list_len, iova);
+			 "iova=%lx", ret, fmr, page_list, list_len, iova);
 	return ret;
 } /* end ehca_map_phys_fmr() */
 
@@ -960,12 +961,12 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	int ret;
 	u64 h_ret;
 	u32 hipz_acl;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
 	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
 	if (ehca_use_hp_mr == 1)
-	        hipz_acl |= 0x00000001;
+		hipz_acl |= 0x00000001;
 
 	h_ret = hipz_h_alloc_resource_mr(shca->ipz_hca_handle, e_mr,
 					 (u64)iova_start, size, hipz_acl,
@@ -1127,7 +1128,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 	u64 *kpage;
 	u64 rpage;
 	struct ehca_mr_pginfo pginfo_save;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
 	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
@@ -1167,7 +1168,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 			  "(Rereg1), h_ret=%lx e_mr=%p", h_ret, e_mr);
 		*pginfo = pginfo_save;
 		ret = -EAGAIN;
-	} else if ((u64*)hipzout.vaddr != iova_start) {
+	} else if ((u64 *)hipzout.vaddr != iova_start) {
 		ehca_err(&shca->ib_device, "PHYP changed iova_start in "
 			 "rereg_pmr, iova_start=%p iova_start_out=%lx e_mr=%p "
 			 "mr_handle=%lx lkey=%x lkey_out=%x", iova_start,
@@ -1305,7 +1306,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 	struct ehca_mr save_fmr;
 	u32 tmp_lkey, tmp_rkey;
 	struct ehca_mr_pginfo pginfo;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 	struct ehca_mr save_mr;
 
 	if (e_fmr->fmr_max_pages <= MAX_RPAGES) {
@@ -1397,7 +1398,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 	int ret = 0;
 	u64 h_ret;
 	u32 hipz_acl;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
 	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
@@ -1462,7 +1463,7 @@ int ehca_reg_internal_maxmr(
 
 	/* register internal max-MR on HCA */
 	size_maxmr = (u64)high_memory - PAGE_OFFSET;
-	iova_start = (u64*)KERNELBASE;
+	iova_start = (u64 *)KERNELBASE;
 	ib_pbuf.addr = 0;
 	ib_pbuf.size = size_maxmr;
 	num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
@@ -1519,7 +1520,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 	u64 h_ret;
 	struct ehca_mr *e_origmr = shca->maxmr;
 	u32 hipz_acl;
-	struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0};
+	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
 	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
@@ -1865,7 +1866,7 @@ int ehca_mr_is_maxmr(u64 size,
 {
 	/* a MR is treated as max-MR only if it fits following: */
 	if ((size == ((u64)high_memory - PAGE_OFFSET)) &&
-	    (iova_start == (void*)KERNELBASE)) {
+	    (iova_start == (void *)KERNELBASE)) {
 		ehca_gen_dbg("this is a max-MR");
 		return 1;
 	} else
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h
index fb69ede..24f13fe 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.h
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h
@@ -101,15 +101,10 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr,
 			     u64 *page_list,
 			     int list_len);
 
-int ehca_set_pagebuf(struct ehca_mr *e_mr,
-		     struct ehca_mr_pginfo *pginfo,
+int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo,
 		     u32 number,
 		     u64 *kpage);
 
-int ehca_set_pagebuf_1(struct ehca_mr *e_mr,
-		       struct ehca_mr_pginfo *pginfo,
-		       u64 *rpage);
-
 int ehca_mr_is_maxmr(u64 size,
 		     u64 *iova_start);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_qes.h b/drivers/infiniband/hw/ehca/ehca_qes.h
index 8707d29..8188030 100644
--- a/drivers/infiniband/hw/ehca/ehca_qes.h
+++ b/drivers/infiniband/hw/ehca/ehca_qes.h
@@ -53,13 +53,13 @@ struct ehca_vsgentry {
 	u32 length;
 };
 
-#define GRH_FLAG_MASK        EHCA_BMASK_IBM(7,7)
-#define GRH_IPVERSION_MASK   EHCA_BMASK_IBM(0,3)
-#define GRH_TCLASS_MASK      EHCA_BMASK_IBM(4,12)
-#define GRH_FLOWLABEL_MASK   EHCA_BMASK_IBM(13,31)
-#define GRH_PAYLEN_MASK      EHCA_BMASK_IBM(32,47)
-#define GRH_NEXTHEADER_MASK  EHCA_BMASK_IBM(48,55)
-#define GRH_HOPLIMIT_MASK    EHCA_BMASK_IBM(56,63)
+#define GRH_FLAG_MASK        EHCA_BMASK_IBM( 7,  7)
+#define GRH_IPVERSION_MASK   EHCA_BMASK_IBM( 0,  3)
+#define GRH_TCLASS_MASK      EHCA_BMASK_IBM( 4, 12)
+#define GRH_FLOWLABEL_MASK   EHCA_BMASK_IBM(13, 31)
+#define GRH_PAYLEN_MASK      EHCA_BMASK_IBM(32, 47)
+#define GRH_NEXTHEADER_MASK  EHCA_BMASK_IBM(48, 55)
+#define GRH_HOPLIMIT_MASK    EHCA_BMASK_IBM(56, 63)
 
 /*
  * Unreliable Datagram Address Vector Format
@@ -206,10 +206,10 @@ struct ehca_wqe {
 
 };
 
-#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0,0)
-#define WC_IMM_DATA     EHCA_BMASK_IBM(1,1)
-#define WC_GRH_PRESENT  EHCA_BMASK_IBM(2,2)
-#define WC_SE_BIT       EHCA_BMASK_IBM(3,3)
+#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0, 0)
+#define WC_IMM_DATA     EHCA_BMASK_IBM(1, 1)
+#define WC_GRH_PRESENT  EHCA_BMASK_IBM(2, 2)
+#define WC_SE_BIT       EHCA_BMASK_IBM(3, 3)
 #define WC_STATUS_ERROR_BIT 0x80000000
 #define WC_STATUS_REMOTE_ERROR_FLAGS 0x0000F800
 #define WC_STATUS_PURGE_BIT 0x10
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index f6f4ef6..3bd13e1 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -602,10 +602,10 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 			/* UD circumvention */
 			parms.act_nr_send_sges -= 2;
 			parms.act_nr_recv_sges -= 2;
-			swqe_size = offsetof(struct ehca_wqe,
-					     u.ud_av.sg_list[parms.act_nr_send_sges]);
-			rwqe_size = offsetof(struct ehca_wqe,
-					     u.ud_av.sg_list[parms.act_nr_recv_sges]);
+			swqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[
+						     parms.act_nr_send_sges]);
+			rwqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[
+						     parms.act_nr_recv_sges]);
 		}
 
 		if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) {
@@ -690,8 +690,8 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd,
 	if (my_qp->send_cq) {
 		ret = ehca_cq_assign_qp(my_qp->send_cq, my_qp);
 		if (ret) {
-			ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x",
-				 ret);
+			ehca_err(pd->device,
+				 "Couldn't assign qp to send_cq ret=%x", ret);
 			goto create_qp_exit4;
 		}
 	}
@@ -749,7 +749,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	struct ehca_qp *ret;
 
 	ret = internal_create_qp(pd, qp_init_attr, NULL, udata, 0);
-	return IS_ERR(ret) ? (struct ib_qp *) ret : &ret->ib_qp;
+	return IS_ERR(ret) ? (struct ib_qp *)ret : &ret->ib_qp;
 }
 
 int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
@@ -780,7 +780,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd,
 
 	my_qp = internal_create_qp(pd, &qp_init_attr, srq_init_attr, udata, 1);
 	if (IS_ERR(my_qp))
-		return (struct ib_srq *) my_qp;
+		return (struct ib_srq *)my_qp;
 
 	/* copy back return values */
 	srq_init_attr->attr.max_wr = qp_init_attr.cap.max_recv_wr;
@@ -875,7 +875,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca,
 			 my_qp, qp_num, h_ret);
 		return ehca2ib_return_code(h_ret);
 	}
-	bad_send_wqe_p = (void*)((u64)bad_send_wqe_p & (~(1L<<63)));
+	bad_send_wqe_p = (void *)((u64)bad_send_wqe_p & (~(1L << 63)));
 	ehca_dbg(&shca->ib_device, "qp_num=%x bad_send_wqe_p=%p",
 		 qp_num, bad_send_wqe_p);
 	/* convert wqe pointer to vadr */
@@ -890,7 +890,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca,
 	}
 
 	/* loop sets wqe's purge bit */
-	wqe = (struct ehca_wqe*)ipz_qeit_calc(squeue, q_ofs);
+	wqe = (struct ehca_wqe *)ipz_qeit_calc(squeue, q_ofs);
 	*bad_wqe_cnt = 0;
 	while (wqe->optype != 0xff && wqe->wqef != 0xff) {
 		if (ehca_debug_level)
@@ -898,7 +898,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca,
 		wqe->nr_of_data_seg = 0; /* suppress data access */
 		wqe->wqef = WQEF_PURGE; /* WQE to be purged */
 		q_ofs = ipz_queue_advance_offset(squeue, q_ofs);
-		wqe = (struct ehca_wqe*)ipz_qeit_calc(squeue, q_ofs);
+		wqe = (struct ehca_wqe *)ipz_qeit_calc(squeue, q_ofs);
 		*bad_wqe_cnt = (*bad_wqe_cnt)+1;
 	}
 	/*
@@ -1003,7 +1003,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 		goto modify_qp_exit1;
 	}
 
-	ehca_dbg(ibqp->device,"ehca_qp=%p qp_num=%x current qp_state=%x "
+	ehca_dbg(ibqp->device, "ehca_qp=%p qp_num=%x current qp_state=%x "
 		 "new qp_state=%x attribute_mask=%x",
 		 my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask);
 
@@ -1019,7 +1019,8 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 		goto modify_qp_exit1;
 	}
 
-	if ((mqpcb->qp_state = ib2ehca_qp_state(qp_new_state)))
+	mqpcb->qp_state = ib2ehca_qp_state(qp_new_state);
+	if (mqpcb->qp_state)
 		update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1);
 	else {
 		ret = -EINVAL;
@@ -1077,7 +1078,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 			spin_lock_irqsave(&my_qp->spinlock_s, flags);
 			squeue_locked = 1;
 			/* mark next free wqe */
-			wqe = (struct ehca_wqe*)
+			wqe = (struct ehca_wqe *)
 				ipz_qeit_get(&my_qp->ipz_squeue);
 			wqe->optype = wqe->wqef = 0xff;
 			ehca_dbg(ibqp->device, "qp_num=%x next_free_wqe=%p",
@@ -1312,7 +1313,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 	if (h_ret != H_SUCCESS) {
 		ret = ehca2ib_return_code(h_ret);
 		ehca_err(ibqp->device, "hipz_h_modify_qp() failed rc=%lx "
-			 "ehca_qp=%p qp_num=%x",h_ret, my_qp, ibqp->qp_num);
+			 "ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num);
 		goto modify_qp_exit2;
 	}
 
@@ -1411,7 +1412,7 @@ int ehca_query_qp(struct ib_qp *qp,
 	}
 
 	if (qp_attr_mask & QP_ATTR_QUERY_NOT_SUPPORTED) {
-		ehca_err(qp->device,"Invalid attribute mask "
+		ehca_err(qp->device, "Invalid attribute mask "
 			 "ehca_qp=%p qp_num=%x qp_attr_mask=%x ",
 			 my_qp, qp->qp_num, qp_attr_mask);
 		return -EINVAL;
@@ -1419,7 +1420,7 @@ int ehca_query_qp(struct ib_qp *qp,
 
 	qpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!qpcb) {
-		ehca_err(qp->device,"Out of memory for qpcb "
+		ehca_err(qp->device, "Out of memory for qpcb "
 			 "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num);
 		return -ENOMEM;
 	}
@@ -1431,7 +1432,7 @@ int ehca_query_qp(struct ib_qp *qp,
 
 	if (h_ret != H_SUCCESS) {
 		ret = ehca2ib_return_code(h_ret);
-		ehca_err(qp->device,"hipz_h_query_qp() failed "
+		ehca_err(qp->device, "hipz_h_query_qp() failed "
 			 "ehca_qp=%p qp_num=%x h_ret=%lx",
 			 my_qp, qp->qp_num, h_ret);
 		goto query_qp_exit1;
@@ -1442,7 +1443,7 @@ int ehca_query_qp(struct ib_qp *qp,
 
 	if (qp_attr->cur_qp_state == -EINVAL) {
 		ret = -EINVAL;
-		ehca_err(qp->device,"Got invalid ehca_qp_state=%x "
+		ehca_err(qp->device, "Got invalid ehca_qp_state=%x "
 			 "ehca_qp=%p qp_num=%x",
 			 qpcb->qp_state, my_qp, qp->qp_num);
 		goto query_qp_exit1;
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index 61da65e..94eed70 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -79,7 +79,8 @@ static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue,
 	}
 
 	if (ehca_debug_level) {
-		ehca_gen_dbg("RECEIVE WQE written into ipz_rqueue=%p", ipz_rqueue);
+		ehca_gen_dbg("RECEIVE WQE written into ipz_rqueue=%p",
+			     ipz_rqueue);
 		ehca_dmp( wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "recv wqe");
 	}
 
@@ -99,7 +100,7 @@ static void trace_send_wr_ud(const struct ib_send_wr *send_wr)
 		struct ib_mad_hdr *mad_hdr = send_wr->wr.ud.mad_hdr;
 		struct ib_sge *sge = send_wr->sg_list;
 		ehca_gen_dbg("send_wr#%x wr_id=%lx num_sge=%x "
-			     "send_flags=%x opcode=%x",idx, send_wr->wr_id,
+			     "send_flags=%x opcode=%x", idx, send_wr->wr_id,
 			     send_wr->num_sge, send_wr->send_flags,
 			     send_wr->opcode);
 		if (mad_hdr) {
@@ -116,7 +117,7 @@ static void trace_send_wr_ud(const struct ib_send_wr *send_wr)
 				     mad_hdr->attr_mod);
 		}
 		for (j = 0; j < send_wr->num_sge; j++) {
-			u8 *data = (u8 *) abs_to_virt(sge->addr);
+			u8 *data = (u8 *)abs_to_virt(sge->addr);
 			ehca_gen_dbg("send_wr#%x sge#%x addr=%p length=%x "
 				     "lkey=%x",
 				     idx, j, data, sge->length, sge->lkey);
@@ -534,9 +535,11 @@ poll_cq_one_read_cqe:
 
 	cqe_count++;
 	if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) {
-		struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number);
+		struct ehca_qp *qp;
 		int purgeflag;
 		unsigned long flags;
+
+		qp = ehca_cq_get_qp(my_cq, cqe->local_qp_number);
 		if (!qp) {
 			ehca_err(cq->device, "cq_num=%x qp_num=%x "
 				 "could not find qp -> ignore cqe",
@@ -551,8 +554,8 @@ poll_cq_one_read_cqe:
 		spin_unlock_irqrestore(&qp->spinlock_s, flags);
 
 		if (purgeflag) {
-			ehca_dbg(cq->device, "Got CQE with purged bit qp_num=%x "
-				 "src_qp=%x",
+			ehca_dbg(cq->device,
+				 "Got CQE with purged bit qp_num=%x src_qp=%x",
 				 cqe->local_qp_number, cqe->remote_qp_number);
 			if (ehca_debug_level)
 				ehca_dmp(cqe, 64, "qp_num=%x src_qp=%x",
diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h
index fd8238b..678b813 100644
--- a/drivers/infiniband/hw/ehca/ehca_tools.h
+++ b/drivers/infiniband/hw/ehca/ehca_tools.h
@@ -93,14 +93,14 @@ extern int ehca_debug_level;
 #define ehca_gen_dbg(format, arg...) \
 	do { \
 		if (unlikely(ehca_debug_level)) \
-			printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n",\
+			printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n", \
 			       get_paca()->paca_index, __FUNCTION__, ## arg); \
 	} while (0)
 
 #define ehca_gen_warn(format, arg...) \
 	do { \
 		if (unlikely(ehca_debug_level)) \
-			printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n",\
+			printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \
 			       get_paca()->paca_index, __FUNCTION__, ## arg); \
 	} while (0)
 
@@ -114,12 +114,12 @@ extern int ehca_debug_level;
  * <format string> adr=X ofs=Y <8 bytes hex> <8 bytes hex>
  */
 #define ehca_dmp(adr, len, format, args...) \
-	do {				       \
-		unsigned int x;			      \
+	do { \
+		unsigned int x; \
 		unsigned int l = (unsigned int)(len); \
-		unsigned char *deb = (unsigned char*)(adr);	\
+		unsigned char *deb = (unsigned char *)(adr); \
 		for (x = 0; x < l; x += 16) { \
-			printk("EHCA_DMP:%s " format \
+			printk(KERN_INFO "EHCA_DMP:%s " format \
 			       " adr=%p ofs=%04x %016lx %016lx\n", \
 			       __FUNCTION__, ##args, deb, x, \
 			       *((u64 *)&deb[0]), *((u64 *)&deb[8])); \
@@ -128,16 +128,16 @@ extern int ehca_debug_level;
 	} while (0)
 
 /* define a bitmask, little endian version */
-#define EHCA_BMASK(pos,length) (((pos)<<16)+(length))
+#define EHCA_BMASK(pos, length) (((pos) << 16) + (length))
 
 /* define a bitmask, the ibm way... */
-#define EHCA_BMASK_IBM(from,to) (((63-to)<<16)+((to)-(from)+1))
+#define EHCA_BMASK_IBM(from, to) (((63 - to) << 16) + ((to) - (from) + 1))
 
 /* internal function, don't use */
-#define EHCA_BMASK_SHIFTPOS(mask) (((mask)>>16)&0xffff)
+#define EHCA_BMASK_SHIFTPOS(mask) (((mask) >> 16) & 0xffff)
 
 /* internal function, don't use */
-#define EHCA_BMASK_MASK(mask) (0xffffffffffffffffULL >> ((64-(mask))&0xffff))
+#define EHCA_BMASK_MASK(mask) (~0ULL >> ((64 - (mask)) & 0xffff))
 
 /**
  * EHCA_BMASK_SET - return value shifted and masked by mask
@@ -145,14 +145,14 @@ extern int ehca_debug_level;
  * variable&=~EHCA_BMASK_SET(MY_MASK,-1) clears the bits from the mask
  * in variable
  */
-#define EHCA_BMASK_SET(mask,value) \
-	((EHCA_BMASK_MASK(mask) & ((u64)(value)))<<EHCA_BMASK_SHIFTPOS(mask))
+#define EHCA_BMASK_SET(mask, value) \
+	((EHCA_BMASK_MASK(mask) & ((u64)(value))) << EHCA_BMASK_SHIFTPOS(mask))
 
 /**
  * EHCA_BMASK_GET - extract a parameter from value by mask
  */
-#define EHCA_BMASK_GET(mask,value) \
-	(EHCA_BMASK_MASK(mask)& (((u64)(value))>>EHCA_BMASK_SHIFTPOS(mask)))
+#define EHCA_BMASK_GET(mask, value) \
+	(EHCA_BMASK_MASK(mask) & (((u64)(value)) >> EHCA_BMASK_SHIFTPOS(mask)))
 
 
 /* Converts ehca to ib return code */
diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c
index 3031b3b..05c4157 100644
--- a/drivers/infiniband/hw/ehca/ehca_uverbs.c
+++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c
@@ -70,7 +70,7 @@ int ehca_dealloc_ucontext(struct ib_ucontext *context)
 
 static void ehca_mm_open(struct vm_area_struct *vma)
 {
-	u32 *count = (u32*)vma->vm_private_data;
+	u32 *count = (u32 *)vma->vm_private_data;
 	if (!count) {
 		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
 			     vma->vm_start, vma->vm_end);
@@ -86,7 +86,7 @@ static void ehca_mm_open(struct vm_area_struct *vma)
 
 static void ehca_mm_close(struct vm_area_struct *vma)
 {
-	u32 *count = (u32*)vma->vm_private_data;
+	u32 *count = (u32 *)vma->vm_private_data;
 	if (!count) {
 		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
 			     vma->vm_start, vma->vm_end);
@@ -215,7 +215,8 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
 	case 2: /* qp rqueue_addr */
 		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
 			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue);
+		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue,
+				      &qp->mm_count_rqueue);
 		if (unlikely(ret)) {
 			ehca_err(qp->ib_qp.device,
 				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
@@ -227,7 +228,8 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
 	case 3: /* qp squeue_addr */
 		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
 			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue);
+		ret = ehca_mmap_queue(vma, &qp->ipz_squeue,
+				      &qp->mm_count_squeue);
 		if (unlikely(ret)) {
 			ehca_err(qp->ib_qp.device,
 				 "ehca_mmap_queue(sq) failed rc=%x qp_num=%x",
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 4776a8b..3394e05 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -501,8 +501,8 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle,
 		return H_PARAMETER;
 	}
 
-	return hipz_h_register_rpage(adapter_handle,pagesize,queue_type,
-				     qp_handle.handle,logical_address_of_page,
+	return hipz_h_register_rpage(adapter_handle, pagesize, queue_type,
+				     qp_handle.handle, logical_address_of_page,
 				     count);
 }
 
@@ -522,9 +522,9 @@ u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle,
 				qp_handle.handle,	   /* r6 */
 				0, 0, 0, 0, 0, 0);
 	if (log_addr_next_sq_wqe2processed)
-		*log_addr_next_sq_wqe2processed = (void*)outs[0];
+		*log_addr_next_sq_wqe2processed = (void *)outs[0];
 	if (log_addr_next_rq_wqe2processed)
-		*log_addr_next_rq_wqe2processed = (void*)outs[1];
+		*log_addr_next_rq_wqe2processed = (void *)outs[1];
 
 	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c
index 0b1a477..069c69e 100644
--- a/drivers/infiniband/hw/ehca/hcp_phyp.c
+++ b/drivers/infiniband/hw/ehca/hcp_phyp.c
@@ -50,7 +50,7 @@ int hcall_map_page(u64 physaddr, u64 *mapaddr)
 
 int hcall_unmap_page(u64 mapaddr)
 {
-	iounmap((volatile void __iomem*)mapaddr);
+	iounmap((volatile void __iomem *)mapaddr);
 	return 0;
 }
 
diff --git a/drivers/infiniband/hw/ehca/hipz_fns_core.h b/drivers/infiniband/hw/ehca/hipz_fns_core.h
index 20898a1..868735f 100644
--- a/drivers/infiniband/hw/ehca/hipz_fns_core.h
+++ b/drivers/infiniband/hw/ehca/hipz_fns_core.h
@@ -53,10 +53,10 @@
 #define hipz_galpa_load_cq(gal, offset) \
 	hipz_galpa_load(gal, CQTEMM_OFFSET(offset))
 
-#define hipz_galpa_store_qp(gal,offset, value) \
+#define hipz_galpa_store_qp(gal, offset, value) \
 	hipz_galpa_store(gal, QPTEMM_OFFSET(offset), value)
 #define hipz_galpa_load_qp(gal, offset) \
-	hipz_galpa_load(gal,QPTEMM_OFFSET(offset))
+	hipz_galpa_load(gal, QPTEMM_OFFSET(offset))
 
 static inline void hipz_update_sqa(struct ehca_qp *qp, u16 nr_wqes)
 {
diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h
index dad6dea..d9739e5 100644
--- a/drivers/infiniband/hw/ehca/hipz_hw.h
+++ b/drivers/infiniband/hw/ehca/hipz_hw.h
@@ -161,11 +161,11 @@ struct hipz_qptemm {
 /* 0x1000      */
 };
 
-#define QPX_SQADDER EHCA_BMASK_IBM(48,63)
-#define QPX_RQADDER EHCA_BMASK_IBM(48,63)
-#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3,3)
+#define QPX_SQADDER EHCA_BMASK_IBM(48, 63)
+#define QPX_RQADDER EHCA_BMASK_IBM(48, 63)
+#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3, 3)
 
-#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x)
+#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm, x)
 
 /* MRMWPT Entry Memory Map */
 struct hipz_mrmwmm {
@@ -187,7 +187,7 @@ struct hipz_mrmwmm {
 
 };
 
-#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm,x)
+#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm, x)
 
 struct hipz_qpedmm {
 	/* 0x00 */
@@ -238,7 +238,7 @@ struct hipz_qpedmm {
 	u64 qpedx_rrva3;
 };
 
-#define QPEDMM_OFFSET(x) offsetof(struct hipz_qpedmm,x)
+#define QPEDMM_OFFSET(x) offsetof(struct hipz_qpedmm, x)
 
 /* CQ Table Entry Memory Map */
 struct hipz_cqtemm {
@@ -263,12 +263,12 @@ struct hipz_cqtemm {
 /* 0x1000 */
 };
 
-#define CQX_FEC_CQE_CNT           EHCA_BMASK_IBM(32,63)
-#define CQX_FECADDER              EHCA_BMASK_IBM(32,63)
-#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0,0)
-#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0,0)
+#define CQX_FEC_CQE_CNT           EHCA_BMASK_IBM(32, 63)
+#define CQX_FECADDER              EHCA_BMASK_IBM(32, 63)
+#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0, 0)
+#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0, 0)
 
-#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm,x)
+#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm, x)
 
 /* EQ Table Entry Memory Map */
 struct hipz_eqtemm {
@@ -293,7 +293,7 @@ struct hipz_eqtemm {
 
 };
 
-#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm,x)
+#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm, x)
 
 /* access control defines for MR/MW */
 #define HIPZ_ACCESSCTRL_L_WRITE  0x00800000
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
index bf7a400..9606f13 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
@@ -114,7 +114,7 @@ int ipz_queue_ctor(struct ipz_queue *queue,
 	 */
 	f = 0;
 	while (f < nr_of_pages) {
-		u8 *kpage = (u8*)get_zeroed_page(GFP_KERNEL);
+		u8 *kpage = (u8 *)get_zeroed_page(GFP_KERNEL);
 		int k;
 		if (!kpage)
 			goto ipz_queue_ctor_exit0; /*NOMEM*/
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index 007f088..39a4f64 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -240,7 +240,7 @@ void *ipz_qeit_eq_get_inc(struct ipz_queue *queue);
 static inline void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue)
 {
 	void *ret = ipz_qeit_get(queue);
-	u32 qe = *(u8 *) ret;
+	u32 qe = *(u8 *)ret;
 	if ((qe >> 7) != (queue->toggle_state & 1))
 		return NULL;
 	ipz_qeit_eq_get_inc(queue); /* this is a good one */
@@ -250,7 +250,7 @@ static inline void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue)
 static inline void *ipz_eqit_eq_peek_valid(struct ipz_queue *queue)
 {
 	void *ret = ipz_qeit_get(queue);
-	u32 qe = *(u8 *) ret;
+	u32 qe = *(u8 *)ret;
 	if ((qe >> 7) != (queue->toggle_state & 1))
 		return NULL;
 	return ret;
-- 
1.5.2


From fenkes at de.ibm.com  Thu Jul 12 08:54:19 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:54:19 +0200
Subject: [ofa-general] [PATCH 10/10] IB/ehca: Support large page MRs
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121754.20293.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

Add support for MR pages larger than 4K on eHCA2. This reduces firmware
memory consumption. If enabled via the mr_largepage module parameter, the MR
page size will be determined based on the MR length and the hardware
capabilities - if the MR is >= 16M, 16M pages are used, for example.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
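Note for readers (below the "---", so not part of the commit message): the
selection rule described above can be sketched as a standalone C function.
This is only an illustration that mirrors the branch this patch adds to
ehca_reg_user_mr(). The names select_hwpage_size and SKETCH_*, and the
explicit kernel_page_size parameter, are hypothetical and do not exist in
the driver.

```c
#include <stdint.h>

/* Hypothetical stand-ins for EHCA_MR_PGSIZE* and HCA_CAP_MR_PGSIZE_16M. */
#define SKETCH_PGSIZE_4K   0x1000ULL
#define SKETCH_PGSIZE_64K  0x10000ULL
#define SKETCH_PGSIZE_1M   0x100000ULL
#define SKETCH_PGSIZE_16M  0x1000000ULL
#define SKETCH_CAP_16M     8u

/*
 * Pick the hardware page size for a user MR of the given length.
 * largepage models the mr_largepage module parameter, cap models
 * shca->hca_cap_mr_pgsize, and kernel_page_size models PAGE_SIZE.
 */
static uint64_t select_hwpage_size(uint64_t length, uint64_t kernel_page_size,
				   int largepage, uint32_t cap)
{
	/* Large pages are only tried when enabled and 16M is supported. */
	if (!largepage || !(cap & SKETCH_CAP_16M))
		return SKETCH_PGSIZE_4K;
	if (length <= SKETCH_PGSIZE_4K && kernel_page_size == SKETCH_PGSIZE_4K)
		return SKETCH_PGSIZE_4K;
	if (length <= SKETCH_PGSIZE_64K)
		return SKETCH_PGSIZE_64K;
	if (length <= SKETCH_PGSIZE_1M)
		return SKETCH_PGSIZE_1M;
	return SKETCH_PGSIZE_16M;
}
```

Under these assumptions, a 2M region registered with mr_largepage=1 on an
adapter reporting the 16M capability would get 16M pages, while the same
region with mr_largepage=0 stays at 4K.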
 drivers/infiniband/hw/ehca/ehca_classes.h |   10 +
 drivers/infiniband/hw/ehca/ehca_main.c    |   17 ++-
 drivers/infiniband/hw/ehca/ehca_mrmw.c    |  371 ++++++++++++++++++++++++-----
 drivers/infiniband/hw/ehca/ehca_mrmw.h    |    2 +-
 drivers/infiniband/hw/ehca/hcp_if.c       |   20 ++-
 5 files changed, 357 insertions(+), 63 deletions(-)
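
A second reader's sketch, again not part of the patch: how the chosen page
size might translate into the hardware code, the access-control bits, and
the number of hardware pages to register. It follows ehca_encode_hwpage_size()
and ehca_mrmw_set_pgsize_hipz_acl() as added below. The sketch_* names are
hypothetical, SKETCH_NUM_CHUNKS is assumed to behave like the driver's
NUM_CHUNKS() (round-up division), and the zero guard is an addition of the
sketch rather than something the patch does.

```c
#include <stdint.h>

/* Assumption: same semantics as the driver's NUM_CHUNKS() macro. */
#define SKETCH_NUM_CHUNKS(length, chunk_size) \
	(((length) + ((chunk_size) - 1)) / (chunk_size))

/* Map a page size to the hw code: 0, 1, 2, 3 for 4K, 64K, 1M, 16M. */
static uint32_t sketch_encode_hwpage_size(uint64_t pgsize)
{
	uint32_t idx = 0;

	pgsize >>= 12;
	/* Guard added for the sketch: avoid looping forever on sizes < 4K. */
	if (!pgsize)
		return 0;
	/* Each supported size is 16 times the previous one. */
	while (!(pgsize & 1)) {
		idx++;
		pgsize >>= 4;
	}
	return idx;
}

/* OR the page size code into bits 24 and up of the access control word. */
static uint32_t sketch_set_pgsize_acl(uint64_t pgsize, uint32_t hipz_acl)
{
	return hipz_acl | (sketch_encode_hwpage_size(pgsize) << 24);
}

/* Number of hw pages needed to cover [virt, virt + length). */
static uint64_t sketch_num_hwpages(uint64_t virt, uint64_t length,
				   uint64_t hwpage_size)
{
	return SKETCH_NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size);
}
```

With these definitions, a 32M region starting at offset 0x1000 needs 3
hardware pages at 16M but 8192 at 4K, which is the kind of reduction the
commit message refers to.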

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 1752821..2a39cfa 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -105,6 +105,12 @@ struct ehca_sport {
 };
 
 #define EHCA_MAX_NR_EQS 512
+
+#define HCA_CAP_MR_PGSIZE_4K  1
+#define HCA_CAP_MR_PGSIZE_64K 2
+#define HCA_CAP_MR_PGSIZE_1M  4
+#define HCA_CAP_MR_PGSIZE_16M 8
+
 struct ehca_shca {
 	struct ib_device ib_device;
 	struct ibmebus_dev *ibmebus_dev;
@@ -121,6 +127,8 @@ struct ehca_shca {
 	struct h_galpas galpas;
 	struct mutex modify_mutex;
 	u64 hca_cap;
+	/* MR pgsize: bit 0-3 means 4K, 64K, 1M, 16M respectively */
+	u32 hca_cap_mr_pgsize;
 	int max_mtu;
 	atomic_t cur_eq_idx;
 };
@@ -213,6 +221,7 @@ struct ehca_mr {
 	enum ehca_mr_flag flags;
 	u32 num_kpages;		/* number of kernel pages */
 	u32 num_hwpages;	/* number of hw pages to form MR */
+	u64 hwpage_size;	/* hw page size used for this MR */
 	int acl;		/* ACL (stored here for usage in reregister) */
 	u64 *start;		/* virtual start address (stored here for */
 				/* usage in reregister) */
@@ -247,6 +256,7 @@ struct ehca_mr_pginfo {
 	enum ehca_mr_pgi_type type;
 	u64 num_kpages;
 	u64 kpage_cnt;
+	u64 hwpage_size;     /* hw page size used for this MR */
 	u64 num_hwpages;     /* number of hw pages */
 	u64 hwpage_cnt;      /* counter for hw pages */
 	u64 next_hwpage;     /* next hw page in buffer/chunk/listelem */
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index ecf4ef4..5f207f2 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -65,6 +65,7 @@ int ehca_static_rate   = -1;
 int ehca_scaling_code  = 0;
 int ehca_nr_eqs        = 2;
 int ehca_dist_eqs      = 0;
+int ehca_mr_largepage  = 0;
 
 module_param_named(open_aqp1,     ehca_open_aqp1,     int, 0);
 module_param_named(debug_level,   ehca_debug_level,   int, 0);
@@ -77,6 +78,7 @@ module_param_named(static_rate,   ehca_static_rate,   int, 0);
 module_param_named(scaling_code,  ehca_scaling_code,  int, 0);
 module_param_named(nr_eqs,        ehca_nr_eqs,        int, 0);
 module_param_named(dist_eqs,      ehca_dist_eqs,      int, 0);
+module_param_named(mr_largepage,  ehca_mr_largepage,  int, 0);
 
 MODULE_PARM_DESC(open_aqp1,
 		 "AQP1 on startup (0: no (default), 1: yes)");
@@ -104,6 +106,9 @@ MODULE_PARM_DESC(nr_eqs,
 MODULE_PARM_DESC(dist_eqs,
 		 "enable distributing EQs across CQs "
 		 "(0: disabled/default, 1: enabled)");
+MODULE_PARM_DESC(mr_largepage,
+		 "use large page for MR (0: use PAGE_SIZE (default), "
+		 "1: use large page depending on MR size)");
 
 DEFINE_RWLOCK(ehca_qp_idr_lock);
 DEFINE_RWLOCK(ehca_cq_idr_lock);
@@ -314,6 +319,8 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
 			ehca_gen_dbg("   %s", hca_cap_descr[i].descr);
 
+	shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported;
+
 	port = (struct hipz_query_port *)rblock;
 	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
 	if (h_ret != H_SUCCESS) {
@@ -609,13 +616,20 @@ static ssize_t ehca_show_adapter_handle(struct device *dev,
 }
 static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL);
 
+static ssize_t ehca_show_mr_largepage(struct device *dev,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	return sprintf(buf, "%d\n", ehca_mr_largepage);
+}
+static DEVICE_ATTR(mr_largepage, S_IRUGO, ehca_show_mr_largepage, NULL);
+
 static ssize_t ehca_show_nr_eqs(struct device *dev,
 				struct device_attribute *attr,
 				char *buf)
 {
 	return sprintf(buf, "%d\n", ehca_nr_eqs);
 }
-
 static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL);
 
 static struct attribute *ehca_dev_attrs[] = {
@@ -635,6 +649,7 @@ static struct attribute *ehca_dev_attrs[] = {
 	&dev_attr_max_pd.attr,
 	&dev_attr_max_ah.attr,
 	&dev_attr_nr_eqs.attr,
+	&dev_attr_mr_largepage.attr,
 	NULL
 };
 
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 6262c54..ba28783 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -5,6 +5,7 @@
  *
  *  Authors: Dietmar Decker <ddecker at de.ibm.com>
  *           Christoph Raisch <raisch at de.ibm.com>
+ *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
  *
  *  Copyright (c) 2005 IBM Corporation
  *
@@ -56,6 +57,37 @@
 static struct kmem_cache *mr_cache;
 static struct kmem_cache *mw_cache;
 
+enum ehca_mr_pgsize {
+	EHCA_MR_PGSIZE4K  = 0x1000L,
+	EHCA_MR_PGSIZE64K = 0x10000L,
+	EHCA_MR_PGSIZE1M  = 0x100000L,
+	EHCA_MR_PGSIZE16M = 0x1000000L
+};
+
+extern int ehca_mr_largepage;
+
+static u32 ehca_encode_hwpage_size(u32 pgsize)
+{
+	u32 idx = 0;
+	pgsize >>= 12;
+	/*
+	 * map mr page size into hw code:
+	 * 0, 1, 2, 3 for 4K, 64K, 1M, 16M
+	 */
+	while (!(pgsize & 1)) {
+		idx++;
+		pgsize >>= 4;
+	}
+	return idx;
+}
+
+static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca)
+{
+	if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)
+		return EHCA_MR_PGSIZE16M;
+	return EHCA_MR_PGSIZE4K;
+}
+
 static struct ehca_mr *ehca_mr_new(void)
 {
 	struct ehca_mr *me;
@@ -207,19 +239,23 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 		struct ehca_mr_pginfo pginfo;
 		u32 num_kpages;
 		u32 num_hwpages;
+		u64 hw_pgsize;
 
 		num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size,
 					PAGE_SIZE);
-		num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) +
-					 size, EHCA_PAGESIZE);
+		/* for kernel space we try the largest possible pgsize */
+		hw_pgsize = ehca_get_max_hwpage_size(shca);
+		num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size,
+					 hw_pgsize);
 		memset(&pginfo, 0, sizeof(pginfo));
 		pginfo.type = EHCA_MR_PGI_PHYS;
 		pginfo.num_kpages = num_kpages;
+		pginfo.hwpage_size = hw_pgsize;
 		pginfo.num_hwpages = num_hwpages;
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
-		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
-				      EHCA_PAGESIZE);
+		pginfo.next_hwpage =
+			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags,
 				  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
@@ -259,6 +295,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	int ret;
 	u32 num_kpages;
 	u32 num_hwpages;
+	u64 hwpage_size;
 
 	if (!pd) {
 		ehca_gen_err("bad pd=%p", pd);
@@ -309,16 +346,32 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	/* determine number of MR pages */
 	num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
-	num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length,
-				 EHCA_PAGESIZE);
+	/* select proper hw_pgsize */
+	if (ehca_mr_largepage &&
+	    (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) {
+		if (length <= EHCA_MR_PGSIZE4K
+		    && PAGE_SIZE == EHCA_MR_PGSIZE4K)
+			hwpage_size = EHCA_MR_PGSIZE4K;
+		else if (length <= EHCA_MR_PGSIZE64K)
+			hwpage_size = EHCA_MR_PGSIZE64K;
+		else if (length <= EHCA_MR_PGSIZE1M)
+			hwpage_size = EHCA_MR_PGSIZE1M;
+		else
+			hwpage_size = EHCA_MR_PGSIZE16M;
+	} else
+		hwpage_size = EHCA_MR_PGSIZE4K;
+	ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size);
 
+reg_user_mr_fallback:
+	num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size);
 	/* register MR on HCA */
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_USER;
+	pginfo.hwpage_size = hwpage_size;
 	pginfo.num_kpages = num_kpages;
 	pginfo.num_hwpages = num_hwpages;
 	pginfo.u.usr.region = e_mr->umem;
-	pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE;
+	pginfo.next_hwpage = e_mr->umem->offset / hwpage_size;
 	pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk,
 						     (&e_mr->umem->chunk_list),
 						     list);
@@ -326,6 +379,18 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
 			  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
 			  &e_mr->ib.ib_mr.rkey);
+	if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) {
+		ehca_warn(pd->device, "failed to register mr "
+			  "with hwpage_size=%lx", hwpage_size);
+		ehca_info(pd->device, "try to register mr with "
+			  "kpage_size=%lx", PAGE_SIZE);
+		/*
+		 * this means kpages are not contiguous for a hw page
+		 * try kernel page size as fallback solution
+		 */
+		hwpage_size = PAGE_SIZE;
+		goto reg_user_mr_fallback;
+	}
 	if (ret) {
 		ib_mr = ERR_PTR(ret);
 		goto reg_user_mr_exit2;
@@ -452,6 +517,8 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 	new_pd = container_of(mr->pd, struct ehca_pd, ib_pd);
 
 	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		u64 hw_pgsize = ehca_get_max_hwpage_size(shca);
+
 		new_start = iova_start;	/* change address */
 		/* check physical buffer list and calculate size */
 		ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array,
@@ -468,16 +535,17 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 		}
 		num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) +
 					new_size, PAGE_SIZE);
-		num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) +
-					 new_size, EHCA_PAGESIZE);
+		num_hwpages = NUM_CHUNKS(((u64)new_start % hw_pgsize) +
+					 new_size, hw_pgsize);
 		memset(&pginfo, 0, sizeof(pginfo));
 		pginfo.type = EHCA_MR_PGI_PHYS;
 		pginfo.num_kpages = num_kpages;
+		pginfo.hwpage_size = hw_pgsize;
 		pginfo.num_hwpages = num_hwpages;
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
-		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
-				      EHCA_PAGESIZE);
+		pginfo.next_hwpage =
+			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
 	}
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
 		new_acl = mr_access_flags;
@@ -709,6 +777,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	int ret;
 	u32 tmp_lkey, tmp_rkey;
 	struct ehca_mr_pginfo pginfo;
+	u64 hw_pgsize;
 
 	/* check other parameters */
 	if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) &&
@@ -738,8 +807,8 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 		ib_fmr = ERR_PTR(-EINVAL);
 		goto alloc_fmr_exit0;
 	}
-	if (((1 << fmr_attr->page_shift) != EHCA_PAGESIZE) &&
-	    ((1 << fmr_attr->page_shift) != PAGE_SIZE)) {
+	hw_pgsize = ehca_get_max_hwpage_size(shca);
+	if ((1 << fmr_attr->page_shift) != hw_pgsize) {
 		ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x",
 			 fmr_attr->page_shift);
 		ib_fmr = ERR_PTR(-EINVAL);
@@ -755,6 +824,10 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 
 	/* register MR on HCA */
 	memset(&pginfo, 0, sizeof(pginfo));
+	/*
+	 * pginfo.num_hwpages==0, ie register_rpages() will not be called
+	 * but deferred to map_phys_fmr()
+	 */
 	ret = ehca_reg_mr(shca, e_fmr, NULL,
 			  fmr_attr->max_pages * (1 << fmr_attr->page_shift),
 			  mr_access_flags, e_pd, &pginfo,
@@ -765,6 +838,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	}
 
 	/* successful */
+	e_fmr->hwpage_size = hw_pgsize;
 	e_fmr->fmr_page_size = 1 << fmr_attr->page_shift;
 	e_fmr->fmr_max_pages = fmr_attr->max_pages;
 	e_fmr->fmr_max_maps = fmr_attr->max_maps;
@@ -822,10 +896,12 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_FMR;
 	pginfo.num_kpages = list_len;
-	pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE);
+	pginfo.hwpage_size = e_fmr->hwpage_size;
+	pginfo.num_hwpages =
+		list_len * e_fmr->fmr_page_size / pginfo.hwpage_size;
 	pginfo.u.fmr.page_list = page_list;
-	pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) /
-			      EHCA_PAGESIZE);
+	pginfo.next_hwpage =
+		(iova & (e_fmr->fmr_page_size-1)) / pginfo.hwpage_size;
 	pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size;
 
 	ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova,
@@ -964,7 +1040,7 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl);
 	if (ehca_use_hp_mr == 1)
 		hipz_acl |= 0x00000001;
 
@@ -987,6 +1063,7 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	/* successful registration */
 	e_mr->num_kpages = pginfo->num_kpages;
 	e_mr->num_hwpages = pginfo->num_hwpages;
+	e_mr->hwpage_size = pginfo->hwpage_size;
 	e_mr->start = iova_start;
 	e_mr->size = size;
 	e_mr->acl = acl;
@@ -1029,6 +1106,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 	u32 i;
 	u64 *kpage;
 
+	if (!pginfo->num_hwpages) /* in case of fmr */
+		return 0;
+
 	kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!kpage) {
 		ehca_err(&shca->ib_device, "kpage alloc failed");
@@ -1036,7 +1116,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		goto ehca_reg_mr_rpages_exit0;
 	}
 
-	/* max 512 pages per shot */
+	/* max MAX_RPAGES ehca mr pages per register call */
 	for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) {
 
 		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
@@ -1049,8 +1129,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		ret = ehca_set_pagebuf(pginfo, rnum, kpage);
 		if (ret) {
 			ehca_err(&shca->ib_device, "ehca_set_pagebuf "
-					 "bad rc, ret=%x rnum=%x kpage=%p",
-					 ret, rnum, kpage);
+				 "bad rc, ret=%x rnum=%x kpage=%p",
+				 ret, rnum, kpage);
 			goto ehca_reg_mr_rpages_exit1;
 		}
 
@@ -1065,9 +1145,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		} else
 			rpage = *kpage;
 
-		h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr,
-						 0, /* pagesize 4k */
-						 0, rpage, rnum);
+		h_ret = hipz_h_register_rpage_mr(
+			shca->ipz_hca_handle, e_mr,
+			ehca_encode_hwpage_size(pginfo->hwpage_size),
+			0, rpage, rnum);
 
 		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
 			/*
@@ -1131,7 +1212,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl);
 
 	kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!kpage) {
@@ -1182,6 +1263,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 		 */
 		e_mr->num_kpages = pginfo->num_kpages;
 		e_mr->num_hwpages = pginfo->num_hwpages;
+		e_mr->hwpage_size = pginfo->hwpage_size;
 		e_mr->start = iova_start;
 		e_mr->size = size;
 		e_mr->acl = acl;
@@ -1268,13 +1350,14 @@ int ehca_rereg_mr(struct ehca_shca *shca,
 
 		/* set some MR values */
 		e_mr->flags = save_mr.flags;
+		e_mr->hwpage_size = save_mr.hwpage_size;
 		e_mr->fmr_page_size = save_mr.fmr_page_size;
 		e_mr->fmr_max_pages = save_mr.fmr_max_pages;
 		e_mr->fmr_max_maps = save_mr.fmr_max_maps;
 		e_mr->fmr_map_cnt = save_mr.fmr_map_cnt;
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, acl,
-				      e_pd, pginfo, lkey, rkey);
+				  e_pd, pginfo, lkey, rkey);
 		if (ret) {
 			u32 offset = (u64)(&e_mr->flags) - (u64)e_mr;
 			memcpy(&e_mr->flags, &(save_mr.flags),
@@ -1355,6 +1438,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 
 	/* set some MR values */
 	e_fmr->flags = save_fmr.flags;
+	e_fmr->hwpage_size = save_fmr.hwpage_size;
 	e_fmr->fmr_page_size = save_fmr.fmr_page_size;
 	e_fmr->fmr_max_pages = save_fmr.fmr_max_pages;
 	e_fmr->fmr_max_maps = save_fmr.fmr_max_maps;
@@ -1363,8 +1447,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_FMR;
-	pginfo.num_kpages = 0;
-	pginfo.num_hwpages = 0;
 	ret = ehca_reg_mr(shca, e_fmr, NULL,
 			  (e_fmr->fmr_max_pages * e_fmr->fmr_page_size),
 			  e_fmr->acl, e_pd, &pginfo, &tmp_lkey,
@@ -1373,7 +1455,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 		u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr;
 		memcpy(&e_fmr->flags, &(save_mr.flags),
 		       sizeof(struct ehca_mr) - offset);
-		goto ehca_unmap_one_fmr_exit0;
 	}
 
 ehca_unmap_one_fmr_exit0:
@@ -1401,7 +1482,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl);
 
 	h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr,
 				    (u64)iova_start, hipz_acl, e_pd->fw_pd,
@@ -1420,6 +1501,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 	/* successful registration */
 	e_newmr->num_kpages = e_origmr->num_kpages;
 	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->hwpage_size   = e_origmr->hwpage_size;
 	e_newmr->start = iova_start;
 	e_newmr->size = e_origmr->size;
 	e_newmr->acl = acl;
@@ -1452,6 +1534,7 @@ int ehca_reg_internal_maxmr(
 	struct ib_phys_buf ib_pbuf;
 	u32 num_kpages;
 	u32 num_hwpages;
+	u64 hw_pgsize;
 
 	e_mr = ehca_mr_new();
 	if (!e_mr) {
@@ -1468,13 +1551,15 @@ int ehca_reg_internal_maxmr(
 	ib_pbuf.size = size_maxmr;
 	num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
 				PAGE_SIZE);
-	num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr,
-				 EHCA_PAGESIZE);
+	hw_pgsize = ehca_get_max_hwpage_size(shca);
+	num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size_maxmr,
+				 hw_pgsize);
 
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_PHYS;
 	pginfo.num_kpages = num_kpages;
 	pginfo.num_hwpages = num_hwpages;
+	pginfo.hwpage_size = hw_pgsize;
 	pginfo.u.phy.num_phys_buf = 1;
 	pginfo.u.phy.phys_buf_array = &ib_pbuf;
 
@@ -1523,7 +1608,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl);
 
 	h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr,
 				    (u64)iova_start, hipz_acl, e_pd->fw_pd,
@@ -1539,6 +1624,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 	/* successful registration */
 	e_newmr->num_kpages = e_origmr->num_kpages;
 	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->hwpage_size = e_origmr->hwpage_size;
 	e_newmr->start = iova_start;
 	e_newmr->size = e_origmr->size;
 	e_newmr->acl = acl;
@@ -1684,6 +1770,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 	u64 pgaddr;
 	u32 i = 0;
 	u32 j = 0;
+	int hwpages_per_kpage = PAGE_SIZE / pginfo->hwpage_size;
 
 	/* loop over desired chunk entries */
 	chunk      = pginfo->u.usr.next_chunk;
@@ -1695,7 +1782,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 				<< PAGE_SHIFT ;
 			*kpage = phys_to_abs(pgaddr +
 					     (pginfo->next_hwpage *
-					      EHCA_PAGESIZE));
+					      pginfo->hwpage_size));
 			if ( !(*kpage) ) {
 				ehca_gen_err("pgaddr=%lx "
 					     "chunk->page_list[i]=%lx "
@@ -1708,8 +1795,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
 			kpage++;
-			if (pginfo->next_hwpage %
-			    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
+			if (pginfo->next_hwpage % hwpages_per_kpage == 0) {
 				(pginfo->kpage_cnt)++;
 				(pginfo->u.usr.next_nmap)++;
 				pginfo->next_hwpage = 0;
@@ -1738,6 +1824,143 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 	return ret;
 }
 
+/*
+ * check given pages for contiguous layout
+ * last page addr is returned in prev_pgaddr for further check
+ */
+static int ehca_check_kpages_per_ate(struct scatterlist *page_list,
+				     int start_idx, int end_idx,
+				     u64 *prev_pgaddr)
+{
+	int t;
+	for (t = start_idx; t <= end_idx; t++) {
+		u64 pgaddr = page_to_pfn(page_list[t].page) << PAGE_SHIFT;
+		ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr,
+			     *(u64 *)abs_to_virt(phys_to_abs(pgaddr)));
+		if (pgaddr - PAGE_SIZE != *prev_pgaddr) {
+			ehca_gen_err("uncontiguous page found pgaddr=%lx "
+				     "prev_pgaddr=%lx page_list_i=%x",
+				     pgaddr, *prev_pgaddr, t);
+			return -EINVAL;
+		}
+		*prev_pgaddr = pgaddr;
+	}
+	return 0;
+}
+
+/* PAGE_SIZE < pginfo->hwpage_size */
+static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo,
+				  u32 number,
+				  u64 *kpage)
+{
+	int ret = 0;
+	struct ib_umem_chunk *prev_chunk;
+	struct ib_umem_chunk *chunk;
+	u64 pgaddr, prev_pgaddr;
+	u32 i = 0;
+	u32 j = 0;
+	int kpages_per_hwpage = pginfo->hwpage_size / PAGE_SIZE;
+	int nr_kpages = kpages_per_hwpage;
+
+	/* loop over desired chunk entries */
+	chunk      = pginfo->u.usr.next_chunk;
+	prev_chunk = pginfo->u.usr.next_chunk;
+	list_for_each_entry_continue(
+		chunk, (&(pginfo->u.usr.region->chunk_list)), list) {
+		for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) {
+			if (nr_kpages == kpages_per_hwpage) {
+				pgaddr = ( page_to_pfn(chunk->page_list[i].page)
+					   << PAGE_SHIFT );
+				*kpage = phys_to_abs(pgaddr);
+				if ( !(*kpage) ) {
+					ehca_gen_err("pgaddr=%lx i=%x",
+						     pgaddr, i);
+					ret = -EFAULT;
+					return ret;
+				}
+				/*
+				 * The first page in a hwpage must be aligned;
+				 * the first MR page is exempt from this rule.
+				 */
+				if (pgaddr & (pginfo->hwpage_size - 1)) {
+					if (pginfo->hwpage_cnt) {
+						ehca_gen_err(
+							"invalid alignment "
+							"pgaddr=%lx i=%x "
+							"mr_pgsize=%lx",
+							pgaddr, i,
+							pginfo->hwpage_size);
+						ret = -EFAULT;
+						return ret;
+					}
+					/* first MR page */
+					pginfo->kpage_cnt =
+						(pgaddr &
+						 (pginfo->hwpage_size - 1)) >>
+						PAGE_SHIFT;
+					nr_kpages -= pginfo->kpage_cnt;
+					*kpage = phys_to_abs(
+						pgaddr &
+						~(pginfo->hwpage_size - 1));
+				}
+				ehca_gen_dbg("kpage=%lx chunk_page=%lx "
+					     "value=%016lx", *kpage, pgaddr,
+					     *(u64 *)abs_to_virt(
+						     phys_to_abs(pgaddr)));
+				prev_pgaddr = pgaddr;
+				i++;
+				pginfo->kpage_cnt++;
+				pginfo->u.usr.next_nmap++;
+				nr_kpages--;
+				if (!nr_kpages)
+					goto next_kpage;
+				continue;
+			}
+			if (i + nr_kpages > chunk->nmap) {
+				ret = ehca_check_kpages_per_ate(
+					chunk->page_list, i,
+					chunk->nmap - 1, &prev_pgaddr);
+				if (ret) return ret;
+				pginfo->kpage_cnt += chunk->nmap - i;
+				pginfo->u.usr.next_nmap += chunk->nmap - i;
+				nr_kpages -= chunk->nmap - i;
+				break;
+			}
+
+			ret = ehca_check_kpages_per_ate(chunk->page_list, i,
+							i + nr_kpages - 1,
+							&prev_pgaddr);
+			if (ret) return ret;
+			i += nr_kpages;
+			pginfo->kpage_cnt += nr_kpages;
+			pginfo->u.usr.next_nmap += nr_kpages;
+next_kpage:
+			nr_kpages = kpages_per_hwpage;
+			(pginfo->hwpage_cnt)++;
+			kpage++;
+			j++;
+			if (j >= number) break;
+		}
+		if ((pginfo->u.usr.next_nmap >= chunk->nmap) &&
+		    (j >= number)) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+			break;
+		} else if (pginfo->u.usr.next_nmap >= chunk->nmap) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+		} else if (j >= number)
+			break;
+		else
+			prev_chunk = chunk;
+	}
+	pginfo->u.usr.next_chunk =
+		list_prepare_entry(prev_chunk,
+				   (&(pginfo->u.usr.region->chunk_list)),
+				   list);
+	return ret;
+}
+
 int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 			  u32 number,
 			  u64 *kpage)
@@ -1750,9 +1973,10 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 	/* loop over desired phys_buf_array entries */
 	while (i < number) {
 		pbuf   = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf;
-		num_hw  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) +
-				     pbuf->size, EHCA_PAGESIZE);
-		offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
+		num_hw  = NUM_CHUNKS((pbuf->addr % pginfo->hwpage_size) +
+				     pbuf->size, pginfo->hwpage_size);
+		offs_hw = (pbuf->addr & ~(pginfo->hwpage_size - 1)) /
+			pginfo->hwpage_size;
 		while (pginfo->next_hwpage < offs_hw + num_hw) {
 			/* sanity check */
 			if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
@@ -1768,21 +1992,23 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 				return -EFAULT;
 			}
 			*kpage = phys_to_abs(
-				(pbuf->addr & EHCA_PAGEMASK)
-				+ (pginfo->next_hwpage * EHCA_PAGESIZE));
+				(pbuf->addr & ~(pginfo->hwpage_size - 1)) +
+				(pginfo->next_hwpage * pginfo->hwpage_size));
 			if ( !(*kpage) && pbuf->addr ) {
-				ehca_gen_err("pbuf->addr=%lx "
-					     "pbuf->size=%lx "
+				ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx "
 					     "next_hwpage=%lx", pbuf->addr,
-					     pbuf->size,
-					     pginfo->next_hwpage);
+					     pbuf->size, pginfo->next_hwpage);
 				return -EFAULT;
 			}
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
-			if (pginfo->next_hwpage %
-			    (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-				(pginfo->kpage_cnt)++;
+			if (PAGE_SIZE >= pginfo->hwpage_size) {
+				if (pginfo->next_hwpage %
+				    (PAGE_SIZE / pginfo->hwpage_size) == 0)
+					(pginfo->kpage_cnt)++;
+			} else
+				pginfo->kpage_cnt += pginfo->hwpage_size /
+					PAGE_SIZE;
 			kpage++;
 			i++;
 			if (i >= number) break;
@@ -1806,8 +2032,8 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo,
 	/* loop over desired page_list entries */
 	fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
 	for (i = 0; i < number; i++) {
-		*kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
-				     pginfo->next_hwpage * EHCA_PAGESIZE);
+		*kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) +
+				     pginfo->next_hwpage * pginfo->hwpage_size);
 		if ( !(*kpage) ) {
 			ehca_gen_err("*fmrlist=%lx fmrlist=%p "
 				     "next_listelem=%lx next_hwpage=%lx",
@@ -1817,15 +2043,38 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo,
 			return -EFAULT;
 		}
 		(pginfo->hwpage_cnt)++;
-		(pginfo->next_hwpage)++;
-		kpage++;
-		if (pginfo->next_hwpage %
-		    (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) {
-			(pginfo->kpage_cnt)++;
-			(pginfo->u.fmr.next_listelem)++;
-			fmrlist++;
-			pginfo->next_hwpage = 0;
+		if (pginfo->u.fmr.fmr_pgsize >= pginfo->hwpage_size) {
+			if (pginfo->next_hwpage %
+			    (pginfo->u.fmr.fmr_pgsize /
+			     pginfo->hwpage_size) == 0) {
+				(pginfo->kpage_cnt)++;
+				(pginfo->u.fmr.next_listelem)++;
+				fmrlist++;
+				pginfo->next_hwpage = 0;
+			} else
+				(pginfo->next_hwpage)++;
+		} else {
+			unsigned int cnt_per_hwpage = pginfo->hwpage_size /
+				pginfo->u.fmr.fmr_pgsize;
+			unsigned int j;
+			u64 prev = *kpage;
+			/* check if addrs are contiguous */
+			for (j = 1; j < cnt_per_hwpage; j++) {
+				u64 p = phys_to_abs(fmrlist[j] &
+						    ~(pginfo->hwpage_size - 1));
+				if (prev + pginfo->u.fmr.fmr_pgsize != p) {
+					ehca_gen_err("uncontiguous fmr pages "
+						     "found prev=%lx p=%lx "
+						     "idx=%x", prev, p, i + j);
+					return -EINVAL;
+				}
+				prev = p;
+			}
+			pginfo->kpage_cnt += cnt_per_hwpage;
+			pginfo->u.fmr.next_listelem += cnt_per_hwpage;
+			fmrlist += cnt_per_hwpage;
 		}
+		kpage++;
 	}
 	return ret;
 }
@@ -1842,7 +2091,9 @@ int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo,
 		ret = ehca_set_pagebuf_phys(pginfo, number, kpage);
 		break;
 	case EHCA_MR_PGI_USER:
-		ret = ehca_set_pagebuf_user1(pginfo, number, kpage);
+		ret = PAGE_SIZE >= pginfo->hwpage_size ?
+			ehca_set_pagebuf_user1(pginfo, number, kpage) :
+			ehca_set_pagebuf_user2(pginfo, number, kpage);
 		break;
 	case EHCA_MR_PGI_FMR:
 		ret = ehca_set_pagebuf_fmr(pginfo, number, kpage);
@@ -1895,9 +2146,9 @@ void ehca_mrmw_map_acl(int ib_acl,
 /*----------------------------------------------------------------------*/
 
 /* sets page size in hipz access control for MR/MW. */
-void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/
+void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl) /*INOUT*/
 {
-	return; /* HCA supports only 4k */
+	*hipz_acl |= (ehca_encode_hwpage_size(pgsize) << 24);
 } /* end ehca_mrmw_set_pgsize_hipz_acl() */
 
 /*----------------------------------------------------------------------*/
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h
index 24f13fe..bc8f4e3 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.h
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h
@@ -111,7 +111,7 @@ int ehca_mr_is_maxmr(u64 size,
 void ehca_mrmw_map_acl(int ib_acl,
 		       u32 *hipz_acl);
 
-void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl);
+void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl);
 
 void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl,
 			       int *ib_acl);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 3394e05..358796c 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -427,7 +427,8 @@ u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle,
 {
 	return ehca_plpar_hcall_norets(H_REGISTER_RPAGES,
 				       adapter_handle.handle,      /* r4  */
-				       queue_type | pagesize << 8, /* r5  */
+				       (u64)queue_type | ((u64)pagesize) << 8,
+				       /* r5  */
 				       resource_handle,	           /* r6  */
 				       logical_address_of_page,    /* r7  */
 				       count,	                   /* r8  */
@@ -724,6 +725,9 @@ u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle,
 	u64 ret;
 	u64 outs[PLPAR_HCALL9_BUFSIZE];
 
+	ehca_gen_dbg("kernel PAGE_SIZE=%x access_ctrl=%016x "
+		     "vaddr=%lx length=%lx",
+		     (u32)PAGE_SIZE, access_ctrl, vaddr, length);
 	ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs,
 				adapter_handle.handle,            /* r4 */
 				5,                                /* r5 */
@@ -746,8 +750,22 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle,
 			     const u64 logical_address_of_page,
 			     const u64 count)
 {
+	extern int ehca_debug_level;
 	u64 ret;
 
+	if (unlikely(ehca_debug_level >= 2)) {
+		if (count > 1) {
+			u64 *kpage;
+			int i;
+			kpage = (u64 *)abs_to_virt(logical_address_of_page);
+			for (i = 0; i < count; i++)
+				ehca_gen_dbg("kpage[%d]=%p",
+					     i, (void *)kpage[i]);
+		} else
+			ehca_gen_dbg("kpage=%p",
+				     (void *)logical_address_of_page);
+	}
+
 	if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) {
 		ehca_gen_err("logical_address_of_page not on a 4k boundary "
 			     "adapter_handle=%lx mr=%p mr_handle=%lx "
-- 
1.5.2
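
The contiguity check in the else branch of ehca_set_pagebuf_fmr() above can be looked at in isolation. What follows is only a minimal user-space sketch: it assumes plain 64-bit addresses, leaves out the phys_to_abs() translation and the masking done in the patch, and `pages_are_contiguous` with its parameters is a hypothetical name that does not exist in the ehca driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the ehca driver): returns 1 if the
 * `count` page addresses in `list` are contiguous, i.e. each entry lies
 * exactly `pgsize` bytes after the previous one; otherwise returns 0.
 * A list with fewer than two entries is trivially contiguous.
 */
static int pages_are_contiguous(const uint64_t *list, size_t count,
				uint64_t pgsize)
{
	size_t j;

	assert(list != NULL || count == 0);
	for (j = 1; j < count; j++)
		if (list[j - 1] + pgsize != list[j])
			return 0;
	return 1;
}
```

In the patch the same idea is what allows several small FMR pages to be registered as one larger hardware page: if any neighbour is not exactly one FMR page further on, the function gives up with -EINVAL.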


From halr at voltaire.com  Thu Jul 12 09:04:45 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 12:04:45 -0400
Subject: [ofa-general] Re: [PATCH] opensm/updn: root detector function
	simplification
In-Reply-To: <20070712024716.GA2248@sashak.voltaire.com>
References: <20070712024716.GA2248@sashak.voltaire.com>
Message-ID: <1184255984.17622.197967.camel@hal.voltaire.com>

On Wed, 2007-07-11 at 22:47, Sasha Khapyorsky wrote:
> There are pretty cosmetic simplifications for up/down root auto detector
> function - reducing some vars and flows.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From mshefty at ichips.intel.com  Thu Jul 12 09:19:26 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 12 Jul 2007 09:19:26 -0700
Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
In-Reply-To: <4695DBE3.1070002@voltaire.com>
References: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
	<4695DBE3.1070002@voltaire.com>
Message-ID: <4696548E.60208@ichips.intel.com>

> You have approved this patch for OFED 1.2.1, does it suitable also for 
> upstream, and if not how you think it would be correct to proceed?

I started the following thread to determine an appropriate upstream fix:

http://lists.openfabrics.org/pipermail/general/2007-July/037763.html

I wasn't sure that we'd have an upstream fix ready in time for OFED 1.2.1.

- Sean


From becker at nas.nasa.gov  Thu Jul 12 09:28:29 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Thu, 12 Jul 2007 09:28:29 -0700
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
	<A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
Message-ID: <795c49870707120928h16d86980ga2cd272dab26865@mail.gmail.com>

Hi Jeff. Ping received. Will git (8^)) to it when I can. I'm in the
middle of an acceptance test, part of which is getting OFED 1.2 up on
an IBM Power/ehca system.

-jeff

On 7/11/07, Jeff Squyres <jsquyres at cisco.com> wrote:
> Just a ping again to make sure that this request doesn't get lost...
>
> On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote:
>
> > I notice that http://git.openfabrics.org/ shows the main OFA web
> > site, but http://git.openfabrics.org/git/ shows all the git
> > repositories.
> >
> > Can a redirect be installed such that http://git.openfabrics.org/
> > is automatically sent to http://git.openfabrics.org/git/?
> >
> > I think that would be a little more intuitive.
> >
> > Thanks!
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From becker at nas.nasa.gov  Thu Jul 12 09:46:20 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Thu, 12 Jul 2007 09:46:20 -0700
Subject: [ofa-general] OFED-1.2 release download link
In-Reply-To: <46962B6B.8050903@dev.mellanox.co.il>
References: <46962B6B.8050903@dev.mellanox.co.il>
Message-ID: <795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com>

GA link should now be correct.

-jeff

On 7/12/07, Vladimir Sokolovsky <vlad at dev.mellanox.co.il> wrote:
> Hi,
> OFED-1.2 is currently available at
> http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz
>
> OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and RHEL 5.0
> can be downloaded from:
> http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/
>
> Note:
> On http://www.openfabrics.org/downloads.htm
> OFED 1.2 GA link points to the wrong place.
>
>
> Regards,
> Vladimir


From rdreier at cisco.com  Thu Jul 12 10:15:24 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 10:15:24 -0700
Subject: [ofa-general] Re: [PATCH 00/10] IB/ehca: Multiple Event Queues,
	MR/MW rework, large page MRs, fixes
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 12 Jul 2007 17:45:26 +0200")
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <adawsx55ys3.fsf@cisco.com>

 > Note that patch 7 will introduce a few lines over 80 chars that will be
 > unindented in patch 8 - I hope that's okay with you.

That's fine -- the 80 column rule is one thing I don't worry about too
much; absurdly long lines are bad, but if a line is, say, 84 chars and
breaking it makes the code uglier, then I just leave the 84 char line.

 > [09/10] fixes a lot of checkpatch.pl warnings

Are these warnings from earlier patches in the series, or problems
that already existed in the code?  If they are coming from other
patches in the series, please just fix the earlier patches before I
merge them.

Thanks,
  Roland


From tnguyen at pantasys.com  Thu Jul 12 10:38:42 2007
From: tnguyen at pantasys.com (Tung M. Nguyen)
Date: Thu, 12 Jul 2007 10:38:42 -0700
Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
In-Reply-To: <795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com>
References: <46962B6B.8050903@dev.mellanox.co.il>
	<795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com>
Message-ID: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN>

Guys,
I saw the attached message regarding an OFED 1.2 rc9. This is quite
confusing. Do we have a GA version or not? It seems that there is some
work that needs to be done for Mellanox's latest HCA, ConnectX. Maybe it
should not hold up OFED 1.2 GA?

Regards,
Tung 

 
> -----Original Message-----
> From: ewg-bounces at lists.openfabrics.org 
> [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Becker
> Sent: Thursday, July 12, 2007 9:46 AM
> To: Vladimir Sokolovsky
> Cc: Jeffrey Scott; OpenFabricsEWG; OpenFabrics General
> Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
> 
> GA link should now be correct.
> 
> -jeff
> 
> On 7/12/07, Vladimir Sokolovsky <vlad at dev.mellanox.co.il> wrote:
> > Hi,
> > OFED-1.2 is currently available at
> > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz
> >
> > OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 
> and RHEL 5.0
> > can be downloaded from:
> > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/
> >
> > Note:
> > On http://www.openfabrics.org/downloads.htm
> > OFED 1.2 GA link points to the wrong place.
> >
> >
> > Regards,
> > Vladimir
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
-------------- next part --------------
An embedded message was scrubbed...
From: "Tziporet Koren" <tziporet at mellanox.co.il>
Subject: [ewg] OFED 1.2.c-9 is available
Date: Thu, 12 Jul 2007 08:01:08 -0700
Size: 11035
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070712/74dcac14/attachment.mht>

From rdreier at cisco.com  Thu Jul 12 10:42:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 10:42:39 -0700
Subject: [ofa-general] Moving On
In-Reply-To: <1184191631.17622.128348.camel@hal.voltaire.com> (Hal
	Rosenstock's message of "11 Jul 2007 18:07:43 -0400")
References: <1184191631.17622.128348.camel@hal.voltaire.com>
Message-ID: <adamyy15xio.fsf@cisco.com>

Hal,

Good luck with whatever comes next in your life.

I guess it makes sense to remove the halr at voltaire.com line from the
kernel MAINTAINERS file.  Do you want to replace it with your gmail
address, or just move your entry out of MAINTAINERS and into CREDITS?

 - R.


From sean.hefty at intel.com  Thu Jul 12 10:45:07 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 12 Jul 2007 10:45:07 -0700
Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
In-Reply-To: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN>
Message-ID: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>

>I saw the attached message regarding a OFED 1.2 rc9. This is quite
>confusing. Do we have a GA version or not? It seems that there is some
>work needs to be done for Mellanox latest HCA, ConnectX. maybe it should
>not hold up OFED 1.2 GA?

Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', but just 'c-9' -
meaning it includes support for Mellanox ConnectX adapter).  OFED 1.2 GA was
released in June. 

Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox specific code
release that repackages the OFED 1.2 code?

- Sean


From tnguyen at pantasys.com  Thu Jul 12 10:47:58 2007
From: tnguyen at pantasys.com (Tung M. Nguyen)
Date: Thu, 12 Jul 2007 10:47:58 -0700
Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
In-Reply-To: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>
References: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN>
	<000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>
Message-ID: <002501c7c4ac$c8c57030$8c28010a@EXECTMN>

Oops.
I missed it. Sorry for the spam.

Regards,
Tung 

 
> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Thursday, July 12, 2007 10:45 AM
> To: 'Tung M. Nguyen'; 'Jeff Becker'; 'Vladimir Sokolovsky'
> Cc: 'OpenFabricsEWG'; 'OpenFabrics General'
> Subject: RE: [ewg] Re: [ofa-general] OFED-1.2 release download link
> 
> >I saw the attached message regarding a OFED 1.2 rc9. This is quite
> >confusing. Do we have a GA version or not? It seems that 
> there is some
> >work needs to be done for Mellanox latest HCA, ConnectX. 
> maybe it should
> >not hold up OFED 1.2 GA?
> 
> Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', 
> but just 'c-9' -
> meaning it includes support for Mellanox ConnectX adapter).  
> OFED 1.2 GA was
> released in June. 
> 
> Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox 
> specific code
> release that repackages the OFED 1.2 code?
> 
> - Sean


From halr at voltaire.com  Thu Jul 12 12:09:30 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 15:09:30 -0400
Subject: [ofa-general] Moving On
In-Reply-To: <adamyy15xio.fsf@cisco.com>
References: <1184191631.17622.128348.camel@hal.voltaire.com>
	<adamyy15xio.fsf@cisco.com>
Message-ID: <1184267369.13276.12705.camel@hal.voltaire.com>

Roland,

On Thu, 2007-07-12 at 13:42, Roland Dreier wrote:
> Hal,
> 
> Good luck with whatever comes next in your life.

Thanks.

> I guess it makes sense to remove the halr at voltaire.com line from the
> kernel MAINTAINERS file.  Do you want to replace it with your gmail
> address, or just move your entry out of MAINTAINERS and into CREDITS?

I think the best thing for now is to replace it with my gmail account
(to make sure SMI and agent are covered at a minimum).

-- Hal

>  - R.


From pradeeps at linux.vnet.ibm.com  Thu Jul 12 12:28:27 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 12 Jul 2007 12:28:27 -0700
Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation
	(for IPoIB CM)
In-Reply-To: <adamyyt2fp5.fsf@cisco.com>
References: <OFE2D9DB0E.0AD8F1B0-ON85257300.007854EA-85257300.0079CE19@us.ibm.com>
	<adamyyt2fp5.fsf@cisco.com>
Message-ID: <469680DB.6000602@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > It is not clear if anything is better yet, but instead you have to go back 
>  > to the IPoIB-CM  RFC 4755 that we wrote. In the spec you will see that the 
>  > approach for this driver is to have the IPoIB driver select the most 
>  > appropriate method of connecting. If RC was not available then UD was 
>  > used. You can extend that to UC mode as Michael proposed, as long as you 
>  > allow selecting the most appropriate method of connection. By pushing the 
>  > issue of SRQ or not SRQ to the driver you have broken the IPoIB-CM 
>  > original design. Since SRQ was not a required function in the IB spec we 
>  > never addressed that issue in the RFC along with UC. I think we can agree 
>  > that adding UC is a good thing and follows the approach in the original 
>  > spec. Including SRQ as one of the tests for the best possible connection 
>  > method follows this same approach.
> 
>  > ....
> 
> I can't really follow this.  We're talking about the internal
> implementation inside the Linux kernel, which I really hope that an
> IETF RFC does not address at all.  We surely intend to follow the RFC,
> and if we run into problems because the RFC was written without any
> implementation experience, then we'll work to correct those problems
> through a new IETF document.
> 
> It makes perfect sense for ehca systems to be able to use IPoIB CM.  I
> understand that current ehca HW doesn't natively support SRQs.  The
> only question is how to implement IPoIB CM for ehca systems, and we
> have to weigh tradeoffs like avoiding code duplication vs the
> additional cost of branches on the data path.
> 

In the absence of any further discussions about the IPoIB CM without SRQ
patches, I will incorporate Sean Hefty's comments and plan to resubmit
the patches, unless I hear something soon.

Pradeep


From ardavis at ichips.intel.com  Thu Jul 12 12:43:04 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 12 Jul 2007 12:43:04 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46956FF9.50102@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>
	<46956FF9.50102@ichips.intel.com>
Message-ID: <46968448.2000401@ichips.intel.com>

Arlin Davis wrote:

> The proposal was attempting to come up with a method to automatically 
> link to a package and description file from the download webpage. I 
> have no problem
> targeting http://openfabrics.org/downloads as long as we come up with 
> a way for the webpage to correlate a description with a package 
> without hand coding the links everytime. We need to come up with a 
> method for automatic links to keep our download webpage updated and 
> complete.
>
> What if we add a directory for each project under downloads and 
> provide a README for a description? Other suggestions?
>
Here is a stab at what we have today for discussion purposes:

Linux Libraries:
    - libibverbs - http://www.openfabrics.org/downloads/
    - librdmacm - http://www.openfabrics.org/~shefty/
    - dapl - http://www.openfabrics.org/~ardavis/
    - management - http://www.openfabrics.org/~halr/
OFED Linux:
    - OFED 1.2 release -
      http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz
    - OFED 1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and
      RHEL 5.0 -
      http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/
    - OFED connectx release -
      http://www.openfabrics.org/builds/connectx/release/
OFED Linux Archives:
    - SLES 10 OFED 1.0 RPMS - http://www.openfabrics.org/downloads/
    - OFED 1.1 release -
      https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/releases/
    - OFED 1.0 release -
      https://svn.openfabrics.org/svn/openib/gen2/branches/1.0/ofed/releases/
WinOF for Windows:
    - WinOF 1.0 release -
      http://www.openfabrics.org/~ardavis/WinOF 1.0/WinOF_1-0.zip
    - WinOF source - svn://openib.tc.cornell.edu
    - WinOF faq -
      https://wiki.openfabrics.org/tiki-index.php?page=OpenIB+Windows

I would like to propose adding project directories under 
http://www.openfabrics.org/downloads/  where appropriate and give 
maintainers access. For example:

http://www.openfabrics.org/downloads/verbs (rdreier)
http://www.openfabrics.org/downloads/rdmacm (shefty)
http://www.openfabrics.org/downloads/dapl (ardavis)
http://www.openfabrics.org/downloads/management (sashak)
http://www.openfabrics.org/downloads/OFED (vlad) 
http://www.openfabrics.org/downloads/WinOF (ardavis)
http://www.openfabrics.org/downloads/archives (vlad) ??
etc...

Each of these would contain a README that details the contents of the 
directory along with WEB_README that provides a short description for 
the webpage. Jeff could then automatically parse for directories under 
downloads and if it contains WEB_README add a webpage link to the 
directory along with the short description.

Jeff, is this possible?
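
To make the proposal concrete, the directory scan described above could look roughly like the sketch below. It is only an illustration under assumptions: it uses POSIX opendir()/readdir()/stat(), the function names `has_web_readme` and `list_download_projects` are hypothetical, and it prints directory names rather than generating any webpage.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 1 if <dir>/WEB_README exists and is a regular file, else 0. */
static int has_web_readme(const char *dir)
{
	char path[4096];
	struct stat st;
	int n = snprintf(path, sizeof(path), "%s/WEB_README", dir);

	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;	/* path too long, treat as absent */
	return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/*
 * Hypothetical scanner: prints one line per entry under `root` that
 * contains a WEB_README file and returns how many were found, or -1 if
 * `root` cannot be opened.
 */
static int list_download_projects(const char *root, FILE *out)
{
	DIR *d = opendir(root);
	struct dirent *e;
	char sub[4096];
	int found = 0;

	if (!d)
		return -1;
	while ((e = readdir(d)) != NULL) {
		int n;

		if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
			continue;
		n = snprintf(sub, sizeof(sub), "%s/%s", root, e->d_name);
		if (n < 0 || (size_t)n >= sizeof(sub))
			continue;	/* skip over-long paths */
		if (has_web_readme(sub)) {
			fprintf(out, "%s\n", e->d_name);
			found++;
		}
	}
	closedir(d);
	return found;
}
```

A real version would also read the short description out of each WEB_README; the point here is only that a directory without that file is skipped automatically.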

comments?

-arlin


From or.gerlitz at gmail.com  Thu Jul 12 13:18:07 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 23:18:07 +0300
Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module
	parameter
In-Reply-To: <4696548E.60208@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE04667213@orsmsx416.amr.corp.intel.com>
	<4695DBE3.1070002@voltaire.com> <4696548E.60208@ichips.intel.com>
Message-ID: <15ddcffd0707121318h7c9a037ap5f6dc5cf182fb529@mail.gmail.com>

On 7/12/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> I started the following thread to determine an appropriate upstream fix:
>
> http://lists.openfabrics.org/pipermail/general/2007-July/037763.html
>
> I wasn't sure that we'd have an upstream fix ready in time for OFED 1.2.1.
>

Got it. I guess that if the upstream solution turns out to be different
from this patch, you would ask to remove it from OFED and deploy the
upstream one.

Or.

From twbowman at gmail.com  Thu Jul 12 13:53:23 2007
From: twbowman at gmail.com (Todd Bowman)
Date: Thu, 12 Jul 2007 14:53:23 -0600
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <1184170906.17622.104663.camel@hal.voltaire.com>
References: <46826370.4090602@hp.com>
	<1182978496.28870.106214.camel@hal.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com>
	<20070710094659.50df9b39.weiny2@llnl.gov>
	<6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>
	<1184160670.17622.92728.camel@hal.voltaire.com>
	<4694E61F.8000502@hp.com>
	<1184163750.17622.96256.camel@hal.voltaire.com>
	<4694F085.4010502@hp.com>
	<1184170906.17622.104663.camel@hal.voltaire.com>
Message-ID: <ab84dc40707121353t2df005b3x5cdb80fb48027df6@mail.gmail.com>

This seems to be a good topic to share some work we have been doing here at
LANL.  ibmon is an app that I developed that is currently monitoring our IB
production systems.  It's small, written in C and Perl, follows the
standalone model, and is SM independent.  It can be found at
http://sourceforge.net/projects/ibmon.

Key features:
- SM independent
- Reports "interesting" events via syslog, email or console
- Events can be reported in detailed and/or "high-level" form
- Detailed events are reported as a "point-to-point" link.
        - Makes for easier transformation to "high-level" form
- Fast: a query on a ~4000 node network takes < 5s.
- Uses sqlite for internal temp storage and archival storage.
- Modular design: discover, query and reporting are separated.  Can move
  towards distributed model.
- Built for crontab.
- Can clear counters on query or when pegged.
- Keeps historical performance and topology data
- Gathers and stores most of the IB tables:
        nodeinfo, switchinfo, sminfo, portinfo, perfcounters, lfdb (optional)
- Reports changes in SMs

Known issues:
- Does not receive SM traps, needs to rediscover every so often.
- Threshold values for errors need to be moved to a config file; they are
  currently in a db.
- Does not clear counters when "nearly" pegged.
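
As an illustration of the last known issue, a "nearly pegged" test for a saturating counter might be sketched as follows. This is not taken from ibmon: `counter_nearly_pegged` is a hypothetical helper, and the counter width and percentage threshold are assumptions the caller has to supply.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not ibmon code): returns 1 when a saturating
 * counter that is `bits` wide (1..64) has reached at least `percent`
 * (0..100) of its maximum value, otherwise 0.
 */
static int counter_nearly_pegged(uint64_t value, unsigned int bits,
				 unsigned int percent)
{
	uint64_t max = (bits >= 64) ? UINT64_MAX
				    : ((UINT64_C(1) << bits) - 1);
	/* Divide before multiplying so wide counters cannot overflow. */
	uint64_t threshold = max / 100 * percent + (max % 100) * percent / 100;

	return value >= threshold;
}
```

With a check like this, counters could be cleared a little before they saturate, so that no events are lost between two queries.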

Todd

On 11 Jul 2007 12:21:51 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
>
> On Wed, 2007-07-11 at 11:00, Mark Seger wrote:
> > Hal Rosenstock wrote:
> >
> > >On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
> > >
> > >
> > >>My basic philosophy, and I suspect there are those who might disagree,
> > >>is that you can't use the network to monitor the network, at least not
> > >>in times of trouble.
> > >>
> > >>
> > >
> > >Right, in times of certain troubles.
> > >
> > >
> > and that is the key.  since you can't know apriori when you're about to
> > have troubles, you need to be collecting the data locally before they
> > occur.
> >
> > >>That's why I insist on having to query the HCAs
> > >>directly since I can't always be sure the network is there and/or
> > >>reliable.  If you are willing to concede that this can indeed happen
> > >>than the question becomes one of how do you reliably get data from an
> > >>HCA and that's the basis for my (re)starting this discussion.
> > >>
> > >>
> > >
> > >The reliability comes from timeout/retry mechanisms. If performance data
> > >cannot be obtained on an IB network, it needs to be trouble shooted at a
> > >lower level (by SMPs).
> > >
> > >In any case, a rearchitecture of the PMA was proposed and seems
> > >reasonable to me in that it can accomodate either approach. All that is
> > >needed now is for someone to step up and champion an implementation of
> > >this. Unfortunately, I do not have time to do so.
> > >
> > >
> > I don't know if what I've been proposing requires any rearchitecting as
> > I see is as something local to each node.  Specificially, and there is
> > already an implementation of this in an earlier voltaire stack, is to
> > export wrapping HCA counters to /proc.  The module that does this
> > read/clears the counters on every access but since no local applications
> > are accessing the counters directly, clearing them doesn't hurt anyone.
> > Alas, anyone else who wants to query the counters will find them reset.
>
> No local application but perhaps a remote one. This is the reason for
> the proposed rearchitecture (along with synthesizing the wider
> counters).
>
> -- Hal
>
> > The other side benefit of exporting these counters is such a way is now
> > lots of others can collect/report this info.  In other words is someone
> > chose to add IB stats to sar, it would become very easy to do!
> >
> > If this is the type of thing people are interested in, I might be able
> > to supply some code to do it.
> >
> > >>As for querying the switch for counters, what do you do on a very large
> > >>network, say 10s of thousands of nodes if you want to get performance
> > >>data every second?  I also realize this is an extreme situation today
> > >>(the node count not the frequency of monitoring) but I'm sure everyone
> > >>would agree systems of these sizes are not that far off.
> > >>
> > >>
> > >
> > >You have a distributed performance manager to handle this. A hierarchy
> > >of performance managers has been discussed on the list before.
> > >
> > >
> > ahh, I see.
> > -mark
> >
> >
>

From akpm at linux-foundation.org  Thu Jul 12 14:35:01 2007
From: akpm at linux-foundation.org (Andrew Morton)
Date: Thu, 12 Jul 2007 14:35:01 -0700
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <1184097931.3020.73.camel@localhost.localdomain>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
	<1183422700.3130.27.camel@localhost.localdomain>
	<200707041611.30056.hnguyen@linux.vnet.ibm.com>
	<1184097931.3020.73.camel@localhost.localdomain>
Message-ID: <20070712143501.2c2cdf1f.akpm@linux-foundation.org>

On Tue, 10 Jul 2007 16:05:31 -0400
Jim Houston <jim.houston at ccur.com> wrote:

> Hoang-Nam Nguyen reported a bug in idr_get_new_above() 
> which occurred with a starting id value like 0x3ffffffc.
> His test module easily reproduced the problem.  Thanks.
> 
> The test revealed the following bugs:
> 
> 1. Relying on shift operations which have undefined results
>    e.g.: 1 << n where n > word size.  On i386 an integer shift
>    only uses the low 5 bits of the shift count.
> 
> 2. An off by one error which prevented the top most layer
>    of the radix tree from being allocated.  This meant that
>    sub_alloc() would allocate an entry in the existing portion
>    of the radix tree which aliased the requested address.  When
>    it tried to allocate id 0x40000000, it might use the slot 
>    belonging to id 0.
> 
> 3. There was also a failure in the code which walked back up
>    the tree if an allocation failed.  The normal case is to
>    descend the tree checking the starting id value against the
>    bitmap at each level.  If the bit is set, we know that the
>    entire sub-tree is full and we can short cut the search.
>    We may still descend to the lowest level and find that the
>    portion of the id space we want is full.  In this case we
>    need to walk back up the tree and continue the search.
>    The existing code just returned to the previous level and
>    continued.  This resulted in an attempt to allocate an id
>    above 0x3ffffffc using the slot for id 0x3ffffc00 instead of
>    0x40000000 which it then claimed to have allocated.  The same
>    problem occurs with 0x3ff as the requested id value if it
>    is already in use.
> 
> With this patch, idr.c should work as advertised allocating id
> values in the range 0...0x7fffffff.  Andrew had speculated that
> it should allow the full range 0...0xffffffff to be used.  I was
> tempted to make changes to allow this, but it would require changes
> to API, e.g. making the starting id value and the return value
> unsigned.
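
Bug 1 in the quoted list is a general C hazard: shifting by a count that is greater than or equal to the width of the type is undefined, which is why i386 quietly using only the low 5 bits of the count can surprise. A minimal sketch of a guarded shift follows; `safe_shl` is a hypothetical helper for illustration and is not the fix that was applied to idr.c.

```c
#include <limits.h>

/* Width of an unsigned int in bits on the build platform. */
#define UINT_BITS ((unsigned int)(sizeof(unsigned int) * CHAR_BIT))

/*
 * Hypothetical helper: shifts `value` left by `n` bits, but returns 0
 * when `n` is out of range instead of invoking undefined behaviour.
 */
static unsigned int safe_shl(unsigned int value, unsigned int n)
{
	if (n >= UINT_BITS)
		return 0;
	return value << n;
}
```

Using an unsigned operand also avoids the separate problem of shifting a 1 into the sign bit of a signed int.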

Problem.  There are a large number of IDR changes pending and this
patch breaks in a way which I am not at all confident in fixing.

Ordinarily I'd just dump the earlier patches because bugfixes come
first.  But this time there's a very large dependency trail on the
earlier patches (especially Tejun's extensive sysfs rework in Greg's
driver tree) so the wreckage would be extensive.

Also, it's possible that Tejun's changes already fixed some of the things
which you fixed.  Or added new bugs ;)

Bottom line: a reworked patch against 2.6.22-rc6-mm1 would be muchly
appreciated if poss, please.

While you're there, it would be helpful if you could review all these
pending IDR changes:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-ida-implement-idr-based-id-allocator.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-fix-obscure-bug-in-allocation-path.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-separate-out-idr_mark_full.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each-fix.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_remove_all.patch

Thanks.


From rick.jones2 at hp.com  Thu Jul 12 14:42:44 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Thu, 12 Jul 2007 14:42:44 -0700
Subject: [ofa-general] missing "balance" in aggregate bi-directional SDP bulk
	transfer
Message-ID: <4696A054.8010102@hp.com>

I've been trudging through a set of netperf tests with OFED 1.2, and 
came to a point where I was running concurrent netperf bidirectional 
tests through both ports of:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex 
(Tavor compatibility mode) (rev 20)


I configured ib0 and ib1 into separate IP subnets, and ran the 
"bidirectional TCP_RR" test (./configure --enable-bursts, large socket 
buffer, large req/rsp size and a burst of 12 transactions in flight at 
one time) and the results were rather even - each connection achieved 
about the same performance.

However, when I run the same test over SDP, some connections seem to get 
much better performance than others.  For example, with two concurrent 
connections, one over each port, one will get a much higher result than 
the other.

Below are four iterations of a pair of SDP_RR tests, one across each of
the two ports of the HCA (i.e. two concurrent netperfs, run four times in
a row).  What the output calls port "1" is running over ib0, and what it
calls "3" is running over ib1 (1 and 3 were the subnet numbers and were
simply convenient tags).  The units are transactions per second, and
process completion notification messages are trimmed for readability:

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3; \
do netperf -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t SDP_RR -- \
-s 1M -S 1M -r 64K -b 12 & done; wait; done

2294.65 port 1
10003.66 port 3

  398.63 port 1
11898.55 port 3

  269.73 port 3
12025.79 port 1

  478.29 port 3
11819.61 port 1

It doesn't seem that the favoritism is pegged to a specific port since 
they traded places there in the middle.

Now, if I reload the ib_sdp module, and set recv_poll and send_poll to 0 
I get this behaviour:

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3 ; 
do netperf -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t SDP_RR -- 
-s 1M -S 1M -r 64K -b 12 & done;wait;done


6132.89 port 1
6132.79 port 3

6127.32 port 1
6127.27 port 3

6006.84 port 1
6006.34 port 3

6134.83 port 1
6134.29 port 3


I guess it is possible for one of the netperfs or netservers to spin 
such that it precludes the other from running, even though I have four 
cores on the system.  For additional grins I pinned each 
netperf/netserver to its own CPU, with send_poll and recv_poll put 
back to their defaults (unloaded and reloaded the ib_sdp module):

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3 ; 
do netperf -T $i -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t 
SDP_RR -- -s 1M -S 1M -r 64K -b 12 & done;wait;done

10108.65 port 1
2187.80 port 3

7754.14 port 3
4541.81 port 1

7013.78 port 3
5282.01 port 1

6499.44 port 3
5796.42 port 1


And I still see this apparent starvation of one of the connections, 
although it isn't (overall) as bad as without the binding, so I guess it 
isn't anything one can work around via CPU binding trickery.  Is this 
behaviour expected?
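
For anyone who wants to put a number on the imbalance in the runs above,
here is a minimal sketch in C that computes Jain's fairness index over the
per-connection transaction rates.  The function name jain_fairness is made
up for illustration and is not part of netperf; the index is a standard
fairness metric, not something netperf reports.

```c
#include <stddef.h>

/*
 * Jain's fairness index: (sum x)^2 / (n * sum x^2).
 * 1.0 means every connection achieved the same rate; 1/n means a single
 * connection got everything.  Returns 0.0 for empty or all-zero input.
 * (Hypothetical helper, not part of netperf.)
 */
double jain_fairness(const double *rates, size_t n)
{
    double sum = 0.0, sum_sq = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += rates[i];
        sum_sq += rates[i] * rates[i];
    }
    if (n == 0 || sum_sq == 0.0)
        return 0.0;
    return (sum * sum) / ((double)n * sum_sq);
}
```

Fed the second pair from the first run (398.63 and 11898.55) the index
comes out near 0.53, close to the two-connection minimum of 0.5, while
the pairs from the recv_poll=0/send_poll=0 run come out at essentially 1.0.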

rick jones


From stan.smith at intel.com  Thu Jul 12 14:51:49 2007
From: stan.smith at intel.com (Smith, Stan)
Date: Thu, 12 Jul 2007 14:51:49 -0700
Subject: [ofa-general] WinOF 1.0 (Windows OpenFabrics) is available
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DC01667387@orsmsx413.amr.corp.intel.com>


WinOF 1.0 'gold release' is available @
http://www.openfabrics.org/~ardavis/WinOF_1.0/
 
A hearty 'Thank you' to all who assisted in WinOF 1.0 creation.
Special recognition goes to Erez Cohen for being patient.

Stan.


From cebbert at redhat.com  Thu Jul 12 14:56:59 2007
From: cebbert at redhat.com (Chuck Ebbert)
Date: Thu, 12 Jul 2007 17:56:59 -0400
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <20070712143501.2c2cdf1f.akpm@linux-foundation.org>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>	<1183422700.3130.27.camel@localhost.localdomain>	<200707041611.30056.hnguyen@linux.vnet.ibm.com>	<1184097931.3020.73.camel@localhost.localdomain>
	<20070712143501.2c2cdf1f.akpm@linux-foundation.org>
Message-ID: <4696A3AB.2020602@redhat.com>

On 07/12/2007 05:35 PM, Andrew Morton wrote:
>>
>> With this patch, idr.c should work as advertised allocating id
>> values in the range 0...0x7fffffff.  Andrew had speculated that
>> it should allow the full range 0...0xffffffff to be used.  I was
>> tempted to make changes to allow this, but it would require changes
>> to API, e.g. making the starting id value and the return value
>> unsigned.
> 
> Problem.  There are a large number of IDR changes pending and this
> patch breaks in a way which I am not at all confident in fixing.
> 
> Ordinarily I'd just dump the earlier patches because bugfixes come
> first.  But this time there's a very large dependency trail on the
> earlier patches (especially Tejun's extensive sysfs rework in Greg's
> driver tree) so the wreckage would be extensive.
> 
> Also, it's possible that Tejun's changes already fixed some of the things
> which you fixed.  Or added new bugs ;)
> 
> Bottom line: a reworked patch against 2.6.22-rc6-mm1 would be muchly
> appreciated if poss, please.
> 
> While you're there, it would be helpful if you could review all these
> pending IDR changes:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-ida-implement-idr-based-id-allocator.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-fix-obscure-bug-in-allocation-path.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-separate-out-idr_mark_full.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each-fix.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_remove_all.patch
> 

The first three just got merged into mainline...


From rdreier at cisco.com  Thu Jul 12 15:28:25 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 15:28:25 -0700
Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all
	non-LL UD QPs on eHCA2
In-Reply-To: <OF5A4240D2.61426860-ONC1257314.0047B786-C1257314.004BE620@de.ibm.com>
	(Christoph Raisch's message of "Tue, 10 Jul 2007 18:35:49 +0200")
References: <OF5A4240D2.61426860-ONC1257314.0047B786-C1257314.004BE620@de.ibm.com>
Message-ID: <adaabu15kae.fsf@cisco.com>

 > > What decides if a QP is LL or not?

 > Currently we use a high bit in the QP type, which is not how we
 > want to keep it permanently.  What would you suggest, add two
 > additional LL QP types, or change something more fundamental in
 > libibverbs and kernel ib core?  We think we can get along quite
 > well with the existing parameters in the current create QP.  The
 > current user-kernel interface is ok for these new QPs for post_send
 > + post_recv, but unfortunately the libibverbs userspace calls don't
 > match exactly how the LL queues are to be used.  We would need
 > something like the LL QP interface in libehca in libibverbs to keep
 > that interface generic.

Yes, using the high bit of the QP type is yucky.  If there's no need
for LL QPs in the kernel, then at least the internal part (libehca ->
ehca driver) could be cleaned up by using a flag in the create_qp
udata.  I think that's worth doing.

I also think it's worth exposing some more flags for the libibverbs
ibv_create_qp function.  mlx4 could potentially use a hint from the
user that certain QPs want low latency, so we could share this with ehca.

But I'm not sure I know what you mean by "how the LL queues are to be
used".  Could you expand on that?  I assume it has something to do
with ehcau_send_wr_trigger(), ehcau_recv_wr_trigger() etc. but I don't
know what they do.  Having libehca export functions that are called
directly by applications definitely seems wrong to me.

 > We didn't see a usage yet for LL QP in kernel, so maybe we should continue
 > that discussion on general at openfabrics only.

Makes sense, removed other CCs...

 - R.


From rdreier at cisco.com  Thu Jul 12 15:39:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 15:39:47 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: implement query-srq
In-Reply-To: <200706211303.11949.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 13:03:11 +0300")
References: <200706211303.11949.jackm@dev.mellanox.co.il>
Message-ID: <ada644p5jrg.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Thu Jul 12 15:39:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 15:39:55 -0700
Subject: [ofa-general] Re: [PATCH 1 of 2]  mlx4: implement query-qp
In-Reply-To: <200706211227.47794.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 12:27:47 +0300")
References: <200706211227.47794.jackm@dev.mellanox.co.il>
Message-ID: <ada1wfd5jr8.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Thu Jul 12 15:43:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 15:43:52 -0700
Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp
In-Reply-To: <200706211229.08703.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 12:29:08 +0300")
References: <200706211229.08703.jackm@dev.mellanox.co.il>
Message-ID: <adawsx54507.fsf@cisco.com>

 > +	init_attr->cap.max_recv_wr =  mqp->rq.max_post;
 > +	init_attr->cap.max_recv_sge =  mqp->rq.max_gs;

Why do we have to reset these in userspace?  Doesn't the kernel
already give us correct info for the receive queue?

 - R.


From rdreier at cisco.com  Thu Jul 12 15:57:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 15:57:36 -0700
Subject: [ewg] Re: [ofa-general] Moving On
In-Reply-To: <1184267369.13276.12705.camel@hal.voltaire.com> (Hal Rosenstock's
	message of "12 Jul 2007 15:09:30 -0400")
References: <1184191631.17622.128348.camel@hal.voltaire.com>
	<adamyy15xio.fsf@cisco.com>
	<1184267369.13276.12705.camel@hal.voltaire.com>
Message-ID: <adaps2x44db.fsf@cisco.com>

OK, I'll merge this upstream:

diff --git a/MAINTAINERS b/MAINTAINERS
index 96a174b..336edd9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1850,7 +1850,7 @@ M:	rolandd at cisco.com
 P:	Sean Hefty
 M:	mshefty at ichips.intel.com
 P:	Hal Rosenstock
-M:	halr at voltaire.com
+M:	hal.rosenstock at gmail.com 
 L:	general at lists.openfabrics.org
 W:	http://www.openib.org/
 T:	git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git


From rdreier at cisco.com  Thu Jul 12 16:07:59 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 16:07:59 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adalkdl43w0.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get the first batch of changes for the 2.6.23 merge window:

Andrew Morton (1):
      IB: Fix ib_umem_get() when npages == 0

Arthur Jones (3):
      IB/ipath: Update MAINTAINERS entry
      IB/ipath: Test interrupts at driver startup
      IB/ipath: Remove bogus RD_ATOMIC checks from modify_qp

Bryan O'Sullivan (1):
      IB/ipath: Include <linux/vmalloc.h> to fix ppc64 build

Dave Olson (5):
      IB/ipath: Support the IBA6110 revision 4
      IB/ipath: Fix the mtrr_add args for chips with 2 buffer sizes
      IB/ipath: Use S_ABORT not cancel and abort on exit freeze mode after recovery
      IB/ipath: Be more cautious about coming out of freeze mode
      IB/ipath: Change version wording to be less confusing with release number

Dotan Barak (2):
      mlx4_core: Get the maximum message size from reported device capabilities
      IB/core: Take sizeof the correct pointer when calling kmalloc()

Hal Rosenstock (1):
      IB/mad: Enhance SMI for switch support

Hoang-Nam Nguyen (3):
      IB/ehca: Change scaling_code parameter description to match default value
      IB/ehca: Report RDMA atomic attributes in query_qp()
      IB/ehca: Improve latency by unlocking after triggering the hardware

Jack Morgenstein (2):
      IB/mlx4: Implement query QP
      IB/mlx4: Implement query SRQ

Jan Engelhardt (1):
      IB: Use menuconfig for InfiniBand menu

Joachim Fenkes (9):
      IB/ehca: Refactor "maybe missed event" code
      IB/ehca: HW level, HW caps and MTU autodetection
      IB/ehca: QP code restructuring in preparation for SRQ
      IB/ehca: add Shared Receive Queue support
      IB/ehca: Lock renaming, static initializers
      IB/ehca: Refactor sync between completions and destroy_cq using atomic_t
      IB/ehca: Change idr spinlocks into rwlocks
      IB/ehca: Return QP pointer in poll_cq()
      IB/ehca: Notify consumers of LID/PKEY/SM changes after nondisruptive events

Joan Eslinger (1):
      IB/ipath: Change use of constants for TID type to defined values

John Gregor (2):
      IB/ipath: Remove incompletely implemented ipath_runtime flags and code
      IB/ipath: Update copyright dates

Mark Debbage (2):
      IB/ipath: Correct checking of swminor version field when using subports
      IB/ipath: Make handling of one subport consistent

Michael Albaugh (4):
      IB/ipath: Support blinking LEDs with an led_override file
      IB/ipath: Lock and always use shadow copies of GPIO register
      IB/ipath: Log "active" time and some errors to EEPROM
      IB/ipath: Add capability to modify PBC word

Michael S. Tsirkin (2):
      IB/mlx4: Include linux/mutex.h from mlx4_ib.h
      mlx4_core: Include linux/mutex.h from mlx4.h

Ralph Campbell (10):
      IB/ipath: Fix problem with next WQE after a UC completion
      IB/ipath: Fix local loopback bug when waiting for resources
      IB/ipath: Set M bit in BTH according to IB spec
      IB/ipath: Fix RDMA read retry code
      IB/ipath: Wait for PIO available interrupt
      IB/ipath: Fix possible data corruption if multiple SGEs used for receive
      IB/ipath: Duplicate RDMA reads can cause responder to NAK inappropriately
      IB/ipath: Add barrier before updating WC head in shared memory
      IB/ipath: Lower default number of kernel send buffers
      IB/ipath: Remove support for preproduction HTX InfiniPath cards

Robert Walsh (5):
      IB/ipath: Fix maximum MTU reporting
      IB/ipath: Fill in some missing FMR-related fields in query_device
      IB/ipath: Send ACK invalid where appropriate
      IB/ipath: ipath_poll fixups and enhancements
      IB/ipath: Clean send flags properly on QP reset

Roland Dreier (5):
      IB: Remove garbage non-ASCII characters from comments
      IB: Update mailing list address
      IPoIB/cm: Fix warning if IPV6 is not enabled
      IPoIB: Recycle loopback skbs instead of freeing and reallocating
      IB: Update MAINTAINERS with Hal's new email address

Sean Hefty (7):
      IB/ipath: return correct PortGUID in NodeInfo
      IB/sa: Make sure SA queries use default P_Key
      IB/cm: Use spin_lock_irq() instead of spin_lock_irqsave() when possible
      IB/cm: Include HCA ACK delay in local ACK timeout
      IB/cm: cm_msgs.h should include ib_cm.h
      IB/cm: Fix handling of duplicate SIDR REQs
      IB/cm: Send no match if a SIDR REQ does not match a listen

Shani Moideen (1):
      IB/mthca: Replace memset(<addr>, 0, PAGE_SIZE) with clear_page(<addr>)

Stefan Roscher (2):
      IB/ehca: Support UD low-latency QPs
      IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2

Steve Wise (6):
      RDMA/cxgb3: Streaming -> RDMA mode transition fixes
      RDMA/cxgb3: TERMINATE WRs can hang the tx ofld queue
      RDMA/cxgb3: Don't count neg_adv abort_req_rss messages as real aborts
      RDMA/cxgb3: ctrl-qp init/clear shouldn't set the gen bit
      RDMA/cxgb3: Don't post TID_RELEASE message
      RDMA/cxgb3: Don't abort after failures sending the mpa reply

WANG Cong (1):
      RDMA/cxgb3: Check return of kmalloc() in iwch_register_device()

 MAINTAINERS                                       |   15 +-
 drivers/infiniband/Kconfig                        |   15 +-
 drivers/infiniband/core/agent.c                   |   19 +-
 drivers/infiniband/core/cm.c                      |  247 ++++---
 drivers/infiniband/core/cm_msgs.h                 |    1 +
 drivers/infiniband/core/cma.c                     |    1 -
 drivers/infiniband/core/mad.c                     |   50 ++-
 drivers/infiniband/core/multicast.c               |    2 +-
 drivers/infiniband/core/sa.h                      |    2 +-
 drivers/infiniband/core/sa_query.c                |   87 ++-
 drivers/infiniband/core/smi.c                     |   16 +-
 drivers/infiniband/core/smi.h                     |    2 +
 drivers/infiniband/core/sysfs.c                   |    2 +-
 drivers/infiniband/core/ucm.c                     |    1 -
 drivers/infiniband/core/umem.c                    |    1 +
 drivers/infiniband/hw/amso1100/Kconfig            |    2 +-
 drivers/infiniband/hw/cxgb3/Kconfig               |    2 +-
 drivers/infiniband/hw/cxgb3/cxio_hal.c            |    6 +-
 drivers/infiniband/hw/cxgb3/cxio_wr.h             |    3 +-
 drivers/infiniband/hw/cxgb3/iwch_cm.c             |  108 ++--
 drivers/infiniband/hw/cxgb3/iwch_cm.h             |    1 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c       |    7 +-
 drivers/infiniband/hw/cxgb3/iwch_qp.c             |    7 +-
 drivers/infiniband/hw/ehca/Kconfig                |    2 +-
 drivers/infiniband/hw/ehca/ehca_av.c              |    6 +-
 drivers/infiniband/hw/ehca/ehca_classes.h         |   75 ++-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |    4 +-
 drivers/infiniband/hw/ehca/ehca_cq.c              |   50 +-
 drivers/infiniband/hw/ehca/ehca_hca.c             |   61 ++-
 drivers/infiniband/hw/ehca/ehca_irq.c             |  140 +++--
 drivers/infiniband/hw/ehca/ehca_irq.h             |    1 -
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |   18 +
 drivers/infiniband/hw/ehca/ehca_main.c            |   98 +++-
 drivers/infiniband/hw/ehca/ehca_qp.c              |  751 +++++++++++++++------
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   85 ++-
 drivers/infiniband/hw/ehca/ehca_tools.h           |    1 +
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |   13 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |   58 +-
 drivers/infiniband/hw/ehca/hcp_if.h               |    1 -
 drivers/infiniband/hw/ehca/hipz_hw.h              |   19 +
 drivers/infiniband/hw/ehca/ipz_pt_fn.h            |   28 +-
 drivers/infiniband/hw/ipath/Kconfig               |    2 +-
 drivers/infiniband/hw/ipath/ipath_common.h        |   33 +-
 drivers/infiniband/hw/ipath/ipath_cq.c            |    7 +-
 drivers/infiniband/hw/ipath/ipath_debug.h         |    2 +-
 drivers/infiniband/hw/ipath/ipath_diag.c          |   41 +-
 drivers/infiniband/hw/ipath/ipath_driver.c        |  187 +++++-
 drivers/infiniband/hw/ipath/ipath_eeprom.c        |  303 ++++++++-
 drivers/infiniband/hw/ipath/ipath_file_ops.c      |  205 ++++--
 drivers/infiniband/hw/ipath/ipath_fs.c            |    9 +-
 drivers/infiniband/hw/ipath/ipath_iba6110.c       |  101 ++--
 drivers/infiniband/hw/ipath/ipath_iba6120.c       |   92 ++-
 drivers/infiniband/hw/ipath/ipath_init_chip.c     |   26 +-
 drivers/infiniband/hw/ipath/ipath_intr.c          |  141 ++++-
 drivers/infiniband/hw/ipath/ipath_kernel.h        |   85 +++-
 drivers/infiniband/hw/ipath/ipath_keys.c          |    2 +-
 drivers/infiniband/hw/ipath/ipath_layer.c         |    2 +-
 drivers/infiniband/hw/ipath/ipath_layer.h         |    2 +-
 drivers/infiniband/hw/ipath/ipath_mad.c           |   11 +-
 drivers/infiniband/hw/ipath/ipath_mmap.c          |    2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c            |    2 +-
 drivers/infiniband/hw/ipath/ipath_qp.c            |   19 +-
 drivers/infiniband/hw/ipath/ipath_rc.c            |  116 +++-
 drivers/infiniband/hw/ipath/ipath_registers.h     |    2 +-
 drivers/infiniband/hw/ipath/ipath_ruc.c           |   36 +-
 drivers/infiniband/hw/ipath/ipath_srq.c           |    4 +-
 drivers/infiniband/hw/ipath/ipath_stats.c         |   25 +-
 drivers/infiniband/hw/ipath/ipath_sysfs.c         |   43 ++-
 drivers/infiniband/hw/ipath/ipath_uc.c            |    9 +-
 drivers/infiniband/hw/ipath/ipath_ud.c            |    6 +-
 drivers/infiniband/hw/ipath/ipath_user_pages.c    |    2 +-
 drivers/infiniband/hw/ipath/ipath_verbs.c         |   29 +-
 drivers/infiniband/hw/ipath/ipath_verbs.h         |    3 +-
 drivers/infiniband/hw/ipath/ipath_verbs_mcast.c   |    2 +-
 drivers/infiniband/hw/ipath/ipath_wc_ppc64.c      |    2 +-
 drivers/infiniband/hw/ipath/ipath_wc_x86_64.c     |   29 +-
 drivers/infiniband/hw/mlx4/Kconfig                |    1 -
 drivers/infiniband/hw/mlx4/main.c                 |    6 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h              |    4 +
 drivers/infiniband/hw/mlx4/qp.c                   |  137 ++++
 drivers/infiniband/hw/mlx4/srq.c                  |   18 +
 drivers/infiniband/hw/mthca/Kconfig               |    2 +-
 drivers/infiniband/hw/mthca/mthca_allocator.c     |    2 +-
 drivers/infiniband/hw/mthca/mthca_eq.c            |    2 +-
 drivers/infiniband/ulp/ipoib/Kconfig              |    2 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c           |    4 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c           |   33 +-
 drivers/infiniband/ulp/iser/Kconfig               |    2 +-
 drivers/infiniband/ulp/srp/Kconfig                |    2 +-
 drivers/net/cxgb3/version.h                       |    2 +-
 drivers/net/mlx4/fw.c                             |    3 +
 drivers/net/mlx4/fw.h                             |    1 +
 drivers/net/mlx4/main.c                           |    1 +
 drivers/net/mlx4/mlx4.h                           |    1 +
 drivers/net/mlx4/qp.c                             |   21 +
 drivers/net/mlx4/srq.c                            |   30 +
 include/linux/mlx4/device.h                       |    2 +
 include/linux/mlx4/qp.h                           |    3 +
 include/rdma/ib_cm.h                              |    1 -
 include/rdma/ib_mad.h                             |    3 +
 100 files changed, 2812 insertions(+), 1061 deletions(-)


From rdreier at cisco.com  Thu Jul 12 16:15:58 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 16:15:58 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adalkdl43w0.fsf@cisco.com> (Roland Dreier's message of "Thu,
	12 Jul 2007 16:07:59 -0700")
References: <adalkdl43w0.fsf@cisco.com>
Message-ID: <adahco943ip.fsf@cisco.com>

As you can see, I just sent my first 2.6.23 pull request for Linus.
There are still a few more things I plan to do before the merge
window closes (in ~10 days):

 - Write a patch to add P_Key handling to user_mad in the way we
   discussed (add an ioctl to enable P_Key mode without breaking old
   apps) -- I hope to do this tomorrow so we can get some review and
   testing before merging it.

 - Take a look at Sean's local SA caching patches.  I merged
   everything else from Sean's tree, but I'm still undecided about
   these.  I haven't read them carefully yet, but even aside from that
   I don't have a good feeling about whether there's consensus about
   this yet.  Any opinions about merging, for or against, would be
   appreciated here.

 - Merge up pending hardware driver changes, including the cxgb3 and
   ehca patches I have in my queue, plus Jack's catastrophic error
   patch for mlx4.

 - Try to get to resolution on the IPoIB "CM without SRQ" solution.

Also, if there's something I didn't list and didn't already include in
the tree I asked Linus to pull, please remind me.  I probably dropped it.

 - R.


From rdreier at cisco.com  Thu Jul 12 16:19:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 12 Jul 2007 16:19:47 -0700
Subject: [ofa-general] Re: [PATCH draft,
	untested] ehca srq emulation (for IPoIB CM)
In-Reply-To: <469680DB.6000602@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Thu, 12 Jul 2007 12:28:27 -0700")
References: <OFE2D9DB0E.0AD8F1B0-ON85257300.007854EA-85257300.0079CE19@us.ibm.com>
	<adamyyt2fp5.fsf@cisco.com> <469680DB.6000602@linux.vnet.ibm.com>
Message-ID: <adad4yx43cc.fsf@cisco.com>

 > In the absence of any further discussions about the IPoIB CM without SRQ
 > patches, I will incorporate Sean Hefty's comments and plan to resubmit
 > the patches, unless I hear something soon.

Sorry for not devoting enough time to this, but something always seems
to come up, and I really want to be able to focus a concentrated chunk
of time on this, and I never seem to be able to.  Anyway, I would
prefer to find a solution that everyone can agree on, without me
having to rule by decree.

I think updating the patch is a good idea.  Although I didn't get a
chance to review it carefully there were a number of obvious messy
parts that should be cleaned up.

I am beginning to think that your basic approach is probably right,
but I also still think it should be possible to handle both SRQ and
non-SRQ without any overhead on the fast path.  I don't understand the
"maintainability" argument against doing this.  Can you expand on your
position a little?

Thanks,
  Roland


From halr at voltaire.com  Thu Jul 12 17:17:37 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 20:17:37 -0400
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com>  <adahco943ip.fsf@cisco.com>
Message-ID: <1184285856.13276.34352.camel@hal.voltaire.com>

On Thu, 2007-07-12 at 19:15, Roland Dreier wrote:
> As you can see, I just sent my first 2.6.23 pull request for Linus.
> There are still a few more things I plan to do before the merge
> window closes (in ~10 days):
> 
>  - Write a patch to add P_Key handling to user_mad in the way we
>    discussed (add an ioctl to enable P_Key mode without breaking old
>    apps) -- I hope to do this tomorrow so we can get some review and
>    testing before merging it.

Unfortunately, I'll mostly just be able to review it. Not sure how much
testing I will be able to do but we'll see...

-- Hal

>  - Take a look at Sean's local SA caching patches.  I merged
>    everything else from Sean's tree, but I'm still undecided about
>    these.  I haven't read them carefully yet, but even aside from that
>    I don't have a good feeling about whether there's consensus about
>    this yet.  Any opinions about merging, for or against, would be
>    appreciated here.
> 
>  - Merge up pending hardware driver changes, including the cxgb3 and
>    ehca patches I have in my queue, plus Jack's catastrophic error
>    patch for mlx4.
> 
>  - Try to get to resolution on the IPoIB "CM without SRQ" solution.
> 
> Also, if there's something I didn't list and didn't already include in
> the tree I asked Linus to pull, please remind me.  I probably dropped it.
> 
>  - R.


From mshefty at ichips.intel.com  Thu Jul 12 18:14:27 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 12 Jul 2007 18:14:27 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
Message-ID: <4696D1F3.2040507@ichips.intel.com>

>  - Take a look at Sean's local SA caching patches.  I merged
>    everything else from Sean's tree, but I'm still undecided about
>    these.  I haven't read them carefully yet, but even aside from that
>    I don't have a good feeling about whether there's consensus about
>    this yet.  Any opinions about merging, for or against, would be
>    appreciated here.

Obviously I'm biased here, but we've definitely seen local caching of 
path records (PR) greatly improve performance for large MPI job runs. 
(Our largest jobs wouldn't run without it.)  The development of the 
feature was requested and paid for by the US national labs. 
Infinicon/Silverstorm/QLogic also had this feature in their IB stack for 
scalability reasons.  PR caching is done in the stack today by 
IPoIB.

The implementation is hidden under the current kernel ib_sa interface, 
is disabled by default, and automatically fails over to standard PR 
queries if needed.  Removing the cache later should be fairly easy.
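
As a rough illustration of the "disabled by default, fail over to a
standard query" behaviour described above, here is a minimal sketch in C.
Every name in it (sketch_path_rec, sketch_pr_cache, sketch_get_path,
sketch_sa_query_fn) is hypothetical and is not the ib_sa interface or the
code in the patches; the record is also simplified, since real GIDs are
128 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified path record: real GIDs are 128 bits wide. */
struct sketch_path_rec {
    uint64_t dgid;
    uint16_t dlid;
};

/* Hypothetical local cache: a flat table, searched linearly. */
struct sketch_pr_cache {
    int enabled;                             /* 0 = disabled (the default) */
    const struct sketch_path_rec *entries;
    size_t count;
};

/* Stand-in for a standard SA path record query; returns 0 on success. */
typedef int (*sketch_sa_query_fn)(uint64_t dgid, struct sketch_path_rec *out);

/*
 * Look the destination up in the local cache first.  If the cache is
 * disabled or has no matching entry, fall back to the SA query.
 * Returns 0 on success, -1 if no answer could be obtained.
 */
int sketch_get_path(const struct sketch_pr_cache *cache, uint64_t dgid,
                    sketch_sa_query_fn sa_query, struct sketch_path_rec *out)
{
    size_t i;

    if (cache != NULL && cache->enabled) {
        for (i = 0; i < cache->count; i++) {
            if (cache->entries[i].dgid == dgid) {
                *out = cache->entries[i];
                return 0;
            }
        }
    }
    /* Cache disabled or miss: fall back to the SA. */
    if (sa_query == NULL)
        return -1;
    return sa_query(dgid, out);
}
```

Because callers only ever see the one lookup function, dropping the cache
later would amount to always taking the fallback branch.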

But to be fair, it will be difficult to enable both QoS and local PR 
caching.  To me, this would be the strongest reason against using it. 
However, QoS places additional burden on the SA, which will make scaling 
even more challenging.

- Sean


From htejun at gmail.com  Thu Jul 12 20:46:53 2007
From: htejun at gmail.com (Tejun Heo)
Date: Fri, 13 Jul 2007 12:46:53 +0900
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <20070712143501.2c2cdf1f.akpm@linux-foundation.org>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>	<1183422700.3130.27.camel@localhost.localdomain>	<200707041611.30056.hnguyen@linux.vnet.ibm.com>	<1184097931.3020.73.camel@localhost.localdomain>
	<20070712143501.2c2cdf1f.akpm@linux-foundation.org>
Message-ID: <4696F5AD.1050306@gmail.com>

Hello,

Andrew Morton wrote:
>> Hoang-Nam Nguyen reported a bug in idr_get_new_above() 
>> which occurred with a starting id value like 0x3ffffffc.
>> His test module easily reproduced the problem.  Thanks.
>>
>> The test revealed the following bugs:
>>
>> 1. Relying on shift operations which have undefined results
>>    e.g.: 1 << n where n > word size.  On i386 an integer shift
>>    only uses the low 5 bits of the shift count.
>>
>> 2. An off by one error which prevented the top most layer
>>    of the radix tree from being allocated.  This meant that
>>    sub_alloc() would allocate an entry in the existing portion
>>    of the radix tree which aliased the requested address.  When
>>    it tried to allocate id 0x40000000, it might use the slot 
>>    belonging to id 0.
>>
>> 3. There was also a failure in the code which walked back up
>>    the tree if an allocation failed.  The normal case is to
>>    descend the tree checking the starting id value against the
>>    bitmap at each level.  If the bit is set, we know that the
>>    entire sub-tree is full and we can short cut the search.
>>    We may still descend to the lowest level and find that the
>>    portion of the id space we want is full.  In this case we
>>    need to walk back up the tree and continue the search.
>>    The existing code just returned to the previous level and
>>    continued.  This resulted in an attempt to allocate an id
>>    above 0x3ffffffc using the slot for id 0x3ffffc00 instead of
>>    0x40000000 which it then claimed to have allocated.  The same
>>    problem occurs with 0x3ff as the requested id value if it
>>    is already in use.
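
The first two bugs in the list above can be illustrated with a small
standalone sketch in C.  This is not the idr.c code or the patch: the
names safe_shl, layers_needed and aliased_slot are made up, and the value
of 5 id bits per radix-tree layer is an assumption about 32-bit builds.
It only shows why an out-of-range shift has to be guarded and why a tree
that is one layer too short maps 0x40000000 onto the slot for id 0.

```c
#include <limits.h>

/* Assumption: 5 id bits per radix-tree layer (32-bit build). */
#define SKETCH_IDR_BITS 5

/*
 * Shift that is well defined for any count.  C leaves value << count
 * undefined when count >= the width of the type, so return 0 instead.
 */
unsigned int safe_shl(unsigned int value, unsigned int count)
{
    if (count >= sizeof(value) * CHAR_BIT)
        return 0;
    return value << count;
}

/* Number of layers the tree needs before it can hold `id`. */
int layers_needed(unsigned int id)
{
    int layers = 1;

    /* Widen to 64 bits so the shift count always stays in range. */
    while (((unsigned long long)id >> (layers * SKETCH_IDR_BITS)) != 0)
        layers++;
    return layers;
}

/*
 * Slot that `id` lands on if the tree only has `layers` layers: the
 * high bits are silently dropped, which is the aliasing described above.
 */
unsigned int aliased_slot(unsigned int id, int layers)
{
    /* safe_shl() returns 0 when the tree covers all 32 bits, so the
     * mask becomes all ones and nothing is dropped. */
    unsigned int mask = safe_shl(1U, (unsigned int)layers * SKETCH_IDR_BITS) - 1U;

    return id & mask;
}
```

Under these assumptions 0x3ffffffc fits in six layers while 0x40000000
needs a seventh; with only six layers allocated, 0x40000000 aliases to
slot 0.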

The third one sounds like the bug I fixed.  With it fixed, I verified
that idr works correctly, at least in the lower range of allocation, by
running it in parallel with a simple bitmap allocator, but I haven't
tested a higher range like 0x3ffffffc.

-- 
tejun


From mst at dev.mellanox.co.il  Thu Jul 12 22:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 13 Jul 2007 08:47:11 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
Message-ID: <20070713054711.GA21709@mellanox.co.il>

> Also, if there's something I didn't list and didn't already include in
> the tree I asked Linus to pull, please remind me.  I probably dropped it.

Any plans to do something with multiple EQ support in mthca?

-- 
MST


From FENKES at de.ibm.com  Fri Jul 13 01:26:39 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Fri, 13 Jul 2007 10:26:39 +0200
Subject: [ofa-general] Re: [PATCH 00/10] IB/ehca: Multiple Event Queues,
 MR/MW rework, large page MRs, fixes
In-Reply-To: <adawsx55ys3.fsf@cisco.com>
Message-ID: <OFEE70CCDD.88537E20-ONC1257317.002E2CF5-C1257317.002E786F@de.ibm.com>

>  > [09/10] fixes a lot of checkpatch.pl warnings
> 
> Are these warnings from earlier patches in the series, or problems
> that already existed in the code?  If they are coming from other
> patches in the series, please just fix the earlier patches before I
> merge them.

Nam did a diff -Nurp empty_dir ehca | checkpatch.pl and fixed all the
existing problems in the code. That's why this is such a big hunk -
we've been doing the pointer-typecast thing wrong for a long time,
for example.

Joachim


From vlad at lists.openfabrics.org  Fri Jul 13 02:44:45 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 13 Jul 2007 02:44:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070713-0200 daily build status
Message-ID: <20070713094446.2D1B7E6038A@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From ramachandra.kuchimanchi at qlogic.com  Fri Jul 13 03:41:48 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 05:41:48 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
Message-ID: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>

Hi,

If the sm_lid value from /sys/class/infiniband/mthca0/ports/1/sm_lid
is 0x0 (or /sys/class/infiniband/ipath0/ports/1/sm_lid is 0xffff) should
it be considered as an invalid value for an SM LID and should one wait
till it changes to some other value before using that SM LID value in MADs ?

The IB spec says that LID 0x0 is reserved and 0xFFFF is a permissive DLID
value. Does this mean that the SM can never have either 0x0 or 0xFFFF as
an LID ?

Sometimes I have noticed this issue with ibsrpdm when the sm_lid value is
set after some delay. If I run ibsrpdm immediately after doing a
"service openibd start", ibsrpdm does not give any output. This
is because, when ibsrpdm reads the sm_lid value, it gets 0x0 on
mthca (0xffff on ipath), and when it uses that value in the MADs, the MADs
time out.

Regards,
Ram

From halr at voltaire.com  Fri Jul 13 06:31:41 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 09:31:41 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>
Message-ID: <1184333495.13276.90353.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 06:41, Kuchimanchi, Ramachandra wrote:
> Hi,
> 
> If the sm_lid value from /sys/class/infiniband/mthca0/ports/1/sm_lid
> is 0x0 (or /sys/class/infiniband/ipath0/ports/1/sm_lid is 0xffff)
> should
> it be considered as an invalid value for an SM LID and should one wait
> till it changes to some other value before using that SM LID value in
> MADs ?
> The IB spec says that LID 0x0 is reserved and 0xFFFF is a permissive
> DLID
> value. Does this mean that the SM can never have either 0x0 or 0xFFFF
> as
> an LID ?
> 
> Sometimes I have noticed this issue with ibsrpdm when the sm_lid value
> is
> set after some delay. If I run ibsrpdm immediately after doing a
> "service openibd start", ibsrpdm does not give any output. This
> is because, when ibsrpdm reads the sm_lid value it gets the value to
> be 0x0 on
> mthca (0xffff on ipath) and when it uses it in the MADs, the MADs
> timeout.

Those local values indicate the SM has not yet initialized the SMLID on
those ports. Is your SM running ? Are those ports active when you run
ibsrpdm ?

-- Hal

> Regards,
> Ram
> 
> 
> 
> ______________________________________________________________________
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From ramachandra.kuchimanchi at qlogic.com  Fri Jul 13 07:22:54 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 09:22:54 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>
	<1184333495.13276.90353.camel@hal.voltaire.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org>

Hal,

> Those local values indicate the SM has not yet initialized the SMLID on
> those ports. Is your SM running ? Are those ports active when you run
> ibsrpdm ?

Yes, the SM is running. I guess I am running ibsrpdm before the port
is active, and that's why it is getting an invalid SM LID value. If I run
it a little later, ibsrpdm works fine.

So I guess there should be a check to see that the port state is active before
reading the SM LID value.

Regards,
Ram

From halr at voltaire.com  Fri Jul 13 07:24:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 10:24:32 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org>
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>
	<1184333495.13276.90353.camel@hal.voltaire.com>
	<C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org>
Message-ID: <1184336672.13276.94041.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 10:22, Kuchimanchi, Ramachandra wrote:
> Hal,
> 
> > Those local values indicate the SM has not yet initialized the SMLID
> on
> > those ports. Is your SM running ? Are those ports active when you
> run
> > ibsrpdm ?
> 
> Yes the SM is running. I guess I am running ibsrpdm even before the
> port
> is active and thats why it is getting an invalid SM LID value. If I
> run
> it a little later, ibsrpdm works fine.
> 
> So I guess there should be a check to see that the port state is
> active before
> reading the SM LID value.

Where ? In ibsrpdm ?

I think the IB spec requirement is that the SMLID needs to be there at
armed (so I think if there is a check it should be armed or beyond).
Some SMs may do it sooner (like INIT) but that is not a requirement.

-- Hal

> Regards,
> Ram
> 


From ramachandra.kuchimanchi at qlogic.com  Fri Jul 13 07:58:55 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 09:58:55 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org><1184333495.13276.90353.camel@hal.voltaire.com><C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org>
	<1184336672.13276.94041.camel@hal.voltaire.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D817C6F7@EPEXCH1.qlogic.org>

Hal,

> Where ? In ibsrpdm ?

Yes or in general any one who is reading the SM LID value.

Put another way, how do you know when the SM LID value in
/sys/infiniband/.../sm_lid is the correct value ? Is it
better to check that the value is neither 0x0 nor 0xffff ?

Or do you go by the state of the port (armed or beyond as you mentioned)
and then read the sm_lid value ?

Regards,
Ram



From halr at voltaire.com  Fri Jul 13 08:00:46 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 11:00:46 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D817C6F7@EPEXCH1.qlogic.org>
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org>
	<1184333495.13276.90353.camel@hal.voltaire.com>
	<C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org>
	<1184336672.13276.94041.camel@hal.voltaire.com>
	<C07C40DB2364324799506DE8FF12F8D817C6F7@EPEXCH1.qlogic.org>
Message-ID: <1184338843.13276.96386.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 10:58, Kuchimanchi, Ramachandra wrote:
> Hal,
> 
> > Where ? In ibsrpdm ?
> 
> Yes or in general any one who is reading the SM LID value.
> 
> Put another way, how do you know when the SM LID value in
> /sys/infiniband/.../sm_lid is the correct value ? Is it
> better to check that the value is neither 0x0 nor 0xffff ?
> 
> Or do you go by the state of the port (armed or beyond as you
> mentioned)
> and then read the sm_lid value ?

I think there are multiple algorithms that work:
1. If port state >= armed (i.e. armed or active), SMLID is required to be
valid
2. If (SMLID != 0xffff) && (SMLID != 0x0), SMLID is valid

Maybe other algorithms too.

(Same for LID too)
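
The two checks listed above can be sketched as follows. This is a minimal illustration in Python, not code from ibsrpdm or any OFED tool: the helper names are hypothetical, and the sysfs text formats ("0x1" for sm_lid, "4: ACTIVE" for the port state) as well as the numeric port states are assumptions stated in the comments.

```python
# Sketch only: helper names are hypothetical, not part of any OFED tool.
PORT_STATE_ARMED = 3      # assumed IB numbering: 1 DOWN, 2 INIT, 3 ARMED, 4 ACTIVE
RESERVED_LID = 0x0000
PERMISSIVE_LID = 0xFFFF


def parse_sm_lid(text):
    """Parse sysfs-style hex text such as '0x1' into an int."""
    return int(text.strip(), 16)


def parse_port_state(text):
    """Parse sysfs-style state text such as '4: ACTIVE' into its number."""
    return int(text.split(":", 1)[0])


def sm_lid_valid_by_state(port_state):
    """Algorithm 1: SMLID must be valid once the port is armed or beyond."""
    return port_state >= PORT_STATE_ARMED


def sm_lid_valid_by_value(sm_lid):
    """Algorithm 2: neither the reserved nor the permissive LID."""
    return sm_lid not in (RESERVED_LID, PERMISSIVE_LID)
```

A caller that reads sm_lid right after "service openibd start" could poll until one of these checks passes before sending MADs.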

-- Hal

> Regards,
> Ram
> 
> 
> 
> 


From ramachandra.kuchimanchi at qlogic.com  Fri Jul 13 08:05:23 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 10:05:23 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <C07C40DB2364324799506DE8FF12F8D817C6F3@EPEXCH1.qlogic.org><1184333495.13276.90353.camel@hal.voltaire.com><C07C40DB2364324799506DE8FF12F8D817C6F6@EPEXCH1.qlogic.org><1184336672.13276.94041.camel@hal.voltaire.com><C07C40DB2364324799506DE8FF12F8D817C6F7@EPEXCH1.qlogic.org>
	<1184338843.13276.96386.camel@hal.voltaire.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D817C6F8@EPEXCH1.qlogic.org>

Hal,

Thanks a lot for the information.

Regards,
Ram


From pradeeps at linux.vnet.ibm.com  Fri Jul 13 09:34:43 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Jul 2007 09:34:43 -0700
Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation
	(for IPoIB CM)
In-Reply-To: <adad4yx43cc.fsf@cisco.com>
References: <OFE2D9DB0E.0AD8F1B0-ON85257300.007854EA-85257300.0079CE19@us.ibm.com>
	<adamyyt2fp5.fsf@cisco.com> <469680DB.6000602@linux.vnet.ibm.com>
	<adad4yx43cc.fsf@cisco.com>
Message-ID: <4697A9A3.2020706@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > In the absence of any further discussions about the IPoIB CM without SRQ
>  > patches, I will incorporate Sean Hefty's comments and plan to resubmit
>  > the patches, unless I hear something soon.
> 
> Sorry for not devoting enough time to this, but something always seems
> to come up, and I really want to be able to focus a concentrated chunk
> of time on this, and I never seem to be able to.  Anyway, I would
> prefer to find a solution that everyone can agree on, without me
> having to rule by decree.
> 
> I think updating the patch is a good idea.  Although I didn't get a
> chance to review it carefully there were a number of obvious messy
> parts that should be cleaned up.
> 
> I am beginning to think that your basic approach is probably right,
> but I also still think it should be possible to handle both SRQ and
> non-SRQ without any overhead on the fast path.  I don't understand the
> "maintainability" argument against doing this.  Can you expand on your
> position a little?
> 

I will try to illustrate with an example:

One of the ways to do this is to completely split SRQ and non-SRQ
processing starting in ipoib_poll(). This would eliminate most of
the if (srq) kind of branches. However, there would be a lot of code
duplication. If a bug is discovered in one path, then one needs to
fix that in the other path too.

One way to mitigate this situation is to alter the current SRQ code
to use common code (shared between SRQ and non-SRQ). However, one might
not want to factor out just a few lines of common code into a new function,
and there may be several such occurrences, resulting in code bloat.

If you look back, several weeks ago ipoib_drain_cq() did not exist. This
is another function that calls ipoib_cm_handle_rx_wc(). We would need
to alter this function too to accommodate the SRQ and non-SRQ split. In
effect, we would have propagated the SRQ and non-SRQ code to functions
outside ipoib_cm.c. In the future, if IPoIB CM were to support UC mode,
this might mean additional functions handling the split.

On the other hand, in V6 (and previous versions) of the patch,
ipoib_cm_handle_rx_wc() handles both the SRQ and non-SRQ paths, and all SRQ
and non-SRQ functionality is contained within ipoib_cm.c. What we now have
is probably one more branch in the packet handling path than the
(desired) minimum, in exchange for a lot of common code.
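
The trade-off can be sketched in a few lines. This is a schematic illustration in Python only; every name in it is hypothetical and none of it is the IPoIB code. It contrasts one handler with a single SRQ/non-SRQ branch against two branch-free handlers that each have to repeat the shared steps.

```python
# Sketch only: every name here is hypothetical; this is not the IPoIB code.

def common_rx_work(packet):
    """Stand-in for the processing shared by the SRQ and non-SRQ cases."""
    return packet.upper()


def repost_srq(log):
    log.append("repost to shared receive queue")


def repost_per_qp(log):
    log.append("repost to per-connection receive queue")


def handle_rx_single(packet, use_srq, log):
    """One handler: common work plus a single SRQ/non-SRQ branch."""
    result = common_rx_work(packet)
    if use_srq:
        repost_srq(log)
    else:
        repost_per_qp(log)
    return result


def handle_rx_srq(packet, log):
    """Split design: no branch, but the shared steps appear in both paths."""
    result = common_rx_work(packet)
    repost_srq(log)
    return result


def handle_rx_nosrq(packet, log):
    """Second half of the split design, kept in sync with handle_rx_srq."""
    result = common_rx_work(packet)
    repost_per_qp(log)
    return result
```

In the split variant, any change to the shared steps has to be made in both handlers, which is the maintainability concern raised above.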

Pradeep


From xma at us.ibm.com  Fri Jul 13 10:25:34 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 10:25:34 -0700
Subject: [ofa-general] OFED 1.3 timeline
In-Reply-To: <4693BF47.8070700@mellanox.co.il>
Message-ID: <OF68674E89.B4A67922-ON87257317.005E8658-88257317.005F8CC3@us.ibm.com>

Hello Tziporet,

> Full features list will be published in a different mail

        Are the features limited to those on the list? I only saw IPoIB-CM w/o
SRQ. My impression was that whatever features go into 2.6.23 will also be
in ofed-1.3. Are you saying that only the features on the list will be
taken from 2.6.23? We are working on several IPoIB performance improvement
patches which are not on the list. Some of the patches are under test,
some of the patches are going to be submitted soon. They are:

1.  skb aggregations for both dev xmit(networking layer) and IPoIB send
2.  multiple interrupt vectors in IPoIB for multiple links scalability
3.  split CQ and send completion aggregation
4.  LRO for IPoIB when generic LRO is available in networking layer.
 
        Some of them might make the ofed-1.3 timeline, some of them might
not. It will depend on our test progress and community review feedback.
I hope ofed-1.3 won't leave these patches out if they make it into
2.6.23 in time.

Thanks
Shirley


From rdreier at cisco.com  Fri Jul 13 11:14:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 13 Jul 2007 11:14:31 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070713054711.GA21709@mellanox.co.il> (Michael S. Tsirkin's
	message of "Fri, 13 Jul 2007 08:47:11 +0300")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<20070713054711.GA21709@mellanox.co.il>
Message-ID: <adar6nc2mt4.fsf@cisco.com>

 > Any plans to do something with multiple EQ support in mthca?

I haven't done any work on it or seen anything from anyone else, so I
expect this will have to wait for 2.6.24.


From xma at us.ibm.com  Fri Jul 13 11:50:54 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 11:50:54 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <adar6nc2mt4.fsf@cisco.com>
Message-ID: <OFCB3C6CBA.79B1F6B5-ON87257317.00675CD3-88257317.00675C91@us.ibm.com>

Hello Roland,

>  > Any plans to do something with multiple EQ support in mthca?
> 
> I haven't done any work on it or seen anything from anyone else, so I
> expect this will have to wait for 2.6.24.

        We are working on IPoIB to use multiple EQs for multiple
links/connections scalability. Does this mean this will wait for 2.6.24?

Thanks
Shirley


From xma at us.ibm.com  Fri Jul 13 11:56:57 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 11:56:57 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com>
Message-ID: <OF419657E7.3098466C-ON87257317.0067BBE6-88257317.0067EA65@us.ibm.com>

Hello Roland,

        FYI, we are working on several IPoIB performance improvement 
patches which are not on the list. Some of the patches are under test, 
some of the patches are going to be submitted soon. They are:

1.  skb aggregations for both dev xmit(networking layer) and IPoIB send 
(it will be submitted soon, for both UD and RC mode)
2.  multiple interrupt vectors in IPoIB for multiple links scalability 
(working on patch for both UD and RC mode)
3.  split CQ and send completion aggregation (for both UD and RC mode)
4.  LRO for IPoIB when generic LRO is available in networking layer. (UD 
mode only)

        Some of them might make the 2.6.23 timeline, some of them might
not; it depends on our test progress and community review feedback.

Thanks
Shirley


From Don.Kerr at Sun.COM  Fri Jul 13 13:50:45 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Fri, 13 Jul 2007 16:50:45 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <4697E5A5.1030500@Sun.COM>


Caitlin Bestler wrote:

>Don.Kerr at Sun.COM wrote:
>  
>
>>I am working on a uDAPL layer for Open MPI.  The situation is
>>if I have more than one port/HCA my users may want to be
>>selective in what is used and to do this they would need to
>>provide some information regarding which port/HCA to use. So
>>my thought is that the users are more familar with the output
>>from "ifconfig", for example ib0, ib1, etc, and I was trying
>>to find a way to correlate that to what is available from the
>>uDAPL API. Maybe I need to reprogram them to look at dat.conf.
>>
>>-DON
>>
>>    
>>
>
>You definitely do not want to parse dat.conf, you want to see
>what the dat_registry has loaded. dat.conf is static, Providers
>are allowed to dynamically adapt how they register themselves.
>I don't believe that is an active concern, but it's simpler to
>take advantage of the existing code and be safe in case somebody
>comes along later and decides to do dynamic registration only.
>
>But you hit the nail on the head in terms of needing to correlate
>devices as reported by "ifconfig" and the Interface Adapter that
>you try to open.
>  
>
Which brings us back to one of my original questions which was "is there 
a way to get the entire dat.conf entry from the uDAPL API". And what I 
am hearing is no, not yet anyway.

Just to take this one more step, and talking about the ofed dat.conf 
example now.
Example:
    OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so 
dapl.1.2 "ib0 0" ""

Since I can get the first field, in this example "OpenIB-cma", from the
IA name attribute of the uDAPL API, was exposing the data in the 7th field,
in this example "ib0 0", ever considered as well?  Or does that just not
make sense?
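
For reference, the example entry above splits into fields as in the sketch below. It is only meant to show the layout of the line: the quoted advice is to query the DAT registry and not to parse dat.conf, and the field labels are descriptive names chosen for this sketch, not identifiers from the uDAPL API.

```python
import shlex

# Field labels are descriptive names chosen for this sketch.
FIELD_LABELS = ("ia_name", "api_version", "thread_safety", "default",
                "library_path", "provider_version", "ia_params",
                "platform_params")

ENTRY = ('OpenIB-cma u1.2 nonthreadsafe default '
         '/usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""')


def split_entry(line):
    """Split one dat.conf-style line, honouring double-quoted fields."""
    return dict(zip(FIELD_LABELS, shlex.split(line)))


fields = split_entry(ENTRY)
print(fields["ia_name"])     # OpenIB-cma
print(fields["ia_params"])   # ib0 0
```

The "ib0 0" token is the piece that would tie the entry to an interface name as shown by ifconfig.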

-DON


>Basically, the intent has always been that the correlation between
>an Interface Adapter and an "ifconfig" entry should be so obvious
>that a complete idiot could figure out which went with which.
>Once that linkage is clear then you merely use the RDMA device/port
>implied by the routing of the device listed by ifconfig.
>  
>
>To the best of my knowledge, for every DAPL provider ever created
>the correlation with the IP layer device has indeed been so obvious
>that any idiot could figure it out -- unfortunately software can only
>hope to someday reach that degree of intelligence, and other than 
>configuring the links there really isn't much that can be done.
>
>Once there is a link between the RDMA device and the IP layer device,
>you could use the routing tables to determine which port a connection
>request could be received on, which ports could originate a packet with
>a given IP address and which ports could send a packet to a given IP
>destination. Given that, you want the matching RDMA device.
>
>Such a linkage would allow the application to correctly determine
>the exact DAPL Provider that needed to be opened, and only
>that one. Without it the application has to scan the registry list
>and essentially do a serial search. The good news is that it won't
>be a very long serial search and it doesn't have to be performed
>that often.
>
>


From pradeeps at linux.vnet.ibm.com  Fri Jul 13 13:58:21 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Jul 2007 13:58:21 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4695290F.7090005@hp.com>
References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il>
	<46950756.5090501@hp.com> <4695290F.7090005@hp.com>
Message-ID: <4697E76D.4060706@linux.vnet.ibm.com>

Rick Jones wrote:
>>> Was this data posted on-list? I didn't see it.
>>>
>>
>> Hasn't been.  I presume that folks are curious?-)
> 
>               RedHat Enterprise Linux 5
>               Single-Stream Performance
> 
>                              Bulk Transfer                "Latency"
>                          Unidir            Bidir
>     Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
> -------------------------------------------------------------------------
>  AD313A  IPoIB 1.1 2970 4.418 4.544  3530  3.59  3.95  19290 n/a   n/a
>  AD313A  SDP   1.1 7810 0.453 1.048 12820  0.69  0.68  38030 26.29 26.29
>  AD313A  SDP p0    7810 0.346 0.527 12670  0.42  0.43  19380 n/a   n/a
>  AD313A  IPoIB 1.2 5510 0.426 1.593  5730  n/a   n/a   18990 n/a   n/a
>  AD313A  SDP   1.2 7820 0.409 1.047 12890  0.64  0.68  41988 25.89 26.32
>  AD313A SDP p0 1.2 7820 0.309 0.517 12760  0.36  0.36  19800 15.47 15.72
> 
> netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM, 
> SDP_STREAM), -s 1M -S 1M -r 64K -b 12 for the bidirectional [SDP|TCP]_RR 
> test, -r 1 for the [TCP|SDP]_RR test.
> 

What was the mtu used for these tests?

Pradeep


From rick.jones2 at hp.com  Fri Jul 13 14:10:01 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 14:10:01 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4697E76D.4060706@linux.vnet.ibm.com>
References: <4694044D.8010208@hp.com>
	<20070711061444.GG11320@mellanox.co.il>	<46950756.5090501@hp.com>
	<4695290F.7090005@hp.com> <4697E76D.4060706@linux.vnet.ibm.com>
Message-ID: <4697EA29.1030307@hp.com>

Pradeep Satyanarayana wrote:
> Rick Jones wrote:
> 
>>>> Was this data posted on-list? I didn't see it.
>>>>
>>>
>>> Hasn't been.  I presume that folks are curious?-)
>>
>>
>>               RedHat Enterprise Linux 5
>>               Single-Stream Performance
>>
>>                              Bulk Transfer                "Latency"
>>                          Unidir            Bidir
>>     Card          Mbit/s SDx   SDr   Mbit/s SDx   SDr   Tran/s SDx   SDr
>> -------------------------------------------------------------------------
>>  AD313A  IPoIB 1.1 2970 4.418 4.544  3530  3.59  3.95  19290 n/a   n/a
>>  AD313A  SDP   1.1 7810 0.453 1.048 12820  0.69  0.68  38030 26.29 26.29
>>  AD313A  SDP p0    7810 0.346 0.527 12670  0.42  0.43  19380 n/a   n/a
>>  AD313A  IPoIB 1.2 5510 0.426 1.593  5730  n/a   n/a   18990 n/a   n/a
>>  AD313A  SDP   1.2 7820 0.409 1.047 12890  0.64  0.68  41988 25.89 26.32
>>  AD313A SDP p0 1.2 7820 0.309 0.517 12760  0.36  0.36  19800 15.47 15.72
>>
>> netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM, 
>> SDP_STREAM), -s 1M -S 1M -r 64K -b 12 for the bidirectional 
>> [SDP|TCP]_RR test, -r 1 for the [TCP|SDP]_RR test.
>>
> 
> What was the mtu used for these tests?

The defaults, which are, IIRC, 2044 bytes for 1.1 and  65520 bytes for 1.2.

Netperf will convert "1M" to 1048576 and "64K" to 65536.
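
The suffix expansion can be sketched in a few lines of Python. This is
only an illustration of the arithmetic, not netperf's own code; the
to_bytes helper and the SUFFIXES table are hypothetical names, and the
sketch assumes upper-case K/M/G mean powers of 1024.

```python
# Hypothetical helper, not netperf's actual code: expand a size argument
# such as "1M" or "64K" into bytes, treating K/M/G as powers of 1024.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(arg: str) -> int:
    """Return the byte count for a string like '64K', '1M', or '2048'."""
    arg = arg.strip()
    if arg and arg[-1] in SUFFIXES:
        return int(arg[:-1]) * SUFFIXES[arg[-1]]
    return int(arg)  # no suffix: the value is already in bytes


print(to_bytes("1M"), to_bytes("64K"))
```

With the options used above, "-s 1M -S 1M" asks for 1048576-byte socket
buffers and "-m 64K" for 65536-byte sends.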

rick jones
wonders what other numbers are out there...


From caitlinb at broadcom.com  Fri Jul 13 14:10:33 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 13 Jul 2007 14:10:33 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <4697E5A5.1030500@Sun.COM>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D048991C2@NT-IRVA-0750.brcm.ad.broadcom.com>


> >
> >But you hit the nail on the head in terms of needing to correlate 
> >devices as reported by "ifconfig" and the Interface Adapter that you 
> >try to open.
> >  
> >
> Which brings us back to one of my original questions which 
> was "is there a way to get the entire dat.conf entry from the 
> uDAPL API". And what I am hearing is no, not yet anyway.
> 
> Just to take this one more step, and talking about the ofed 
> dat.conf example now.
> Example:
>     OpenIB-cma u1.2 nonthreadsafe default 
> /usr/local/lib64/libdaplcma.so
> dapl.1.2 "ib0 0" ""
> 
> Since I can get the first field, in this example 
> "OpenIB-cma", from the ia name attribute of the uDAPL API was 
> the data in the 6th field, example "ib0 0" considered for the 
> first entry?  Or does that just not make sense?
> 

dat_registry_list_providers will give you a list of all registered
providers. If you open and query them you can have all of the info
you require. If you want more info without opening it, I suppose you
could read dat.conf, but I'd strongly suggest figuring out a way to
use the existing code and take advantage of the existing data
structures.
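
For the fallback of reading dat.conf directly, a minimal Python sketch
follows. It assumes each entry is a single line of eight
whitespace-separated fields with double-quoted strings, as in the example
quoted above once its wrapped lines are rejoined. The function name and
the field labels are hypothetical, chosen for this sketch rather than
taken from the DAT registry code.

```python
import shlex

# Field labels are assumptions made for this sketch; verify them against
# the DAT registry documentation before relying on them.
FIELDS = ("ia_name", "api_version", "thread_safety", "default",
          "lib_path", "provider_version", "ia_params", "platform_params")


def parse_dat_conf_line(line: str) -> dict:
    """Split one dat.conf entry into named fields, honoring quoted strings."""
    tokens = shlex.split(line, comments=True)  # drops trailing '#' comments
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    return dict(zip(FIELDS, tokens))


entry = ('OpenIB-cma u1.2 nonthreadsafe default '
         '/usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""')
parsed = parse_dat_conf_line(entry)
print(parsed["ia_name"], "->", parsed["ia_params"])
```

This only recovers what is written in the file; it says nothing about
which providers are actually registered, which is why the registry calls
remain the better route.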

Any host platform, such as openfabrics, could adopt a naming convention
that tied the DAT Provider IA Name directly to the underlying device
name(s). DAT, being OS independent, could not mandate any such pattern.
But a specific OS certainly could, and openfabrics is definitely the
place to make such conventions for Linux.

Without such a convention the only way to cross-correlate the DAT IA
name with the underlying transport device is by matching their IP
addresses.


From xma at us.ibm.com  Fri Jul 13 14:30:37 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 14:30:37 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4695290F.7090005@hp.com>
Message-ID: <OF1622536E.BFCC37E2-ON87257317.0075A02D-88257317.0075FC05@us.ibm.com>

Rick,

        Could you please run netperf/netserver on a different CPU from the
irq handler to see whether it makes any difference?
        The bidirectional BW is very different from the unidirectional BW. We are
working on a split CQ, send completion aggregation patch, and will test it
later to see how much it improves bidirectional BW on Mellanox.
 
Thanks
Shirley


From rick.jones2 at hp.com  Fri Jul 13 14:44:39 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 14:44:39 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <OF1622536E.BFCC37E2-ON87257317.0075A02D-88257317.0075FC05@us.ibm.com>
References: <OF1622536E.BFCC37E2-ON87257317.0075A02D-88257317.0075FC05@us.ibm.com>
Message-ID: <4697F247.7050903@hp.com>

Shirley Ma wrote:
> Rick,
> 
>         Could you please run netperf/netserver on a different CPU from the
> irq handler to see whether it makes any difference?

I have already done that - what is in the table is the peak performance out of 
four different runs where I change the netperf/netserver CPU binding relative to 
the interrupt CPU.

For which specific entries in the table would you like to see the four sets of 
results?  If you can give me the line from the table I can go back and find the 
four results I ran to get there.

rick jones


BTW, what is "general-bounces" - it seems to have been one of the emails in the 
dist...

>         The bidirectional BW is very different from the unidirectional BW. We are
> working on a split CQ, send completion aggregation patch, and will test it
> later to see how much it improves bidirectional BW on Mellanox.
>  
> Thanks
> Shirley


From jgunthorpe at obsidianresearch.com  Fri Jul 13 15:05:48 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 13 Jul 2007 16:05:48 -0600
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4697F247.7050903@hp.com>
References: <OF1622536E.BFCC37E2-ON87257317.0075A02D-88257317.0075FC05@us.ibm.com>
	<4697F247.7050903@hp.com>
Message-ID: <20070713220548.GU13618@obsidianresearch.com>

On Fri, Jul 13, 2007 at 02:44:39PM -0700, Rick Jones wrote:

> BTW, what is "general-bounces" - it seems to have been one of the emails in 
> the dist...

general-bounces is the 'Envelope From' for all messages from the list
server. This address is generally hidden from all MUAs and any that
manages to stick it in a CC list is severely broken.

Sending email to the -bounces address of the list could get you
unsubscribed, don't do it :)

Jason


From xma at us.ibm.com  Fri Jul 13 15:09:16 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 15:09:16 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4697F247.7050903@hp.com>
Message-ID: <OFC28E17C9.646B0FD5-ON87257317.00796ED1-88257317.007985F2@us.ibm.com>

Rick,

>For which specific entries in the table would you like to see the four
>sets of results?

I am only interested in IPoIB at this moment for both ofed-1.1 and 
ofed-1.2. Is the device PCI-X or PCI-e based?

Thanks
Shirley


From rick.jones2 at hp.com  Fri Jul 13 15:30:08 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 15:30:08 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <20070713220548.GU13618@obsidianresearch.com>
References: <OF1622536E.BFCC37E2-ON87257317.0075A02D-88257317.0075FC05@us.ibm.com>	<4697F247.7050903@hp.com>
	<20070713220548.GU13618@obsidianresearch.com>
Message-ID: <4697FCF0.7060400@hp.com>

Jason Gunthorpe wrote:
> On Fri, Jul 13, 2007 at 02:44:39PM -0700, Rick Jones wrote:
> 
> 
>>BTW, what is "general-bounces" - it seems to have been one of the emails in 
>>the dist...
> 
> 
> general-bounces is the 'Envelope From' for all messages from the list
> server. This address is generally hidden from all MUAs and any that
> manages to stick it in a CC list is severely broken.
> 
> Sending email to the -bounces address of the list could get you
> unsubscribed, don't do it :)

Well, I guess I got at least one sent OK since I got this from you, but I'll try 
to remember to strip general-bounces in the future should I see it again.

rick


From rick.jones2 at hp.com  Fri Jul 13 15:45:24 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 15:45:24 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <OFC28E17C9.646B0FD5-ON87257317.00796ED1-88257317.007985F2@us.ibm.com>
References: <OFC28E17C9.646B0FD5-ON87257317.00796ED1-88257317.007985F2@us.ibm.com>
Message-ID: <46980084.2000802@hp.com>

> I am only interested in IPoIB at this moment for both ofed-1.1 and 
> ofed-1.2. Is the device PCI-X or PCI-e based?

Well, I guess that's better than "everything" :)  but it is still a trifle
broad.  Anyway, I'll suppress my "sending to another .com" paranoia by reminding
myself that all this is shipping :) and include the results here.

The device is PCIe.  lspci shows:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode) (rev 20)

RHEL5 included OFED 1.1 bits:

rx2660 to rx2660, rhel5, AD313A, IPoIB and now SDP, same
sysctl settings, irqbalance killed to keep things from moving around,
interrupts on the HCA now on cpu0 on one system and cpu 1 on the other

[here are the sysctl.conf settings:

[root at hpcpc107 ~]# tail /etc/sysctl.conf
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 2147483648


net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_wmem = 4096 87380 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.conf.default.arp_ignore = 1
net.ipv4.conf.default.arp_filter = 1

]

[ the first number is the CPU to which netperf is bound, the second is the CPU 
to which netserver is bound.  the systems under test had _four_ cores, which 
means that when netperf reports 25% CPU util it means the equivalent of a full 
core was consumed etc etc ]

single-connection, unidirectional TCP_STREAM 1Mx64:
[root at hpcpc106 netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j 
`netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_STREAM -i 30,3 -l 30 -- -s 
1M -S 1M -m 64K`; done; done
0 0 2097152 2097152 65536 30.00 2382.94 25.00 37.29 3.438 5.128
0 1 2097152 2097152 65536 30.00 2315.88 19.97 25.03 2.826 3.542
1 0 2097152 2097152 65536 30.00 2974.46 40.10 41.25 4.418 4.544 *
1 1 2097152 2097152 65536 30.00 2358.11 27.39 25.01 3.807 3.476


[NOTE NOTE NOTE - the units here are still transactions per second! so, to get 
to mbit/s multiply by 2x65536x8 and divide by 1000000...  To get the service 
demand in usec per KB transferred, divide the service demand by 128 since that 
was the number of KB transferred per transaction]

single-connection, bidirectional TCP_RR 1Mx64x, ad313a hca in x8 slot
[root at hpcpc106 netperf2_work]# for i in 0 1 ; do for j in 0 1 ; do echo $i $j 
`netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_RR -i 30,3 -l 30 -- -s 1M 
-S 1M -r 64K -b 12`; done; done
0 0 2097152 2097152 65536 65536 30.00 2485.29 25.01 31.45 402.595 506.196 
2097152 2097152
0 1 2097152 2097152 65536 65536 30.00 2414.18 23.38 23.33 387.354 386.507 
2097152 2097152
1 0 2097152 2097152 65536 65536 30.00 3368.20 38.75 38.72 460.153 459.788 
2097152 2097152 *
1 1 2097152 2097152 65536 65536 30.00 2504.03 31.54 25.05 503.753 400.236 
2097152 2097152
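
As a worked example of the conversion described in the note above, the
following Python sketch applies it to the starred "1 0" row. The function
names are hypothetical; the only inputs are the 64K request/response size
and the figures printed in that row.

```python
# Each TCP_RR transaction with -r 64K moves a 65536-byte request and a
# 65536-byte response, i.e. 128 KB per transaction.
REQUEST_BYTES = 65536


def rr_to_mbit_per_s(trans_per_s: float) -> float:
    """Convert transactions/s to Mbit/s (10**6 bits), both directions summed."""
    return trans_per_s * 2 * REQUEST_BYTES * 8 / 1_000_000


def service_demand_per_kb(service_demand_per_tran: float) -> float:
    """Convert service demand per transaction to service demand per KB."""
    kb_per_tran = 2 * REQUEST_BYTES / 1024  # 128 KB
    return service_demand_per_tran / kb_per_tran


# Starred "1 0" row: 3368.20 trans/s, service demand 460.153 and 459.788
mbit = rr_to_mbit_per_s(3368.20)
local_sd = service_demand_per_kb(460.153)
remote_sd = service_demand_per_kb(459.788)
print(f"{mbit:.2f} Mbit/s, {local_sd:.3f} and {remote_sd:.3f} usec/KB")
```

The throughput comes out at about 3532 Mbit/s, in line with the roughly
3530 Mbit/s bidirectional IPoIB 1.1 figure in the summary table posted
earlier in the thread.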

[NOTE NOTE NOTE - when netperf reports a confidence of 20.7% it means +/- 10.35%]

single-connection, single-byte, TCP_RR, ad313a hca in x8 slot:
[root at hpcpc106 netperf2_work]# for i in 0 1 ; do for j in 0 1 ; do echo $i $j 
`netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_RR -i 30,3 -l 30 `; done; done
0 0 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.1% !!! Local CPU util : 20.7% !!! Remote CPU util : 13.7% 87380 
87380 1 1 30.00 15743.40 4.84 10.08 12.293 25.610 87380 87380
0 1 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.4% !!! Local CPU util : 59.3% !!! Remote CPU util : 51.1% 87380 
87380 1 1 30.00 19298.77 4.70 7.09 9.751 14.694 87380 87380
1 0 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.2% !!! Local CPU util : 28.6% !!! Remote CPU util : 34.4% 87380 
87380 1 1 30.00 13016.11 6.15 6.57 18.912 20.195 87380 87380
1 1 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.1% !!! Local CPU util : 8.7% !!! Remote CPU util : 23.4% 87380 
87380 1 1 30.00 15375.13 9.93 6.30 25.839 16.393 87380 87380

And now the OFED 1.2 bits I installed overtop of the 1.1 stuff which shipped 
with RHEL5

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, TCP_STREAM
1Mx64K. CPU 0 taking interrupts, IB switch in place:

[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T 
$i,$j -t TCP_STREAM -H 192.168.1.107 -c -C -l 30 -i 30,3 -- -s 1M -S 1M -m 64K`; 
done; done
0 0 2097152 2097152 65536 30.00 5227.08 6.19 25.00 0.388 1.568
0 1 2097152 2097152 65536 30.00 5449.90 6.47 26.77 0.389 1.610
1 0 2097152 2097152 65536 30.00 5235.90 6.70 25.01 0.420 1.565
1 1 2097152 2097152 65536 30.00 5511.77 7.16 26.80 0.426 1.593 *

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, bidirectional
TCP_RR 1Mx64Kx12, CPU 0 taking interrupts, IB switch in place:

[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T 
$i,$j -t TCP_RR -H 192.168.1.107 -c -C -l 30 -i 30,3 -- -s 1M -S 1M -r 64K -b 
12`; done; done
0 0 2097152 2097152 65536 65536 30.00 5314.44 16.13 16.08 121.431 121.049 
2097152 2097152
0 1 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.3% !!! Local CPU util : 20.4% !!! Remote CPU util : 48.2% 2097152 
2097152 65536 65536 30.00 5384.71 17.24 23.42 128.082 174.245 2097152 2097152
1 0 2097152 2097152 65536 65536 30.00 5388.18 17.06 16.27 126.619 120.784 
2097152 2097152
1 1 !!! WARNING !!! Desired confidence was not achieved within the specified 
iterations. !!! This implies that there was variability in the test environment 
that !!! must be investigated before going further. !!! Confidence intervals: 
Throughput : 0.3% !!! Local CPU util : 45.3% !!! Remote CPU util : 0.3% 2097152 
2097152 65536 65536 30.00 5469.22 22.58 17.08 165.328 124.947 2097152 2097152 *

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, TCP_RR, CPU 0
taking interrupts, IB switch in place:

[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T 
$i,$j -t TCP_RR -H 192.168.1.107 -l 30 -i 30,3`; done; done
0 0 87380 87380 1 1 30.00 18990.16 87380 87380 *
0 1 87380 87380 1 1 30.00 14985.03 87380 87380
1 0 87380 87380 1 1 30.00 15045.17 87380 87380
1 1 87380 87380 1 1 30.00 12408.56 87380 87380

(I didn't bother asking for CPU util in the single-byte TCP_RR tests because I 
knew that the confidence intervals wouldn't be met and it would only lengthen 
the runtime)

Sorry that the confidence interval warnings make things hard to read there.

rick jones

> 
> Thanks
> Shirley


From vlad at lists.openfabrics.org  Sat Jul 14 02:44:16 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 14 Jul 2007 02:44:16 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070714-0200 daily build status
Message-ID: <20070714094416.9E511E6084E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From mst at dev.mellanox.co.il  Sat Jul 14 10:54:25 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 14 Jul 2007 20:54:25 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <adar6nc2mt4.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<20070713054711.GA21709@mellanox.co.il> <adar6nc2mt4.fsf@cisco.com>
Message-ID: <20070714175425.GA17597@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: Further 2.6.23 merge plans...
> 
>  > Any plans to do something with multiple EQ support in mthca?
> 
> I haven't done any work on it or seen anything from anyone else, so I
> expect this will have to wait for 2.6.24.

I'm surprised to hear this. How about this:
http://lists.openfabrics.org/pipermail/general/2007-May/035757.html

-- 
MST


From jackm at dev.mellanox.co.il  Sun Jul 15 00:28:23 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 15 Jul 2007 10:28:23 +0300
Subject: [ofa-general] Re: [PATCH 1 of 2]  mlx4: implement query-qp
In-Reply-To: <ada1wfd5jr8.fsf@cisco.com>
References: <200706211227.47794.jackm@dev.mellanox.co.il>
	<ada1wfd5jr8.fsf@cisco.com>
Message-ID: <200707151028.24013.jackm@dev.mellanox.co.il>

On Friday 13 July 2007 01:39, Roland Dreier wrote:
> thanks, applied
> 
Need 2 fixes to this patch (sorry about that).
- Jack

2 fixes for mlx4-query-qp.patch:
1. Flow label field is 20 bits, not 24 bits.  Need appropriate mask.
2. When the QP is in the INIT state, the sched_queue field is not yet available
   in the firmware, so the f/w cannot provide the port number in query_qp.  In this
   case, need to use the port number which was saved in the kernel qp object.

Found by Dotan Barak and Yaron Gepstein of Mellanox.
Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

--- kernel_patches/fixes/mlx4-query-qp.patch	2007-07-15 10:04:02.678561000 +0300
+++ kernel_patches/fixes/mlx4-query-qp.patch	2007-07-15 10:07:13.883508000 +0300
@@ -102,7 +101,7 @@ Index: new_connectx_kernel/drivers/infin
 +		ib_ah_attr->grh.traffic_class =
 +			(be32_to_cpu(path->tclass_flowlabel) >> 20) & 0xff;
 +		ib_ah_attr->grh.flow_label =
-+			be32_to_cpu(path->tclass_flowlabel) & 0xffffff;
++			be32_to_cpu(path->tclass_flowlabel) & 0xfffff;
 +		memcpy(ib_ah_attr->grh.dgid.raw,
 +			path->rgid, sizeof ib_ah_attr->grh.dgid.raw);
 +	}
@@ -147,7 +146,10 @@ Index: new_connectx_kernel/drivers/infin
 +	}
 +
 +	qp_attr->pkey_index = context.pri_path.pkey_index & 0x7f;
-+	qp_attr->port_num   = context.pri_path.sched_queue & 0x40 ? 2 : 1;
++	if (qp_attr->qp_state == IB_QPS_INIT)
++		qp_attr->port_num = qp->port;
++	else
++		qp_attr->port_num = context.pri_path.sched_queue & 0x40 ? 2 : 1;
 +
 +	/* qp_attr->en_sqd_async_notify is only applicable in modify qp */
 +	qp_attr->sq_draining = mlx4_state == MLX4_QP_STATE_SQ_DRAINING;
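
The effect of the two fixes can be modelled outside the kernel. The
Python sketch below is an illustration only, not driver code: it assumes
the 32-bit tclass_flowlabel word has already been converted to host byte
order, and the function names are hypothetical.

```python
FLOW_LABEL_MASK = 0xFFFFF  # flow label is 20 bits wide


def split_tclass_flowlabel(word: int) -> tuple:
    """Return (traffic_class, flow_label) from a host-order 32-bit word."""
    traffic_class = (word >> 20) & 0xFF  # 8 bits above the flow label
    flow_label = word & FLOW_LABEL_MASK
    return traffic_class, flow_label


def port_num(in_init_state: bool, saved_port: int, sched_queue: int) -> int:
    """Use the saved port while in INIT, else bit 6 of sched_queue."""
    if in_init_state:
        return saved_port
    return 2 if sched_queue & 0x40 else 1


word = (0xAB << 20) | 0x12345  # traffic class 0xAB, flow label 0x12345
tclass, flow = split_tclass_flowlabel(word)
leaky = word & 0xFFFFFF  # the old 24-bit mask
print(hex(tclass), hex(flow), hex(leaky))
```

With the old 24-bit mask the low four bits of the traffic class leak into
the flow label, which is what the first fix removes.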


From jackm at dev.mellanox.co.il  Sun Jul 15 00:58:55 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 15 Jul 2007 10:58:55 +0300
Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp
In-Reply-To: <adawsx54507.fsf@cisco.com>
References: <200706211229.08703.jackm@dev.mellanox.co.il>
	<adawsx54507.fsf@cisco.com>
Message-ID: <200707151058.55805.jackm@dev.mellanox.co.il>

On Friday 13 July 2007 01:43, Roland Dreier wrote:
>  > +	init_attr->cap.max_recv_wr =  mqp->rq.max_post;
>  > +	init_attr->cap.max_recv_sge =  mqp->rq.max_gs;
> 
> Why do we have to reset these in userspace?  Doesn't the kernel
> already give us correct info for the receive queue?
> 
>  - R.
> 
I just thought it was cleaner to have kernel-space deal with kernel-space
qp capabilities, and user-space deal with user-space qp capabilities
(and not split things between sq capabilities -- which do require user-space-only
info -- and rq capabilities, which do not).

Thus, in the kernel-space patch, at the end of procedure mlx4_ib_query_qp(), in file
drivers/infiniband/hw/mlx4/qp.c, I have:

+       if (!ibqp->uobject) {
+               qp_attr->cap.max_send_wr     = qp->sq.wqe_cnt;
+               qp_attr->cap.max_recv_wr     = qp->rq.wqe_cnt;
+               qp_attr->cap.max_send_sge    = qp->sq.max_gs;
+               qp_attr->cap.max_recv_sge    = qp->rq.max_gs;
+               qp_attr->cap.max_inline_data = (1 << qp->sq.wqe_shift) -
+                       send_wqe_overhead(qp->ibqp.qp_type) -
+                       sizeof (struct mlx4_wqe_inline_seg);
+               qp_init_attr->cap            = qp_attr->cap;
+       }

If you wish to have the kernel return max_recv_wr and max_recv_sge, you will need
to change the above code snippet, and move the max_recv_wr and max_recv_sge assignments
outside the "if".

- Jack


From mst at dev.mellanox.co.il  Sun Jul 15 02:41:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 15 Jul 2007 12:41:45 +0300
Subject: [ofa-general] [PATCHv2] IB/mad: fix duplicated kernel thread name
In-Reply-To: <Pine.LNX.4.64.0707110840560.15887@zuben>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
Message-ID: <20070715094145.GA16231@mellanox.co.il>

Make mad module use a single workqueue rather than a per-port
workqueue. This way, we'll have less clutter on systems with
a lot of ports.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: [PATCH] IB/mad: fix duplicated kernel thread name
> 
> Roland,
> 
> This is the best I could come up with; it's still a problem
> if you have multiple devices of different providers or
> more than ten devices of the same provider... any other idea?
> 
> --------------------------------------------------------------
> 
> The mad module creates a thread per active port, where the thread name is
> derived from the port name. This causes different threads to have the same
> names when there are multiple devices. Fix that by using both the device
> and the port numbers to derive the name.
> 
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Thinking about it, why would we *want* a per-port thread?
What do you guys think about the following?
As a bonus, this makes it easier to renice the mad thread
for people that want to do this.

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 85ccf13..626d3e4 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -45,6 +45,8 @@ MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
 MODULE_AUTHOR("Sean Hefty");
 
+struct workqueue_struct *ib_mad_wq;
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -525,7 +527,7 @@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv)
 	list_del(&mad_agent_priv->agent_list);
 	spin_unlock_irqrestore(&port_priv->reg_lock, flags);
 
-	flush_workqueue(port_priv->wq);
+	flush_workqueue(ib_mad_wq);
 	ib_cancel_rmpp_recvs(mad_agent_priv);
 
 	deref_mad_agent(mad_agent_priv);
@@ -774,8 +776,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
 	spin_lock_irqsave(&mad_agent_priv->lock, flags);
 	list_add_tail(&local->completion_list, &mad_agent_priv->local_list);
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
-	queue_work(mad_agent_priv->qp_info->port_priv->wq,
-		   &mad_agent_priv->local_work);
+	queue_work(ib_mad_wq, &mad_agent_priv->local_work);
 	ret = 1;
 out:
 	return ret;
@@ -1965,9 +1966,7 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 			delay = mad_send_wr->timeout - jiffies;
 			if ((long)delay <= 0)
 				delay = 1;
-			queue_delayed_work(mad_agent_priv->qp_info->
-					   port_priv->wq,
-					   &mad_agent_priv->timed_work, delay);
+			queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay);
 		}
 	}
 }
@@ -2002,8 +2001,7 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 	/* Reschedule a work item if we have a shorter timeout */
 	if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) {
 		cancel_delayed_work(&mad_agent_priv->timed_work);
-		queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq,
-				   &mad_agent_priv->timed_work, delay);
+		queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay);
 	}
 }
 
@@ -2462,9 +2460,7 @@ static void timeout_sends(struct work_struct *work)
 			delay = mad_send_wr->timeout - jiffies;
 			if ((long)delay <= 0)
 				delay = 1;
-			queue_delayed_work(mad_agent_priv->qp_info->
-					   port_priv->wq,
-					   &mad_agent_priv->timed_work, delay);
+			queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay);
 			break;
 		}
 
@@ -2496,7 +2492,7 @@ static void ib_mad_thread_completion_handler(struct ib_cq *cq, void *arg)
 
 	spin_lock_irqsave(&ib_mad_port_list_lock, flags);
 	if (!list_empty(&port_priv->port_list))
-		queue_work(port_priv->wq, &port_priv->work);
+		queue_work(ib_mad_wq, &port_priv->work);
 	spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
 }
 
@@ -2800,11 +2796,6 @@ static int ib_mad_port_open(struct ib_device *device,
 		goto error7;
 
 	snprintf(name, sizeof name, "ib_mad%d", port_num);
-	port_priv->wq = create_singlethread_workqueue(name);
-	if (!port_priv->wq) {
-		ret = -ENOMEM;
-		goto error8;
-	}
 	INIT_WORK(&port_priv->work, ib_mad_completion_handler);
 
 	spin_lock_irqsave(&ib_mad_port_list_lock, flags);
@@ -2814,18 +2805,15 @@ static int ib_mad_port_open(struct ib_device *device,
 	ret = ib_mad_port_start(port_priv);
 	if (ret) {
 		printk(KERN_ERR PFX "Couldn't start port\n");
-		goto error9;
+		goto error8;
 	}
 
 	return 0;
 
-error9:
+error8:
 	spin_lock_irqsave(&ib_mad_port_list_lock, flags);
 	list_del_init(&port_priv->port_list);
 	spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
-
-	destroy_workqueue(port_priv->wq);
-error8:
 	destroy_mad_qp(&port_priv->qp_info[1]);
 error7:
 	destroy_mad_qp(&port_priv->qp_info[0]);
@@ -2863,7 +2851,7 @@ static int ib_mad_port_close(struct ib_device *device, int port_num)
 	list_del_init(&port_priv->port_list);
 	spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
 
-	destroy_workqueue(port_priv->wq);
+	flush_workqueue(ib_mad_wq);
 	destroy_mad_qp(&port_priv->qp_info[1]);
 	destroy_mad_qp(&port_priv->qp_info[0]);
 	ib_dereg_mr(port_priv->mr);
@@ -2960,6 +2948,12 @@ static int __init ib_mad_init_module(void)
 {
 	int ret;
 
+	ib_mad_wq = create_singlethread_workqueue("ib_mad");
+	if (!ib_mad_wq) {
+		ret = -ENOMEM;
+		goto error0;
+	}
+
 	spin_lock_init(&ib_mad_port_list_lock);
 
 	ib_mad_cache = kmem_cache_create("ib_mad",
@@ -2987,6 +2981,8 @@ static int __init ib_mad_init_module(void)
 error2:
 	kmem_cache_destroy(ib_mad_cache);
 error1:
+	destroy_workqueue(ib_mad_wq);
+error0:
 	return ret;
 }
 
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 9be5cc0..5cd2eb9 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -206,7 +206,6 @@ struct ib_mad_port_private {
 	spinlock_t reg_lock;
 	struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION];
 	struct list_head agent_list;
-	struct workqueue_struct *wq;
 	struct work_struct work;
 	struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE];
 };
@@ -225,4 +224,6 @@ void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr);
 void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr,
 			  int timeout_ms);
 
+extern struct workqueue_struct *ib_mad_wq;
+
 #endif	/* __IB_MAD_PRIV_H__ */
diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
index 3663fd7..b8ee2b7 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -94,7 +94,7 @@ void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent)
 	}
 	spin_unlock_irqrestore(&agent->lock, flags);
 
-	flush_workqueue(agent->qp_info->port_priv->wq);
+	flush_workqueue(ib_mad_wq);
 
 	list_for_each_entry_safe(rmpp_recv, temp_rmpp_recv,
 				 &agent->rmpp_list, list) {
@@ -445,8 +445,7 @@ static struct ib_mad_recv_wc * complete_rmpp(struct mad_rmpp_recv *rmpp_recv)
 	rmpp_wc = rmpp_recv->rmpp_wc;
 	rmpp_wc->mad_len = get_mad_len(rmpp_recv);
 	/* 10 seconds until we can find the packet lifetime */
-	queue_delayed_work(rmpp_recv->agent->qp_info->port_priv->wq,
-			   &rmpp_recv->cleanup_work, msecs_to_jiffies(10000));
+	queue_delayed_work(ib_mad_wq, &rmpp_recv->cleanup_work, msecs_to_jiffies(10000));
 	return rmpp_wc;
 }
 
@@ -538,8 +537,7 @@ start_rmpp(struct ib_mad_agent_private *agent,
 	} else {
 		spin_unlock_irqrestore(&agent->lock, flags);
 		/* 40 seconds until we can find the packet lifetimes */
-		queue_delayed_work(agent->qp_info->port_priv->wq,
-				   &rmpp_recv->timeout_work,
+		queue_delayed_work(ib_mad_wq, &rmpp_recv->timeout_work,
 				   msecs_to_jiffies(40000));
 		rmpp_recv->newwin += window_size(agent);
 		ack_recv(rmpp_recv, mad_recv_wc);


-- 
MST


From vlad at lists.openfabrics.org  Sun Jul 15 02:45:35 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 15 Jul 2007 02:45:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status
Message-ID: <20070715094536.2109FE603CA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:
Build failed on i686 with linux-2.6.22-rc7


From halr at voltaire.com  Sun Jul 15 03:43:20 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Jul 2007 06:43:20 -0400
Subject: [ofa-general] [PATCH] OpenSM: Change force_link_speed to allow for
	local policy and more flexibility
Message-ID: <1184496198.4908.154970.camel@hal.voltaire.com>

OpenSM: Change force_link_speed to allow for local policy and more
flexibility

Extend (and change) the use of force_link_speed as follows:
0 - no change
1 - set to SDR
15 - set as supported (default)
(Non zero values are used to set LinkSpeedEnabled component in PortInfo)

Note that force_link_speed 0 which used to force SDR is now
force_link_speed 1
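
The intended semantics can be summarised in a small standalone sketch. This is an illustration under assumptions, not OpenSM code: the `FLS_*` names and the `fls_*` helpers are hypothetical, and the accepted range (0 to 7, or 15) simply mirrors the check added to osm_subn_verify_conf_file() below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLS_NO_CHANGE      0   /* leave LinkSpeedEnabled untouched */
#define FLS_SET_SUPPORTED 15   /* "set as supported", the new default */

/* Accepted option values: 0..7 and 15 (hypothetical helper). */
static bool fls_is_valid(unsigned force)
{
	return force <= 7 || force == FLS_SET_SUPPORTED;
}

/*
 * Decide which LinkSpeedEnabled value to program.
 * Returns false when the option says "no change"; otherwise stores the
 * value in *enabled and returns true.
 */
static bool fls_resolve(unsigned force, uint8_t old_enabled,
			uint8_t supported, uint8_t *enabled)
{
	assert(enabled);
	assert(fls_is_valid(force));

	if (force == FLS_NO_CHANGE)
		return false;

	if (force == FLS_SET_SUPPORTED)
		/* only request a change if enabled differs from supported */
		*enabled = (old_enabled != supported) ? FLS_SET_SUPPORTED
						      : old_enabled;
	else
		*enabled = (uint8_t)force;   /* e.g. 1 forces SDR */

	return true;
}
```

Under this reading, a configuration that previously used force_link_speed 0 to force SDR would now pass 1.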

"Ideally", there would be a per-port configuration of this.

[Note this is largely untested.]

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index bc3f8b3..aeb1bcc 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1109,14 +1109,20 @@ __osm_lid_mgr_set_physp_pi(
       send_set = TRUE;
 
     if ( p_mgr->p_subn->opt.force_link_speed )
-      ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 );
-    else if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
-      ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
-    else
-      ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
-    if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
-                sizeof(p_pi->link_speed) ))
-      send_set = TRUE;
+    {
+      if ( p_mgr->p_subn->opt.force_link_speed == 15 )  /* LinkSpeedSupported */
+      {
+        if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
+          ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
+        else
+          ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
+      }
+      else
+        ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed );
+      if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
+                  sizeof(p_pi->link_speed) ))
+        send_set = TRUE;
+    }
 
     /* M_KeyProtectBits are always zero */
     p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc;
diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index 25f0fc3..4c0ebc1 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -304,14 +304,20 @@ __osm_link_mgr_set_physp_pi(
       send_set = TRUE;
 
     if ( p_mgr->p_subn->opt.force_link_speed )
-      ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 );
-    else if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
-      ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
-    else
-      ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
-    if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
-                sizeof(p_pi->link_speed) ))
-      send_set = TRUE;
+    {
+      if ( p_mgr->p_subn->opt.force_link_speed == 15 )	/* LinkSpeedSupported */
+      {
+        if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
+          ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
+        else
+          ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
+      }
+      else
+        ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed );
+      if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
+                  sizeof(p_pi->link_speed) ))
+        send_set = TRUE;
+    }
 
     /* calc new op_vls and mtu */
     op_vls =
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ae672f8..c60dcb4 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -449,7 +449,7 @@ osm_subn_set_default_opt(
   p_opt->lmc = OSM_DEFAULT_LMC;
   p_opt->lmc_esp0 = FALSE;
   p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS;
-  p_opt->force_link_speed = 0;
+  p_opt->force_link_speed = 15;
   p_opt->reassign_lids = FALSE;
   p_opt->reassign_lfts = TRUE;
   p_opt->ignore_other_sm = FALSE;
@@ -1017,6 +1017,17 @@ osm_subn_verify_conf_file(
     p_opts->sm_priority = OSM_DEFAULT_SM_PRIORITY;
   }
 
+  if ((15 < p_opts->force_link_speed) ||
+      (p_opts->force_link_speed > 7 && p_opts->force_link_speed < 15))
+  {
+    sprintf(buff, " Invalid Cached Option Value:force_link_speed = %u:"
+            "Using Default:%u\n",
+            p_opts->force_link_speed, IB_PORT_LINK_SPEED_ENABLED_MASK);
+    printf(buff);
+    cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+    p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
+  }
+
   if (strcmp(p_opts->console, "off")
       && strcmp(p_opts->console, "local")
 #ifdef ENABLE_OSM_CONSOLE_SOCKET
@@ -1476,17 +1487,19 @@ osm_subn_write_conf_file(
     "# to zero is undefined.\n"
     "leaf_vl_stall_count 0x%02x\n\n"
     "# The code of maximal time a packet can wait at the head of\n"
-    "# transmission queue. \n"
+    "# transmission queue.\n"
     "# The actual time is 4.096usec * 2^<head_of_queue_lifetime>\n"
     "# The value 0x14 disables this mechanism\n"          
     "head_of_queue_lifetime 0x%02x\n\n"
-    "# The maximal time a packet can wait at the head of queue on \n"
+    "# The maximal time a packet can wait at the head of queue on\n"
     "# switch port connected to a CA or router port\n"
     "leaf_head_of_queue_lifetime 0x%02x\n\n"
     "# Limit the maximal operational VLs\n"
     "max_op_vls %u\n\n"
-    "# Force switch links which are more than SDR capable to \n"
-    "# operate at SDR speed\n\n"
+    "# Force link speed enable on switch links\n"
+    "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
+    "# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port\n"
+    "# Default is 15 (to set to PortInfo:LinkSpeedSupported)\n\n"
     "force_link_speed %u\n\n"
     "# The subnet_timeout code that will be set for all the ports\n"
     "# The actual timeout is 4.096usec * 2^<subnet_timeout>\n"


From tziporet at dev.mellanox.co.il  Sun Jul 15 03:58:30 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Jul 2007 13:58:30 +0300
Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
In-Reply-To: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>
References: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>
Message-ID: <4699FDD6.3010305@mellanox.co.il>

Sean Hefty wrote:
>
> Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', but just 'c-9' -
> meaning it includes support for Mellanox ConnectX adapter).  OFED 1.2 GA was
> released in June. 
>
> Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox specific code
> release that repackages the OFED 1.2 code?
>   
It should be an OFED release, since several OEMs said they are going to
test and QA it.
We cannot wait for 1.3 since some clusters are being brought up in Q3.

It would be good if we could unify 1.2.c with 1.2.1, which was requested in
the same time frame.
Any thoughts on this?

Tziporet


From kliteyn at dev.mellanox.co.il  Sun Jul 15 04:56:32 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 15 Jul 2007 14:56:32 +0300
Subject: [ofa-general] [PATCH] osm: some improvements to fat-tree routing
Message-ID: <469A0B70.1020101@dev.mellanox.co.il>

Hi Sasha

This patch adds a small improvement to fat-tree routing for
asymmetrical (or unusual) trees:
1. When routing down-going routes (by climbing up the tree),
   first select the least loaded group, and then the least loaded
   port in the selected group.
2. When routing up-going routes (by descending down the tree),
   scan groups in indexing order, but select the start group
   by round-robin.
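
The two selection rules can be sketched in isolation as follows. This is a simplified illustration, not the patch itself: `port_sketch`, `group_sketch` and the three helpers are hypothetical names, and ties are assumed to go to the lowest index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for ftree_port_t / ftree_port_group_t. */
struct port_sketch {
	uint32_t counter_down;          /* routes already allocated on this port */
};

struct group_sketch {
	uint32_t counter_down;          /* routes already allocated on this group */
	struct port_sketch *ports;
	size_t ports_num;
};

/* Index of the least loaded group; the first one wins ties. */
static size_t least_loaded_group(const struct group_sketch *groups, size_t n)
{
	size_t min = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (groups[i].counter_down < groups[min].counter_down)
			min = i;
	return min;
}

/* Index of the least loaded port inside one group; the first one wins ties. */
static size_t least_loaded_port(const struct group_sketch *g)
{
	size_t min = 0;

	assert(g->ports_num > 0);
	for (size_t j = 1; j < g->ports_num; j++)
		if (g->ports[j].counter_down < g->ports[min].counter_down)
			min = j;
	return min;
}

/* Round-robin start index: -1 means "never used", otherwise advance by one. */
static int next_start_group(int prev_idx, int groups_num)
{
	assert(groups_num > 0);
	return prev_idx == -1 ? 0 : (prev_idx + 1) % groups_num;
}
```

Note the difference from a flat scan over all ports: a group that already carries many routes is skipped even if one of its ports happens to be idle.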

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_ftree.c |   79 ++++++++++++++++++++++++++-------------
 1 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 38bee8a..cfe5435 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -179,6 +179,7 @@ typedef struct ftree_port_group_t_
    ftree_hca_or_sw remote_hca_or_sw;  /* pointer to remote hca/switch */
    cl_ptr_vector_t ports;             /* vector of ports to the same lid */
    boolean_t       is_cn;             /* whether this port is a compute node */
+   uint32_t        counter_down;      /* number of allocated routes downwards */
 } ftree_port_group_t;
 
 /***************************************************
@@ -200,6 +201,7 @@ typedef struct ftree_sw_t_
    uint8_t                up_port_groups_num;
    ftree_fwd_tbl_t        lft_buf;
    boolean_t              is_leaf;
+   int                    down_port_groups_idx;
 } ftree_sw_t;
 
 /***************************************************
@@ -681,6 +683,8 @@ __osm_ftree_sw_create(
    p_sw->lft_buf = (ftree_fwd_tbl_t)cl_pool_get(&p_ftree->sw_fwd_tbl_pool);
    memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN);
 
+   p_sw->down_port_groups_idx = -1;
+
    return p_sw;
 } /* __osm_ftree_sw_create() */
 
@@ -2145,6 +2149,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
    ftree_port_t        * p_min_port;
    uint16_t              i;
    uint16_t              j;
+   uint16_t              k;
 
    /* we shouldn't enter here if both real_lid and main_path are false */
    CL_ASSERT(is_real_lid || is_main_path);
@@ -2153,9 +2158,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down(
    if (p_sw->down_port_groups_num == 0)
        return;
 
-   /* foreach down-going port group (in indexing order) */
-   for (i = 0; i < p_sw->down_port_groups_num; i++)
+   /* promote the index that indicates which group should we
+      start with when going through all the downgoing groups */
+   if (p_sw->down_port_groups_idx == -1)
+      p_sw->down_port_groups_idx = 0;
+   else
+      p_sw->down_port_groups_idx = 
+            (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+
+   /* foreach down-going port group (in indexing order)
+      starting with the group selected by round-robin */
+   for ( k = 0; k < p_sw->down_port_groups_num; k++ )
    {
+      if ( k == 0 )
+         i = p_sw->down_port_groups_idx;
+      else
+         i = (i+1) % p_sw->down_port_groups_num;
+
       p_group = p_sw->down_port_groups[i];
 
       /* Skip this port group unless it points to a switch */
@@ -2352,34 +2371,40 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
    if (p_sw->rank == 0)
       return;
 
-   /* Find the least loaded port of all the upgoing port groups
-      (in indexing order of the remote switches). */
+   /* Find the least loaded upgoing port group */
    p_min_group = NULL;
-   p_min_port = NULL;
    for (i = 0; i < p_sw->up_port_groups_num; i++)
    {
       p_group = p_sw->up_port_groups[i];
+      if (!p_min_group)
+      {
+         /* first group that we're checking - use 
+            it as a group with the lowest load */
+         p_min_group = p_group;
+      }
+      else if ( p_group->counter_down < p_min_group->counter_down  )
+      {
+         /* this group is less loaded - use it as min */
+         p_min_group = p_group;
+      }
+   }
 
-      ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports);
-      for (j = 0; j < ports_num; j++)
+   /* Find the least loaded upgoing port in the selected group */
+   p_min_port = NULL;
+   ports_num = (uint16_t)cl_ptr_vector_get_size(&p_min_group->ports);
+   for (j = 0; j < ports_num; j++)
+   {
+      cl_ptr_vector_at(&p_min_group->ports, j, (void **)&p_port);
+      if (!p_min_port)
       {
-         cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port);
-         if (!p_min_group)
-         {
-            /* first port that we're checking - use
-               it as a port with the lowest load */
-            p_min_group = p_group;
-            p_min_port = p_port;
-         }
-         else
-         {
-            if ( p_port->counter_down < p_min_port->counter_down  )
-            {
-               /* this port is less loaded - use it as min */
-               p_min_group = p_group;
-               p_min_port = p_port;
-            }
-         }
+         /* first port that we're checking - use
+            it as a port with the lowest load */
+         p_min_port = p_port;
+      }
+      else if ( p_port->counter_down < p_min_port->counter_down  )
+      {
+         /* this port is less loaded - use it as min */
+         p_min_port = p_port;
       }
    }
 
@@ -2435,8 +2460,10 @@ __osm_ftree_fabric_route_downgoing_by_going_up(
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple));
       }
       /* The number of downgoing routes is tracked in the
-         p_port->counter_down counter of the port that belongs to
-         the lower side of the link (on switch with higher rank) */
+         p_group->counter_down and p_port->counter_down counters of the
+         group and port that belong to the lower side of the link
+         (on switch with higher rank) */
+      p_min_group->counter_down++;
       p_min_port->counter_down++;
       if (is_real_lid)
       {
-- 
1.5.1.4


From dotanb at dev.mellanox.co.il  Sun Jul 15 05:00:09 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 15 Jul 2007 15:00:09 +0300
Subject: [ofa-general] [PATCH] mlx4/IB: Take sizeof the correct pointer when
	calling to memset
Message-ID: <200707151500.09578.dotanb@dev.mellanox.co.il>

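Take sizeof the correct pointer when calling memset: to_ib_ah_attr()
cleared ib_ah_attr using sizeof *path; use sizeof *ib_ah_attr so that
exactly the destination structure is zeroed.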
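
A standalone sketch of the idiom, for illustration only (the `*_sketch` structs are hypothetical and their sizes are made up; only the memset pattern and the port_num expression are taken from the diff below):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical structs of deliberately different sizes. */
struct path_sketch    { unsigned char raw[8]; };
struct ah_attr_sketch { unsigned char port_num; unsigned char rest[31]; };

static void to_ah_attr_sketch(struct ah_attr_sketch *attr,
			      const struct path_sketch *path)
{
	assert(attr && path);

	/*
	 * sizeof *attr always matches the object being cleared.
	 * Using sizeof *path here would clear too little or too much,
	 * depending on which struct is larger.
	 */
	memset(attr, 0, sizeof *attr);
	attr->port_num = (path->raw[0] & 0x40) ? 2 : 1;
}
```

Tying the size to the destination pointer keeps the call correct even if either structure changes later.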

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4004218..ab6f0b7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1498,7 +1498,7 @@ static int to_ib_qp_access_flags(int mlx4_flags)
 static void to_ib_ah_attr(struct mlx4_dev *dev, struct ib_ah_attr *ib_ah_attr,
 				struct mlx4_qp_path *path)
 {
-	memset(ib_ah_attr, 0, sizeof *path);
+	memset(ib_ah_attr, 0, sizeof *ib_ah_attr);
 	ib_ah_attr->port_num	  = path->sched_queue & 0x40 ? 2 : 1;
 
 	if (ib_ah_attr->port_num == 0 || ib_ah_attr->port_num > dev->caps.num_ports)


From tziporet at dev.mellanox.co.il  Sun Jul 15 05:15:41 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Jul 2007 15:15:41 +0300
Subject: [ofa-general] OFED 1.3 timeline
In-Reply-To: <OF68674E89.B4A67922-ON87257317.005E8658-88257317.005F8CC3@us.ibm.com>
References: <OF68674E89.B4A67922-ON87257317.005E8658-88257317.005F8CC3@us.ibm.com>
Message-ID: <469A0FED.5040903@mellanox.co.il>

Shirley Ma wrote:
> 1.  skb aggregations for both dev xmit(networking layer) and IPoIB send
> 2.  multiple interrupt vectors in IPoIB for multiple links scalability
> 3.  split CQ and send completion aggregation
> 4.  LRO for IPoIB when generic LRO is available in networking layer.
>  
>         Some of them might be made on time in ofed-1.3 timeline, some of 
> them might not. It will depend on our test progresses and community review 
> feedbacks. I hope ofed-1.3 won't leave these patches out if they can be 
> made into 2.6.23 on time.
>
> Thanks
> Shirley
>
>   
OFED 1.3 kernel code is based on 2.6.23. Anything that makes it into
the kernel on time will be in OFED 1.3 too.

Tziporet


From tziporet at dev.mellanox.co.il  Sun Jul 15 05:26:59 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Jul 2007 15:26:59 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
Message-ID: <469A1293.6020902@mellanox.co.il>

Roland Dreier wrote:
> As you can see, I just sent my first 2.6.23 pull request for Linus.
> There are still a few more things I plan to do in before the merge
> window closes (in ~10 days):
>
>   
Until when can we still get mlx4 with FMRs merged?

Tziporet


From kliteyn at dev.mellanox.co.il  Sun Jul 15 06:36:30 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 15 Jul 2007 16:36:30 +0300
Subject: [ofa-general] [PATCH] opensm/updn: --connect_roots option
In-Reply-To: <20070621212919.GL25653@sashak.voltaire.com>
References: <20070621212919.GL25653@sashak.voltaire.com>
Message-ID: <469A22DE.5010301@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> With this option up/down preserves route paths (based on min hops
> knowledge) between root switches. This makes up/down IBA compliant
> (where all to all connectivity is required), OTOH this violates up/down
> deadlock free algorithm. By default this option is 'off'.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

If I understand you correctly, this patch does what it says - connects
*roots*. But what if other switches are not connected because of the up/down
constraints?
For instance, the fabric can actually be built of several sub-trees that are
connected only at the leaf switch rank, so up/down has no path between any
two switches from different sub-trees at ranks from 0 up to (but not
including) the leaf rank.
Moreover, I can think of a topology where some CA-to-CA paths will be missing too.
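
To make the constraint concrete, here is a minimal sketch of the up/down rule as I understand it: a legal path never goes up again once it has gone down. The function name is hypothetical, rank 0 is assumed to be the root rank, and hops between equal ranks (which the real algorithm has to order somehow) are simply ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * ranks[i] is the rank of the i-th switch on the path; rank 0 is a root,
 * larger ranks are farther from the roots.
 * Returns true if the path never moves up after it has moved down.
 */
static bool updn_path_is_legal(const unsigned *ranks, size_t n)
{
	bool went_down = false;

	assert(ranks || n == 0);
	for (size_t i = 1; i < n; i++) {
		if (ranks[i] > ranks[i - 1])
			went_down = true;      /* moving away from the roots */
		else if (ranks[i] < ranks[i - 1] && went_down)
			return false;          /* up after down: not allowed */
	}
	return true;
}
```

In the sub-tree example, a path between two rank-1 switches through a shared rank-2 leaf has ranks 1, 2, 1 and is rejected, while a path through a common root (1, 0, 1) would be accepted if such a root existed.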

A similar problem exists in fat-tree routing.

Thoughts?

-- Yevgeny


> ---
>  opensm/include/opensm/osm_subnet.h |    6 ++++++
>  opensm/man/opensm.8                |    8 +++++++-
>  opensm/opensm/main.c               |   15 ++++++++++++++-
>  opensm/opensm/osm_subnet.c         |   10 ++++++++++
>  opensm/opensm/osm_ucast_updn.c     |   27 ++++++++++++++++++++++++++-
>  5 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index 2ee5689..43b1589 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -276,6 +276,7 @@ typedef struct _osm_subn_opt
>    boolean_t                sweep_on_trap;
>    osm_testability_modes_t  testability_mode;
>    char *                   routing_engine_name;
> +  boolean_t                connect_roots;
>    char *                   lid_matrix_dump_file;
>    char *                   ucast_dump_file;
>    char *                   root_guid_file;
> @@ -445,6 +446,11 @@ typedef struct _osm_subn_opt
>  *		Name of used routing engine
>  *		(other than default Min Hop Algorithm)
>  *
> +*	connect_roots
> +*		The option which will enforce root to root connectivity with
> +*		up/down routing engine (even if this violates "pure" deadlock
> +*		free up/down algorithm)
> +*
>  *	lid_matrix_dump_file
>  *		Name of the lid matrix dump file from where switch
>  *		lid matrices (min hops tables) will be loaded
> diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
> index 4d35689..40e0235 100644
> --- a/opensm/man/opensm.8
> +++ b/opensm/man/opensm.8
> @@ -5,7 +5,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
>  
>  .SH SYNOPSIS
>  .B opensm
> -[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>] [\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)] [\-R <engine name> | \-\-routing_engine <engine name>] [\-M <file name> | \-\-lid_matrix_file <file name>] [\-U <file name> | \-ucast_file <file name>] [\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>] [\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>] [\-t(imeout) <milliseconds>] [\-maxsmps <number>] [\-console [off | local | socket]] [\-console-port <port>] [\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file] [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s <seconds>] [\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
> +[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>] [\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)] [\-R <engine name> | \-\-routing_engine <engine name>] [\-z | \-\-connect_roots] [\-M <file name> | \-\-lid_matrix_file <file name>] [\-U <file name> | \-ucast_file <file name>] [\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>] [\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>] [\-t(imeout) <milliseconds>] [\-maxsmps <number>] [\-console [off | local | socket]] [\-console-port <port>] [\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file] [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s <seconds>] [\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
>  
>  .SH DESCRIPTION
>  .PP
> @@ -94,6 +94,12 @@ This option chooses routing engine instead of Min Hop
>  algorithm (default).
>  Supported engines: updn, file, ftree, lash
>  .TP
> +\fB\-z\fR, \fB\-\-connect_roots\fR
> +This option enforces a routing engine (currently up/down
> +only) to make connectivity between root switches and in
> +this way to be fully IBA compliant. In many cases this can
> +violate "pure" deadlock free algorithm, so use it carefully.
> +.TP
>  \fB\-M\fR, \fB\-\-lid_matrix_file\fR
>  This option specifies the name of the lid matrix dump file
>  from where switch lid matrices (min hops tables will be
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 0d5e0eb..e182276 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -175,6 +175,13 @@ show_usage(void)
>            "          This option chooses routing engine instead of Min Hop\n"
>            "          algorithm (default).\n"
>            "          Supported engines: updn, file, ftree\n\n");
> +  printf( "-z\n"
> +          "--connect_roots\n"
> +          "          This option enforces a routing engine (currently\n"
> +          "          up/down only) to make connectivity between root switches\n"
> +          "          and in this way to be fully IBA compliant. In many cases\n"
> +          "          this can violate \"pure\" deadlock free algorithm, so\n"
> +          "          use it carefully.\n\n");
>    printf( "-M\n"
>            "--lid_matrix_file <file name>\n"
>            "          This option specifies the name of the lid matrix dump file\n"
> @@ -591,7 +598,7 @@ main(
>    char                 *ignore_guids_file_name = NULL;
>    uint32_t              val;
>    const char * const    short_option =
> -	  "i:f:ed:g:l:L:s:t:a:u:R:M:U:S:P:NBIQvVhorcyxp:n:q:k:C:";
> +	  "i:f:ed:g:l:L:s:t:a:u:R:zM:U:S:P:NBIQvVhorcyxp:n:q:k:C:";
>  
>    /*
>      In the array below, the 2nd parameter specifies the number
> @@ -625,6 +632,7 @@ main(
>        {  "priority",      1, NULL, 'p'},
>        {  "smkey",         1, NULL, 'k'},
>        {  "routing_engine",1, NULL, 'R'},
> +      {  "connect_roots", 0, NULL, 'z'},
>        {  "lid_matrix_file",1, NULL, 'M'},
>        {  "ucast_file",    1, NULL, 'U'},
>        {  "sadb_file",     1, NULL, 'S'},
> @@ -876,6 +884,11 @@ main(
>        printf(" Activate \'%s\' routing engine\n", optarg);
>        break;
>  
> +    case 'z':
> +      opt.connect_roots = TRUE;
> +      printf(" Connect roots option is on\n");
> +      break;
> +
>      case 'M':
>        opt.lid_matrix_dump_file = optarg;
>        printf(" Lid matrix dump file is \'%s\'\n", optarg);
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 82d66f9..8f429ae 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -500,6 +500,7 @@ osm_subn_set_default_opt(
>    p_opt->sweep_on_trap = TRUE;
>    p_opt->testability_mode = OSM_TEST_MODE_NONE;
>    p_opt->routing_engine_name = NULL;
> +  p_opt->connect_roots = FALSE;
>    p_opt->lid_matrix_dump_file = NULL;
>    p_opt->ucast_dump_file = NULL;
>    p_opt->root_guid_file = NULL;
> @@ -1290,6 +1291,10 @@ osm_subn_parse_conf_file(
>          "routing_engine",
>          p_key, p_val, &p_opts->routing_engine_name);
>  
> +      __osm_subn_opts_unpack_boolean(
> +        "connect_roots",
> +        p_key, p_val, &p_opts->connect_roots);
> +
>        __osm_subn_opts_unpack_charp(
>          "log_file", p_key, p_val, &p_opts->log_file);
>  
> @@ -1545,6 +1550,11 @@ osm_subn_write_conf_file(
>               "# Routing engine\n"
>               "routing_engine %s\n\n",
>               p_opts->routing_engine_name);
> +  if (p_opts->connect_roots)
> +    fprintf( opts_file,
> +             "# Connect roots (use FALSE if unsure)\n"
> +             "connect_roots %s\n\n",
> +             p_opts->connect_roots ? "TRUE" : "FALSE");
>    if (p_opts->lid_matrix_dump_file)
>      fprintf( opts_file,
>               "# Lid matrix dump file name\n"
> diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
> index af5ee4e..db8e60a 100644
> --- a/opensm/opensm/osm_ucast_updn.c
> +++ b/opensm/opensm/osm_ucast_updn.c
> @@ -449,6 +449,24 @@ updn_subn_rank(
>  
>  /**********************************************************************
>   **********************************************************************/
> +/* hack: preserve min hops entries to any other root switches */
> +static void
> +updn_clear_root_hops(updn_t *p_updn, osm_switch_t *p_sw)
> +{
> +  osm_port_t *p_port;
> +  unsigned i;
> +
> +  for ( i = 0 ; i < p_sw->num_hops ; i++ )
> +    if (p_sw->hops[i]) {
> +      p_port = cl_ptr_vector_get(&p_updn->p_osm->subn.port_lid_tbl, i);
> +      if (!p_port || !p_port->p_node->sw ||
> +          ((struct updn_node *)p_port->p_node->sw->priv)->rank != 0)
> +        memset(p_sw->hops[i], 0xff, p_sw->num_ports);
> +    }
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
>  static int
>  __osm_subn_set_up_down_min_hop_table(
>    IN updn_t* p_updn )
> @@ -471,7 +489,10 @@ __osm_subn_set_up_down_min_hop_table(
>      p_sw = p_next_sw;
>      p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item );
>      /* Clear Min Hop Table */
> -    osm_switch_clear_hops(p_sw);
> +    if (p_subn->opt.connect_roots && !((struct updn_node *)p_sw->priv)->rank)
> +      updn_clear_root_hops(p_updn, p_sw);
> +    else
> +      osm_switch_clear_hops(p_sw);
>    }
>  
>    osm_log( p_log, OSM_LOG_VERBOSE,
> @@ -607,6 +628,10 @@ __osm_updn_call(
>      osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr );
>      __osm_updn_find_root_nodes_by_min_hop( p_updn );
>    }
> +  else if (p_updn->p_osm->subn.opt.connect_roots &&
> +           p_updn->updn_ucast_reg_inputs.num_guids > 1)
> +    osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr );
> +
>    /* printf ("-V- after osm_updn_find_root_nodes_by_min_hop\n"); */
>    /* Only if there are assigned root nodes do the algorithm, otherwise perform do nothing */
>    if ( p_updn->updn_ucast_reg_inputs.num_guids > 0)


From sashak at voltaire.com  Sun Jul 15 06:47:17 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Jul 2007 16:47:17 +0300
Subject: [ofa-general] [PATCH] opensm/updn: --connect_roots option
In-Reply-To: <469A22DE.5010301@dev.mellanox.co.il>
References: <20070621212919.GL25653@sashak.voltaire.com>
	<469A22DE.5010301@dev.mellanox.co.il>
Message-ID: <1184507237.19232.9.camel@localhost>

Hi Yevgeny,

On Sun, 2007-07-15 at 16:36 +0300, Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Sasha Khapyorsky wrote:
> > With this option up/down preserves route paths (based on min hops
> > knowledge) between root switches. This makes up/down IBA compliant
> > (where all to all connectivity is required), OTOH this violates up/down
> > deadlock free algorithm. By default this option is 'off'.
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> If I understand you correctly, this patch does what it says - connects
> *roots*.

Yes, and in this respect it can violate up/down rules.

>  But what if other switches are not connected because of the up/down
> constraints?

Another constraint is which roots are defined. In your example you
could add the connected leaves to the roots list, or have only the
connected leaves as roots (depending on the exact topology).

Sasha

> For instance, the fabric can be actually built of several sub-trees that are
> connected only at leaf switch rank, so there is no path in up/down between any
> two switches from different sub-trees at ranks 0 to leaf rank (not inclusively).
> Moreover, I can think of a topology where some CA-to-CA paths will be missing too.
> 
> Similar problem exists in fat-tree routing.
> 
> Thoughts?
> 
> -- Yevgeny
> 
> 
> > ---
> >  opensm/include/opensm/osm_subnet.h |    6 ++++++
> >  opensm/man/opensm.8                |    8 +++++++-
> >  opensm/opensm/main.c               |   15 ++++++++++++++-
> >  opensm/opensm/osm_subnet.c         |   10 ++++++++++
> >  opensm/opensm/osm_ucast_updn.c     |   27 ++++++++++++++++++++++++++-
> >  5 files changed, 63 insertions(+), 3 deletions(-)
> > 
> > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> > index 2ee5689..43b1589 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -276,6 +276,7 @@ typedef struct _osm_subn_opt
> >    boolean_t                sweep_on_trap;
> >    osm_testability_modes_t  testability_mode;
> >    char *                   routing_engine_name;
> > +  boolean_t                connect_roots;
> >    char *                   lid_matrix_dump_file;
> >    char *                   ucast_dump_file;
> >    char *                   root_guid_file;
> > @@ -445,6 +446,11 @@ typedef struct _osm_subn_opt
> >  *		Name of used routing engine
> >  *		(other than default Min Hop Algorithm)
> >  *
> > +*	connect_roots
> > +*		The option which will enforce root to root connectivity with
> > +*		up/down routing engine (even if this violates "pure" deadlock
> > +*		free up/down algorithm)
> > +*
> >  *	lid_matrix_dump_file
> >  *		Name of the lid matrix dump file from where switch
> >  *		lid matrices (min hops tables) will be loaded
> > diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
> > index 4d35689..40e0235 100644
> > --- a/opensm/man/opensm.8
> > +++ b/opensm/man/opensm.8
> > @@ -5,7 +5,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
> >  
> >  .SH SYNOPSIS
> >  .B opensm
> > -[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>] [\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)] [\-R <engine name> | \-\-routing_engine <engine name>] [\-M <file name> | \-\-lid_matrix_file <file name>] [\-U <file name> | \-ucast_file <file name>] [\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>] [\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>] [\-t(imeout) <milliseconds>] [\-maxsmps <number>] [\-console [off | local | socket]] [\-console-port <port>] [\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file] [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s <seconds>] [\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
> > +[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>] [\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)] [\-R <engine name> | \-\-routing_engine <engine name>] [\-z | \-\-connect_roots] [\-M <file name> | \-\-lid_matrix_file <file name>] [\-U <file name> | \-ucast_file <file name>] [\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>] [\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>] [\-t(imeout) <milliseconds>] [\-maxsmps <number>] [\-console [off | local | socket]] [\-console-port <port>] [\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file] [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s <seconds>] [\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
> >  
> >  .SH DESCRIPTION
> >  .PP
> > @@ -94,6 +94,12 @@ This option chooses routing engine instead of Min Hop
> >  algorithm (default).
> >  Supported engines: updn, file, ftree, lash
> >  .TP
> > +\fB\-z\fR, \fB\-\-connect_roots\fR
> > +This option enforces a routing engine (currently up/down
> > +only) to make connectivity between root switches and in
> > +this way to be fully IBA compliant. In many cases this can
> > +violate "pure" deadlock free algorithm, so use it carefully.
> > +.TP
> >  \fB\-M\fR, \fB\-\-lid_matrix_file\fR
> >  This option specifies the name of the lid matrix dump file
> >  from where switch lid matrices (min hops tables will be
> > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> > index 0d5e0eb..e182276 100644
> > --- a/opensm/opensm/main.c
> > +++ b/opensm/opensm/main.c
> > @@ -175,6 +175,13 @@ show_usage(void)
> >            "          This option chooses routing engine instead of Min Hop\n"
> >            "          algorithm (default).\n"
> >            "          Supported engines: updn, file, ftree\n\n");
> > +  printf( "-z\n"
> > +          "--connect_roots\n"
> > +          "          This option enforces a routing engine (currently\n"
> > +          "          up/down only) to make connectivity between root switches\n"
> > +          "          and in this way to be fully IBA compliant. In many cases\n"
> > +          "          this can violate \"pure\" deadlock free algorithm, so\n"
> > +          "          use it carefully.\n\n");
> >    printf( "-M\n"
> >            "--lid_matrix_file <file name>\n"
> >            "          This option specifies the name of the lid matrix dump file\n"
> > @@ -591,7 +598,7 @@ main(
> >    char                 *ignore_guids_file_name = NULL;
> >    uint32_t              val;
> >    const char * const    short_option =
> > -	  "i:f:ed:g:l:L:s:t:a:u:R:M:U:S:P:NBIQvVhorcyxp:n:q:k:C:";
> > +	  "i:f:ed:g:l:L:s:t:a:u:R:zM:U:S:P:NBIQvVhorcyxp:n:q:k:C:";
> >  
> >    /*
> >      In the array below, the 2nd parameter specifies the number
> > @@ -625,6 +632,7 @@ main(
> >        {  "priority",      1, NULL, 'p'},
> >        {  "smkey",         1, NULL, 'k'},
> >        {  "routing_engine",1, NULL, 'R'},
> > +      {  "connect_roots", 0, NULL, 'z'},
> >        {  "lid_matrix_file",1, NULL, 'M'},
> >        {  "ucast_file",    1, NULL, 'U'},
> >        {  "sadb_file",     1, NULL, 'S'},
> > @@ -876,6 +884,11 @@ main(
> >        printf(" Activate \'%s\' routing engine\n", optarg);
> >        break;
> >  
> > +    case 'z':
> > +      opt.connect_roots = TRUE;
> > +      printf(" Connect roots option is on\n");
> > +      break;
> > +
> >      case 'M':
> >        opt.lid_matrix_dump_file = optarg;
> >        printf(" Lid matrix dump file is \'%s\'\n", optarg);
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index 82d66f9..8f429ae 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -500,6 +500,7 @@ osm_subn_set_default_opt(
> >    p_opt->sweep_on_trap = TRUE;
> >    p_opt->testability_mode = OSM_TEST_MODE_NONE;
> >    p_opt->routing_engine_name = NULL;
> > +  p_opt->connect_roots = FALSE;
> >    p_opt->lid_matrix_dump_file = NULL;
> >    p_opt->ucast_dump_file = NULL;
> >    p_opt->root_guid_file = NULL;
> > @@ -1290,6 +1291,10 @@ osm_subn_parse_conf_file(
> >          "routing_engine",
> >          p_key, p_val, &p_opts->routing_engine_name);
> >  
> > +      __osm_subn_opts_unpack_boolean(
> > +        "connect_roots",
> > +        p_key, p_val, &p_opts->connect_roots);
> > +
> >        __osm_subn_opts_unpack_charp(
> >          "log_file", p_key, p_val, &p_opts->log_file);
> >  
> > @@ -1545,6 +1550,11 @@ osm_subn_write_conf_file(
> >               "# Routing engine\n"
> >               "routing_engine %s\n\n",
> >               p_opts->routing_engine_name);
> > +  if (p_opts->connect_roots)
> > +    fprintf( opts_file,
> > +             "# Connect roots (use FALSE if unsure)\n"
> > +             "connect_roots %s\n\n",
> > +             p_opts->connect_roots ? "TRUE" : "FALSE");
> >    if (p_opts->lid_matrix_dump_file)
> >      fprintf( opts_file,
> >               "# Lid matrix dump file name\n"
> > diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c
> > index af5ee4e..db8e60a 100644
> > --- a/opensm/opensm/osm_ucast_updn.c
> > +++ b/opensm/opensm/osm_ucast_updn.c
> > @@ -449,6 +449,24 @@ updn_subn_rank(
> >  
> >  /**********************************************************************
> >   **********************************************************************/
> > +/* hack: preserve min hops entries to any other root switches */
> > +static void
> > +updn_clear_root_hops(updn_t *p_updn, osm_switch_t *p_sw)
> > +{
> > +  osm_port_t *p_port;
> > +  unsigned i;
> > +
> > +  for ( i = 0 ; i < p_sw->num_hops ; i++ )
> > +    if (p_sw->hops[i]) {
> > +      p_port = cl_ptr_vector_get(&p_updn->p_osm->subn.port_lid_tbl, i);
> > +      if (!p_port || !p_port->p_node->sw ||
> > +          ((struct updn_node *)p_port->p_node->sw->priv)->rank != 0)
> > +        memset(p_sw->hops[i], 0xff, p_sw->num_ports);
> > +    }
> > +}
> > +
> > +/**********************************************************************
> > + **********************************************************************/
> >  static int
> >  __osm_subn_set_up_down_min_hop_table(
> >    IN updn_t* p_updn )
> > @@ -471,7 +489,10 @@ __osm_subn_set_up_down_min_hop_table(
> >      p_sw = p_next_sw;
> >      p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item );
> >      /* Clear Min Hop Table */
> > -    osm_switch_clear_hops(p_sw);
> > +    if (p_subn->opt.connect_roots && !((struct updn_node *)p_sw->priv)->rank)
> > +      updn_clear_root_hops(p_updn, p_sw);
> > +    else
> > +      osm_switch_clear_hops(p_sw);
> >    }
> >  
> >    osm_log( p_log, OSM_LOG_VERBOSE,
> > @@ -607,6 +628,10 @@ __osm_updn_call(
> >      osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr );
> >      __osm_updn_find_root_nodes_by_min_hop( p_updn );
> >    }
> > +  else if (p_updn->p_osm->subn.opt.connect_roots &&
> > +           p_updn->updn_ucast_reg_inputs.num_guids > 1)
> > +    osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr );
> > +
> >    /* printf ("-V- after osm_updn_find_root_nodes_by_min_hop\n"); */
> >    /* Only if there are assigned root nodes do the algorithm, otherwise perform do nothing */
> >    if ( p_updn->updn_ucast_reg_inputs.num_guids > 0)
> 
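
For readers following the thread, the root-preserving clear in the quoted
patch can be illustrated with a small standalone sketch. Everything named
demo_* below is a hypothetical, simplified stand-in and not OpenSM's real
data structures; the only things taken from the patch are the idea (keep
min-hop rows whose destination is a rank-0 switch, wipe the rest) and the
0xff "no path" marker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NO_PATH   0xff	/* same marker the patch writes with memset() */
#define DEMO_MAX_PORTS 4	/* assumption: tiny fixed port count for the demo */

/* Hypothetical, simplified min-hop table: hops[lid][port] is the hop
 * count to destination 'lid' when leaving through 'port'. */
struct demo_switch {
	unsigned num_lids;
	unsigned char (*hops)[DEMO_MAX_PORTS];
};

/* Wipe every row except those whose destination LID belongs to a root
 * (rank 0) switch. lid_is_root[lid] is nonzero for such LIDs. */
static void demo_clear_root_hops(struct demo_switch *sw,
				 const unsigned char *lid_is_root)
{
	unsigned lid;

	assert(sw != NULL && lid_is_root != NULL);
	for (lid = 0; lid < sw->num_lids; lid++)
		if (!lid_is_root[lid])
			memset(sw->hops[lid], DEMO_NO_PATH, DEMO_MAX_PORTS);
}
```

With a three-row table in which only LID 1 belongs to a root switch, rows 0
and 2 come back filled with 0xff and row 1 keeps its min-hop values, which
is what lets up/down later route root to root.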


From sashak at voltaire.com  Sun Jul 15 13:58:52 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Jul 2007 23:58:52 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM: Change force_link_speed to allow
	for local policy and more flexibility
In-Reply-To: <1184496198.4908.154970.camel@hal.voltaire.com>
References: <1184496198.4908.154970.camel@hal.voltaire.com>
Message-ID: <20070715205852.GA30202@sashak.voltaire.com>

On 06:43 Sun 15 Jul     , Hal Rosenstock wrote:
> OpenSM: Change force_link_speed to allow for local policy and more
> flexibility
> 
> Extend (and change) the use of force_link_speed as follows:
> 0 - no change
> 1 - set to SDR
> 15 - set as supported (default)
> (Non zero values are used to set LinkSpeedEnabled component in PortInfo)
> 
> Note that force_link_speed 0 which used to force SDR is now
> force_link_speed 1
> 
> "Ideally", there would be a per port configuration of this.
> 
> [Note this is largely untested.]
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha
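
The value mapping described in the quoted summary can be sketched as
follows. This is only an illustration under stated assumptions: the enum
names and demo_link_speed_enabled() are hypothetical and are not taken
from the applied patch; the sketch just encodes "0 leaves the setting
alone, any non-zero value is written to LinkSpeedEnabled".

```c
#include <assert.h>
#include <stdint.h>

/* Values listed in the patch description (names are made up here). */
enum {
	DEMO_FLS_NO_CHANGE    = 0,
	DEMO_FLS_SDR          = 1,
	DEMO_FLS_AS_SUPPORTED = 15	/* default */
};

/* Hypothetical helper: returns the LinkSpeedEnabled value to place in
 * PortInfo, given the configured force_link_speed and the value the
 * port currently reports. */
static uint8_t demo_link_speed_enabled(uint8_t force_link_speed,
					uint8_t current_enabled)
{
	assert(force_link_speed <= 15);	/* LinkSpeedEnabled is a 4-bit field */

	if (force_link_speed == DEMO_FLS_NO_CHANGE)
		return current_enabled;	/* 0: leave the port as it is */
	return force_link_speed;	/* non-zero: use the value directly */
}
```

Note the behaviour change called out above: a configuration that used to
say force_link_speed 0 to get SDR now has to say force_link_speed 1.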


From sashak at voltaire.com  Sun Jul 15 14:10:52 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 16 Jul 2007 00:10:52 +0300
Subject: [ofa-general] Re: [PATCH] osm: some improvements to fat-tree routing
In-Reply-To: <469A0B70.1020101@dev.mellanox.co.il>
References: <469A0B70.1020101@dev.mellanox.co.il>
Message-ID: <20070715211052.GC30202@sashak.voltaire.com>

On 14:56 Sun 15 Jul     , Yevgeny Kliteynik wrote:
> Hi Sasha
> 
> This patch adds a small improvement to fat-tree routing for
> asymmetrical (or unusual) trees:
> 1. When routing down-going routes (by climbing up the tree), 
>    first selecting the least loaded group, and then least loaded
>    port in the selected group.
> 2. When routing up-going routes (by descending down the tree), 
>    scan groups by indexing order, but the start group is selected
>    by round-robin.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied (some trailing whitespace was stripped by git-am). Thanks.

Sasha
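
The two selection rules in the quoted description can be sketched roughly
as below. This is not the fat-tree code from the patch: the demo_* types
are hypothetical, and measuring a group's load as the sum of its ports'
route counts is an assumption made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PORTS 8

/* Hypothetical port group: the ports that lead to one neighbor switch. */
struct demo_port_group {
	size_t num_ports;
	unsigned port_load[DEMO_MAX_PORTS];	/* routes already assigned */
};

/* Assumption: a group's load is the sum of its ports' loads. */
static unsigned demo_group_load(const struct demo_port_group *g)
{
	unsigned total = 0;
	size_t i;

	for (i = 0; i < g->num_ports; i++)
		total += g->port_load[i];
	return total;
}

/* Rule 1: least loaded group first, then the least loaded port inside
 * that group. Ties go to the lowest index. */
static void demo_pick_port(const struct demo_port_group *groups,
			   size_t num_groups,
			   size_t *group_out, size_t *port_out)
{
	size_t g, p, best_g = 0, best_p = 0;

	assert(groups != NULL && num_groups > 0);
	for (g = 1; g < num_groups; g++)
		if (demo_group_load(&groups[g]) < demo_group_load(&groups[best_g]))
			best_g = g;

	assert(groups[best_g].num_ports > 0);
	for (p = 1; p < groups[best_g].num_ports; p++)
		if (groups[best_g].port_load[p] < groups[best_g].port_load[best_p])
			best_p = p;

	*group_out = best_g;
	*port_out = best_p;
}

/* Rule 2: groups are scanned in index order, but the starting group
 * rotates. Returns the start index and advances the cursor. */
static size_t demo_next_start_group(size_t *cursor, size_t num_groups)
{
	size_t start;

	assert(cursor != NULL && num_groups > 0);
	start = *cursor % num_groups;
	*cursor = (start + 1) % num_groups;
	return start;
}
```

Keeping the round-robin cursor per switch (rather than global) would be
the natural choice, but that detail is not stated in the description.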


From akepner at sgi.com  Sun Jul 15 14:21:46 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Sun, 15 Jul 2007 14:21:46 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
Message-ID: <20070715212146.GF6921@sgi.com>


Here's a first cut at OFED 1.3/Linux 2.6.23 patches to
address the "CQ/DMA race" that's possible on Altix systems
when CQs are allocated in user space.

(A description of this bug appears here:
http://lists.openfabrics.org/pipermail/general/2006-December/030251.html)

I'll post the kernel patch to lkml, but I'd appreciate any
comments from this list before doing that.

Obviously this is only a subset of the required kernel
changes, since every caller of dma_map_sg() would need
to be modified. Comments?

 arch/ia64/sn/pci/pci_dma.c                   |   19 ++++++++++++++-----
 drivers/infiniband/core/umem.c               |    5 +++--
 drivers/infiniband/hw/mthca/mthca_provider.c |   11 ++++++++++-
 drivers/infiniband/hw/mthca/mthca_user.h     |    8 +++++++-
 drivers/infiniband/ulp/srp/ib_srp.c          |    2 +-
 include/asm-generic/dma-mapping.h            |    4 ++--
 include/asm-generic/pci-dma-compat.h         |    2 +-
 include/asm-ia64/machvec.h                   |    2 +-
 include/rdma/ib_umem.h                       |    2 +-
 include/rdma/ib_verbs.h                      |    5 +++--

-- 
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d79ddac..d942390 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -245,7 +245,7 @@ EXPORT_SYMBOL(sn_dma_unmap_sg);
  * Maps each entry of @sg for DMA.
  */
 int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
-		  int direction)
+		  int direction, int coherent)
 {
 	unsigned long phys_addr;
 	struct scatterlist *saved_sg = sg;
@@ -259,12 +259,21 @@ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
 	 * Setup a DMA address for each entry in the scatterlist.
 	 */
 	for (i = 0; i < nhwentries; i++, sg++) {
+		dma_addr_t dma_addr;
 		phys_addr = SG_ENT_PHYS_ADDRESS(sg);
-		sg->dma_address = provider->dma_map(pdev,
-						    phys_addr, sg->length,
-						    SN_DMA_ADDR_PHYS);
 
-		if (!sg->dma_address) {
+		if (coherent) {
+			dma_addr = provider->dma_map_consistent(pdev,
+								phys_addr,
+								sg->length,
+								SN_DMA_ADDR_PHYS);
+		} else {
+			dma_addr = provider->dma_map(pdev,
+						     phys_addr, sg->length,
+						     SN_DMA_ADDR_PHYS);
+		}
+
+		if (!(sg->dma_address = dma_addr)) {
 			printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
 
 			/*
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index d40652a..e9f9f42 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -66,7 +66,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
  * @access: IB_ACCESS_xxx flags for memory being pinned
  */
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access)
+			    size_t size, int access, int coherent)
 {
 	struct ib_umem *umem;
 	struct page **page_list;
@@ -154,7 +154,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 			chunk->nmap = ib_dma_map_sg(context->device,
 						    &chunk->page_list[0],
 						    chunk->nents,
-						    DMA_BIDIRECTIONAL);
+						    DMA_BIDIRECTIONAL,
+						    coherent);
 			if (chunk->nmap <= 0) {
 				for (i = 0; i < chunk->nents; ++i)
 					put_page(chunk->page_list[i].page);
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 6bcde1c..c0cf5f1 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1017,6 +1017,8 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	struct mthca_dev *dev = to_mdev(pd->device);
 	struct ib_umem_chunk *chunk;
 	struct mthca_mr *mr;
+	struct mthca_reg_mr ucmd;
+	int coherent;
 	u64 *pages;
 	int shift, n, len;
 	int i, j, k;
@@ -1027,7 +1029,14 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
-	mr->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+	if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) {
+		err = -EFAULT;
+		goto err;
+	}
+	coherent = (int) ucmd.mr_attrs & MTHCA_MR_COHERENT;
+
+	mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 
+			       coherent);
 	if (IS_ERR(mr->umem)) {
 		err = PTR_ERR(mr->umem);
 		goto err;
diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h
index 02cc0a7..f46773e 100644
--- a/drivers/infiniband/hw/mthca/mthca_user.h
+++ b/drivers/infiniband/hw/mthca/mthca_user.h
@@ -41,7 +41,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define MTHCA_UVERBS_ABI_VERSION	1
+#define MTHCA_UVERBS_ABI_VERSION	2
 
 /*
  * Make sure that all structs defined in this file remain laid out so
@@ -61,6 +61,12 @@ struct mthca_alloc_pd_resp {
 	__u32 reserved;
 };
 
+struct mthca_reg_mr {
+	__u32 mr_attrs;
+#define MTHCA_MR_COHERENT 0x1
+	__u32 reserved;
+};
+
 struct mthca_create_cq {
 	__u32 lkey;
 	__u32 pdn;
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 39bf057..b7a4301 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -699,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target,
 	dev = target->srp_host->dev;
 	ibdev = dev->dev;
 
-	count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction);
+	count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction, 0);
 
 	fmt = SRP_DATA_DESC_DIRECT;
 	len = sizeof (struct srp_cmd) +	sizeof (struct srp_direct_buf);
diff --git a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h
index 783ab99..34e8357 100644
--- a/include/asm-generic/dma-mapping.h
+++ b/include/asm-generic/dma-mapping.h
@@ -89,7 +89,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
 
 static inline int
 dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
-	   enum dma_data_direction direction)
+	   enum dma_data_direction direction, int coherent)
 {
 	BUG_ON(dev->bus != &pci_bus_type);
 
@@ -213,7 +213,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
 
 static inline int
 dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
-	   enum dma_data_direction direction)
+	   enum dma_data_direction direction, int coherent)
 {
 	BUG();
 	return 0;
diff --git a/include/asm-generic/pci-dma-compat.h b/include/asm-generic/pci-dma-compat.h
index 25c10e9..3e85b8e 100644
--- a/include/asm-generic/pci-dma-compat.h
+++ b/include/asm-generic/pci-dma-compat.h
@@ -60,7 +60,7 @@ static inline int
 pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg,
 	   int nents, int direction)
 {
-	return dma_map_sg(hwdev == NULL ? NULL : &hwdev->dev, sg, nents, (enum dma_data_direction)direction);
+	return dma_map_sg(hwdev == NULL ? NULL : &hwdev->dev, sg, nents, (enum dma_data_direction)direction, 0);
 }
 
 static inline void
diff --git a/include/asm-ia64/machvec.h b/include/asm-ia64/machvec.h
index ca33eb1..34e9a58 100644
--- a/include/asm-ia64/machvec.h
+++ b/include/asm-ia64/machvec.h
@@ -46,7 +46,7 @@ typedef void *ia64_mv_dma_alloc_coherent (struct device *, size_t, dma_addr_t *,
 typedef void ia64_mv_dma_free_coherent (struct device *, size_t, void *, dma_addr_t);
 typedef dma_addr_t ia64_mv_dma_map_single (struct device *, void *, size_t, int);
 typedef void ia64_mv_dma_unmap_single (struct device *, dma_addr_t, size_t, int);
-typedef int ia64_mv_dma_map_sg (struct device *, struct scatterlist *, int, int);
+typedef int ia64_mv_dma_map_sg (struct device *, struct scatterlist *, int, int, int);
 typedef void ia64_mv_dma_unmap_sg (struct device *, struct scatterlist *, int, int);
 typedef void ia64_mv_dma_sync_single_for_cpu (struct device *, dma_addr_t, size_t, int);
 typedef void ia64_mv_dma_sync_sg_for_cpu (struct device *, struct scatterlist *, int, int);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index c533d6c..08aeb87 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -61,7 +61,7 @@ struct ib_umem_chunk {
 #ifdef CONFIG_INFINIBAND_USER_MEM
 
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access);
+			    size_t size, int access, int coherent);
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_page_count(struct ib_umem *umem);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0627a6a..d5d3180 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1555,11 +1555,12 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
  */
 static inline int ib_dma_map_sg(struct ib_device *dev,
 				struct scatterlist *sg, int nents,
-				enum dma_data_direction direction)
+				enum dma_data_direction direction, 
+				int coherent)
 {
 	if (dev->dma_ops)
 		return dev->dma_ops->map_sg(dev, sg, nents, direction);
-	return dma_map_sg(dev->dma_device, sg, nents, direction);
+	return dma_map_sg(dev->dma_device, sg, nents, direction, coherent);
 }
 
 /**

-- 
Arthur


From akepner at sgi.com  Sun Jul 15 14:24:45 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Sun, 15 Jul 2007 14:24:45 -0700
Subject: [ofa-general] [RFC 1/1] libmthca: CQ/DMA race on Altix
Message-ID: <20070715212445.GG6921@sgi.com>


The libmthca-specific changes for this RFC follow.

 mthca-abi.h |    9 ++++++++-
 verbs.c     |   22 +++++++++++++---------
 2 files changed, 21 insertions(+), 10 deletions(-)

-- 

diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h
--- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h	2007-06-23 02:00:34.000000000 -0700
+++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h	2007-07-15 12:18:54.505352246 -0700
@@ -36,7 +36,7 @@
 
 #include <infiniband/kern-abi.h>
 
-#define MTHCA_UVERBS_ABI_VERSION	1
+#define MTHCA_UVERBS_ABI_VERSION	2
 
 struct mthca_alloc_ucontext_resp {
 	struct ibv_get_context_resp	ibv_resp;
@@ -50,6 +50,13 @@ struct mthca_alloc_pd_resp {
 	__u32				reserved;
 };
 
+struct mthca_reg_mr {
+	struct ibv_reg_mr		ibv_cmd;
+	__u32				mr_attrs;
+#define MTHCA_MR_COHERENT		0x1
+	__u32				reserved;
+};
+
 struct mthca_create_cq {
 	struct ibv_create_cq		ibv_cmd;
 	__u32				lkey;
diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c
--- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c	2007-06-23 02:00:34.000000000 -0700
+++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c	2007-07-15 13:26:24.371410587 -0700
@@ -117,26 +117,30 @@ int mthca_free_pd(struct ibv_pd *pd)
 
 static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 				     size_t length, uint64_t hca_va,
-				     enum ibv_access_flags access)
+				     enum ibv_access_flags access, 
+				     int coherent)
 {
 	struct ibv_mr *mr;
-	struct ibv_reg_mr cmd;
+	struct mthca_reg_mr cmd;
 	int ret;
 
 	mr = malloc(sizeof *mr);
 	if (!mr)
 		return NULL;
 
+	cmd.mr_attrs = coherent ? MTHCA_MR_COHERENT : 0;
+
 #ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS
 	{
 		struct ibv_reg_mr_resp resp;
 
 		ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-				     &cmd, sizeof cmd, &resp, sizeof resp);
+				     &cmd.ibv_cmd, sizeof cmd, &resp, 
+				     sizeof resp);
 	}
 #else
 	ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-			     &cmd, sizeof cmd);
+			     &cmd.ibv_cmd, sizeof cmd);
 #endif
 	if (ret) {
 		free(mr);
@@ -149,7 +153,7 @@ static struct ibv_mr *__mthca_reg_mr(str
 struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr,
 			    size_t length, enum ibv_access_flags access)
 {
-	return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access);
+	return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0);
 }
 
 int mthca_dereg_mr(struct ibv_mr *mr)
@@ -202,7 +206,7 @@ struct ibv_cq *mthca_create_cq(struct ib
 
 	cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf,
 				cqe * MTHCA_CQ_ENTRY_SIZE,
-				0, IBV_ACCESS_LOCAL_WRITE);
+				0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!cq->mr)
 		goto err_buf;
 
@@ -294,7 +298,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq,
 
 	mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf,
 			    cqe * MTHCA_CQ_ENTRY_SIZE,
-			    0, IBV_ACCESS_LOCAL_WRITE);
+			    0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!mr) {
 		mthca_free_buf(&buf);
 		ret = ENOMEM;
@@ -402,7 +406,7 @@ struct ibv_srq *mthca_create_srq(struct 
 	if (mthca_alloc_srq_buf(pd, &attr->attr, srq))
 		goto err;
 
-	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0);
+	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0);
 	if (!srq->mr)
 		goto err_free;
 
@@ -520,7 +524,7 @@ struct ibv_qp *mthca_create_qp(struct ib
 	    pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE))
 		goto err_free;
 
-	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0);
+	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0);
 	if (!qp->mr)
 		goto err_free;
 
-- 
Arthur
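
The patch above relies on a common layout: the driver-specific command
embeds the generic command as its first member, so a pointer to the
embedded member plus the size of the whole struct describes the extended
command. Below is a minimal sketch of that pattern, not the libibverbs or
libmthca definitions. The names generic_reg_mr, vendor_reg_mr and
build_reg_mr and the field layout are hypothetical stand-ins. The sketch
also zeroes the command first, so the attribute and reserved words never
carry stack garbage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VENDOR_MR_COHERENT 0x1u	/* plays the role of MTHCA_MR_COHERENT */

/* Stand-in for the generic command header (hypothetical layout). */
struct generic_reg_mr {
	uint64_t start;
	uint64_t length;
	uint32_t access_flags;
	uint32_t pd_handle;
};

/* Vendor command: generic part first, private fields appended. */
struct vendor_reg_mr {
	struct generic_reg_mr base;
	uint32_t mr_attrs;
	uint32_t reserved;
};

/* Fill a vendor command and return the size that would be passed on. */
static size_t build_reg_mr(struct vendor_reg_mr *cmd, uint64_t start,
			   uint64_t length, int coherent)
{
	assert(cmd != NULL);
	memset(cmd, 0, sizeof *cmd);	/* no uninitialised bytes */
	cmd->base.start = start;
	cmd->base.length = length;
	cmd->mr_attrs = coherent ? VENDOR_MR_COHERENT : 0;
	return sizeof *cmd;
}
```

Because base is the first member, &cmd->base and cmd point at the same
address, which is what lets the generic layer treat the buffer as its own
command while the driver reads the trailing fields.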


From sweitzen at cisco.com  Sun Jul 15 22:42:45 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Sun, 15 Jul 2007 22:42:45 -0700
Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link
In-Reply-To: <4699FDD6.3010305@mellanox.co.il>
References: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com>
	<4699FDD6.3010305@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCF53E@xmb-sjc-216.amer.cisco.com>

> It will be good if we can unify 1.2.c with 1.2.1 that was 
> requested in 
> the same time frame
> Any thoughts on this?

I am in favor of unifying them.

Scott


From muli at il.ibm.com  Mon Jul 16 00:34:35 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 16 Jul 2007 10:34:35 +0300
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070715212146.GF6921@sgi.com>
References: <20070715212146.GF6921@sgi.com>
Message-ID: <20070716073435.GE3530@rhun.haifa.ibm.com>

On Sun, Jul 15, 2007 at 02:21:46PM -0700, akepner at sgi.com wrote:

> diff --git a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h
> index 783ab99..34e8357 100644
> --- a/include/asm-generic/dma-mapping.h
> +++ b/include/asm-generic/dma-mapping.h
> @@ -89,7 +89,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
>  
>  static inline int
>  dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> -	   enum dma_data_direction direction)
> +	   enum dma_data_direction direction, int coherent)
>  {
>  	BUG_ON(dev->bus != &pci_bus_type);
>  
> @@ -213,7 +213,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
>  
>  static inline int
>  dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> -	   enum dma_data_direction direction)
> +	   enum dma_data_direction direction, int coherent)
>  {
>  	BUG();
>  	return 0;

This will be very painful, and frankly I don't think the pain is
justified. Can't you confine the changes to the IB layer, so that the
mapping happens through dma_alloc_coherent if you need
coherent/consistent memory, rather than through dma_map_sg?

Also, this kind of thing should definitely be CC'd to lkml.

Cheers,
Muli


From ogerlitz at voltaire.com  Mon Jul 16 01:55:34 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 16 Jul 2007 11:55:34 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <20070715094145.GA16231@mellanox.co.il>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<20070715094145.GA16231@mellanox.co.il>
Message-ID: <469B3286.3060902@voltaire.com>

Michael S. Tsirkin wrote:
> Make mad module use a single workqueue rather than a per-port
> workqueue. This way, we'll have less clutter on systems with
> a lot of ports.
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> Thinking about it, why would we *want* a per-port thread?
> What do you guys think about the following?
> As a bonus, this makes it easier to renice the mad thread
> for people that want to do this.

Indeed, today the mad module creates a thread per device port, and the cm 
module creates a thread per cpu, so if the system has 16 cores and two 
HCAs each with two ports, the IB stack would create 20 threads just for 
the sake of the mad and cm modules... I also think it would be good if 
the mad module created one thread, and the cm module as well.

Sean - does it make sense to you to change the CM for that matter?

Below is a listing of the kernel threads on an rh4 u3 / ofed 1.2 
system.

Or.

> root         1     0  0 Jul15 ?        00:00:00 init [5]                             
> root         2     1  0 Jul15 ?        00:00:00 [migration/0]
> root         3     1  0 Jul15 ?        00:00:00 [ksoftirqd/0]
> root         4     1  0 Jul15 ?        00:00:00 [migration/1]
> root         5     1  0 Jul15 ?        00:00:00 [ksoftirqd/1]
> root         6     1  0 Jul15 ?        00:00:00 [migration/2]
> root         7     1  0 Jul15 ?        00:00:00 [ksoftirqd/2]
> root         8     1  0 Jul15 ?        00:00:00 [migration/3]
> root         9     1  0 Jul15 ?        00:00:00 [ksoftirqd/3]
> root        10     1  0 Jul15 ?        00:00:00 [events/0]
> root        11     1  0 Jul15 ?        00:00:00 [events/1]
> root        12     1  0 Jul15 ?        00:00:00 [events/2]
> root        13     1  0 Jul15 ?        00:00:00 [events/3]
> root        14    10  0 Jul15 ?        00:00:00 [khelper]
> root        15    10  0 Jul15 ?        00:00:00 [kacpid]
> root        60    10  0 Jul15 ?        00:00:00 [kblockd/0]
> root        61    10  0 Jul15 ?        00:00:00 [kblockd/1]
> root        62    10  0 Jul15 ?        00:00:00 [kblockd/2]
> root        63    10  0 Jul15 ?        00:00:00 [kblockd/3]
> root        64     1  0 Jul15 ?        00:00:00 [khubd]
> root        73    10  0 Jul15 ?        00:00:00 [pdflush]
> root        74    10  0 Jul15 ?        00:00:00 [pdflush]
> root        76    10  0 Jul15 ?        00:00:00 [aio/0]
> root        77    10  0 Jul15 ?        00:00:00 [aio/1]
> root        78    10  0 Jul15 ?        00:00:00 [aio/2]
> root        79    10  0 Jul15 ?        00:00:00 [aio/3]
> root        75     1  0 Jul15 ?        00:00:00 [kswapd0]
> root       152     1  0 Jul15 ?        00:00:00 [kseriod]
> root       230    12  0 Jul15 ?        00:00:00 [ata/0]
> root       231    12  0 Jul15 ?        00:00:00 [ata/1]
> root       232    12  0 Jul15 ?        00:00:00 [ata/2]
> root       233    12  0 Jul15 ?        00:00:00 [ata/3]
> root       239     1  0 Jul15 ?        00:00:00 [scsi_eh_0]
> root       267     1  0 Jul15 ?        00:00:01 [kjournald]
> root      1372    10  0 Jul15 ?        00:00:00 [ib_mcast]
> root      1377    10  0 Jul15 ?        00:00:00 [ib_cm/0]
> root      1378    10  0 Jul15 ?        00:00:00 [ib_cm/1]
> root      1379    10  0 Jul15 ?        00:00:00 [ib_cm/2]
> root      1380    10  0 Jul15 ?        00:00:00 [ib_cm/3]
> root      1401    13  0 Jul15 ?        00:00:00 [ipoib]
> root      1418    12  0 Jul15 ?        00:00:00 [mthcacatas]
> root      1421    13  0 Jul15 ?        00:00:00 [ib_mad1]
> root      1422    10  0 Jul15 ?        00:00:00 [ib_mad2]
> root      2132    10  0 Jul15 ?        00:00:00 [kauditd]
> root      2209    10  0 Jul15 ?        00:00:00 [kmpathd/0]
> root      2210    10  0 Jul15 ?        00:00:00 [kmpathd/1]
> root      2211    10  0 Jul15 ?        00:00:00 [kmpathd/2]
> root      2212    10  0 Jul15 ?        00:00:00 [kmpathd/3]
> root      2220    10  0 Jul15 ?        00:00:00 [kmirrord]
> root      2221    10  0 Jul15 ?        00:00:00 [kmir_mon]
> root      3330    11  0 Jul15 ?        00:00:00 [local_sa]
> root      3335    13  0 Jul15 ?        00:00:00 [ib_addr_wq]
> root      3340    11  0 Jul15 ?        00:00:00 [iw_cm_wq]
> root      3345    12  0 Jul15 ?        00:00:00 [rdma_cm_wq]
> root      3350    13  0 Jul15 ?        00:00:00 [sdp]
> root      3358    12  0 Jul15 ?        00:00:00 [krdsd]
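
The thread count discussed in this message follows from one CM thread per
CPU plus one MAD thread per HCA port. A small sketch of that arithmetic
(the function name ib_mad_cm_threads is made up for illustration):

```c
#include <assert.h>

/* Threads created by the mad and cm modules under the scheme described
 * in the message: one CM thread per CPU, one MAD thread per HCA port. */
static unsigned ib_mad_cm_threads(unsigned cpus, unsigned hcas,
				  unsigned ports_per_hca)
{
	assert(cpus > 0);
	return cpus + hcas * ports_per_hca;
}
```

With 16 cores and two dual-port HCAs this gives 16 + 2 * 2 = 20. The
four-CPU listing shows four ib_cm/N threads and two ib_mad threads, which
matches the formula if that machine has two HCA ports in total (an
assumption; the message does not state the port count).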


From ogerlitz at voltaire.com  Mon Jul 16 02:16:28 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 16 Jul 2007 12:16:28 +0300
Subject: [ofa-general] missing "balance" in aggregate bi-directional SDP
	bulk	transfer
In-Reply-To: <4696A054.8010102@hp.com>
References: <4696A054.8010102@hp.com>
Message-ID: <469B376C.2070103@voltaire.com>

Rick Jones wrote:
> I configured ib0 and ib1 into separate IP subnets, and ran the 
> "bidirectional TCP_RR" test 

> However, when I run the same test over SDP, some connections seem to get 
> much better performance than others.  For example, with two concurrent 
> connections, one over each port, one will get a much higher result than 
> the other.

Did you make sure that each neighbour was actually pointing to a 
different port? See the excerpt below from the IPoIB release notes (note 
that IPoIB is the ARP provider used by the RDMA CM, which is what SDP 
works with, so this applies to both your IPoIB and SDP tests).

Or.


> from /usr/share/doc/ofed-docs-1.2/ipoib_release_notes.txt
> 
> 3. Known Issues
> ===============================================================================
> 1. If a host has multiple interfaces and (a) each interface belongs to a
>    different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
>    they are connected to the same IB Switch, then the host violates the IP rule
>    requiring different broadcast domains. Consequently, the host may build an
>    incorrect ARP table.
> 
>    The correct setting of a multi-homed IPoIB host is achieved by using a
>    different PKEY for each IP subnet. If a host has multiple interfaces on the
>    same IP subnet, then to prevent a peer from building an incorrect ARP entry
>    (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
>    stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
>    causes the network stack to send ARP replies only on the interface with the
>    IP address specified in the ARP request:
> 
>    sysctl -w net.ipv4.conf.ib0.arp_ignore=1
>    sysctl -w net.ipv4.conf.ib1.arp_ignore=1
> 
>    Or, globally,
> 
>    sysctl -w net.ipv4.conf.all.arp_ignore=1
> 
>    To learn more about the arp_ignore parameter, see Documentation/networking/ip-sysctl.txt.
>    Note that distributions have the means to make kernel parameters persistent.
> 
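
To check from a program whether the setting described in the excerpt is
in effect, one option is to read the file under /proc/sys that backs the
sysctl key. This is a sketch under that assumption. The helper names
arp_ignore_path and read_arp_ignore are hypothetical, and the code only
reads the value, it does not change it.

```c
#include <stdio.h>
#include <string.h>

/* Build the /proc/sys path backing net.ipv4.conf.<ifname>.arp_ignore.
 * Returns 0 on success, -1 if the name is empty, contains '/', or the
 * result does not fit in buf. */
static int arp_ignore_path(const char *ifname, char *buf, size_t len)
{
	int n;

	if (!ifname || !*ifname || strchr(ifname, '/'))
		return -1;
	n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/arp_ignore",
		     ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the current arp_ignore value; returns -1 if it cannot be read. */
static int read_arp_ignore(const char *ifname)
{
	char path[128];
	int value = -1;
	FILE *f;

	if (arp_ignore_path(ifname, path, sizeof path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

A caller could compare read_arp_ignore("ib0") and read_arp_ignore("ib1")
against 1 or 2 before starting a multi-port test.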


From vlad at lists.openfabrics.org  Mon Jul 16 02:44:44 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 16 Jul 2007 02:44:44 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070716-0200 daily build status
Message-ID: <20070716094444.C61C7E6080B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From mst at dev.mellanox.co.il  Mon Jul 16 04:59:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 16 Jul 2007 14:59:11 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469B3286.3060902@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<20070715094145.GA16231@mellanox.co.il>
	<469B3286.3060902@voltaire.com>
Message-ID: <20070716115911.GA3379@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [PATCHv2] IB/mad: fix duplicated kernel thread name
> 
> Michael S. Tsirkin wrote:
> >Make mad module use a single workqueue rather than a per-port
> >workqueue. This way, we'll have less clutter on systems with
> >a lot of ports.
> >Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> >Thinking about it, why would we *want* a per-port thread?
> >What do you guys think about the following?
> >As a bonus, this makes it easier to renice the mad thread
> >for people that want to do this.
> 
> Indeed, today the mad module creates thread per device port, and the cm 
> module creates thread per cpu, so if the system has 16 cores and two 
> hcas each with two ports, the IB stack would create 20 threads just for 
> the sake of the mad and cm modules... I also think it would be good if 
> the mad module would create one thread and the cm module as well.
> 
> Sean - does it make sense to you to change the CM for that matter?
> 
> I brought below listing of the kernel threads on an rh4 u3 / ofed 1.2 
> system.

Per-CPU threads like CM does might make sense since they improve data locality.

-- 
MST


From vlad at dev.mellanox.co.il  Mon Jul 16 05:24:58 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 16 Jul 2007 15:24:58 +0300
Subject: [ofa-general] RFC OFED-1.3 installation
Message-ID: <469B639A.1090804@dev.mellanox.co.il>

Hi,
I am starting to work on the new installation procedure for OFED-1.3.
Please review and comment.

Main changes from OFED-1.2:
- Split ofa_user-1.2.src.rpm into separate source RPMs per package.
   * Requires an RPM spec file for each package.
     Currently, the following packages lack an RPM spec file:
         libehca,
         mstflint,
         qlvnictools,
         perftest,
         sdpnetstat

User space RPM packages list taken from maintainers' RPM spec files:

libibverbs:
     libibverbs
     libibverbs-devel
     libibverbs-devel-static
     libibverbs-utils

libmthca:
     libmthca
     libmthca-devel-static

libehca:
     No RPM spec file

libipathverbs:
     libipathverbs
     libipathverbs-devel

libibcm:
     libibcm
     libibcm-devel

libsdp:
     libsdp
     libsdp-devel should be created

librdmacm:
     librdmacm
     librdmacm-devel
     librdmacm-utils

libcxgb3:
     libcxgb3
     libcxgb3-devel

     Note: libcxgb3 rpmbuild fails:
     cp: cannot stat `ChangeLog': No such file or directory

management:
     libibcommon
     libibcommon-devel
     libibmad
     libibmad-devel
     libibumad
     libibumad-devel
     opensm
     opensm-libs
     opensm-devel
     opensm-static
     infiniband-diags

dapl:
     dapl
     dapl-devel
     dapl-utils

srptools:
     srptools

ibutils:
     ibutils

mpi-selector:
     mpi-selector

- OFED-1.3 build procedure:
   OFED-1.3 daily/rc builds will be created on OFA server:
     userspace and kernel packages will be taken from git trees:
     git.openfabrics.org/ofed_1_3/package.git ofed_1_3

     Source RPMs will be created for each userspace package in the following way:

     git clone ...
     autogen.sh
     configure --disable-libcheck
     make dist
     rpmbuild -bs package.spec

     The following packages will be taken from maintainers as src.rpm:

     mvapich http://www.openfabrics.org/~pasha/ofed_1_3/mvapich,
     mvapich2 http://www.openfabrics.org/~rowland/ofed_1_3,
     openmpi http://www.openfabrics.org/~jsquyres/ofed_1_3,
     mpitests http://www.openfabrics.org/~pasha/ofed_1_3/mpitests,
     rds-tools http://www.openfabrics.org/~vlad/ofed_1_3/rds-tools,
     ib-bonding http://www.openfabrics.org/~monis/ofed_1_3,


- OFED-1.3 Installation
   install.pl script
   Flow:
     make list of packages following selection and dependencies.
     for package in the list:
         build RPM from package.src.rpm
         install package RPM
         go to the next package in the list

     configuration if required


Regards,
Vladimir
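
The "make list of packages following selection and dependencies" step in
the flow above amounts to a dependency-ordered traversal. Below is a
minimal sketch of one way to do it, a depth-first topological sort. It is
not the install.pl implementation. The package table is an illustrative
subset, its dependency edges are assumptions for the example, and
install_order and the other names are hypothetical.

```c
#include <string.h>

#define MAX_PKGS 16
#define MAX_DEPS 4

struct pkg {
	const char *name;
	const char *deps[MAX_DEPS];	/* NULL-terminated prerequisites */
};

/* Illustrative subset; the edges are assumptions for this sketch. */
static const struct pkg pkgs[] = {
	{ "libmthca",   { "libibverbs", NULL } },
	{ "librdmacm",  { "libibverbs", NULL } },
	{ "libibverbs", { NULL } },
	{ "dapl",       { "librdmacm", "libibverbs", NULL } },
};

static int find_pkg(const struct pkg *p, int n, const char *name)
{
	for (int i = 0; i < n; i++)
		if (strcmp(p[i].name, name) == 0)
			return i;
	return -1;
}

/* Emit every dependency before the package itself.
 * state: 0 = unvisited, 1 = in progress, 2 = done.
 * Returns -1 on a dependency cycle or an unknown dependency. */
static int visit(const struct pkg *p, int n, int i, int *state,
		 int *order, int *count)
{
	if (state[i] == 2)
		return 0;
	if (state[i] == 1)
		return -1;
	state[i] = 1;
	for (int d = 0; d < MAX_DEPS && p[i].deps[d]; d++) {
		int j = find_pkg(p, n, p[i].deps[d]);

		if (j < 0 || visit(p, n, j, state, order, count))
			return -1;
	}
	state[i] = 2;
	order[(*count)++] = i;
	return 0;
}

/* Fill order[] with package indices in build/install order.
 * Returns the number of packages, or -1 on error. */
static int install_order(const struct pkg *p, int n, int *order)
{
	int state[MAX_PKGS] = { 0 };
	int count = 0;

	if (n > MAX_PKGS)
		return -1;
	for (int i = 0; i < n; i++)
		if (visit(p, n, i, state, order, &count))
			return -1;
	return count;
}
```

The installer would then walk order[] front to back, building and
installing each package before moving on, so that a library such as
libibverbs is already present when the packages that need it are built.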


From ogerlitz at voltaire.com  Mon Jul 16 05:36:04 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 16 Jul 2007 15:36:04 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <20070716115911.GA3379@mellanox.co.il>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>
	<20070716115911.GA3379@mellanox.co.il>
Message-ID: <469B6634.1050709@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
>> Indeed, today the mad module creates thread per device port, and the cm 
>> module creates thread per cpu, so if the system has 16 cores and two 
>> hcas each with two ports, the IB stack would create 20 threads just for 
>> the sake of the mad and cm modules... I also think it would be good if 
>> the mad module would create one thread and the cm module as well.

>> Sean - does it make sense to you to change the CM for that matter?

> Per-CPU threads like CM does might make sense since they improve data locality.

Sorry, but "improve data locality" is not enough information for me to 
understand why the IB CM --needs-- to spawn n kernel threads on my 
n-core system. After all, it's the slow path and the data does not move on 
QP1. What's the story here? And if it needs thread-per-cpu, why not use 
the system threads/softirqs as does the TCP/IP stack connection mgmt code?

Or.


From mst at dev.mellanox.co.il  Mon Jul 16 05:43:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 16 Jul 2007 15:43:51 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469B6634.1050709@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<20070715094145.GA16231@mellanox.co.il>
	<469B3286.3060902@voltaire.com>
	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com>
Message-ID: <20070716124351.GA23035@mellanox.co.il>

> and if it needs thread-per-cpu, why not use 
> the system threads/softirqs as does the TCP/IP stack connection mgmt code?

softirqs would be very awkward because things like create qp can't be done from
that context. Using system threads might be possible, but one needs to be
careful this might create problems for anyone who wants to e.g. destroy cm id
from a system thread, which needs a flush.

-- 
MST


From ogerlitz at voltaire.com  Mon Jul 16 05:47:59 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 16 Jul 2007 15:47:59 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <20070716124351.GA23035@mellanox.co.il>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>	<20070716115911.GA3379@mellanox.co.il>	<469B6634.1050709@voltaire.com>
	<20070716124351.GA23035@mellanox.co.il>
Message-ID: <469B68FF.4070806@voltaire.com>

Michael S. Tsirkin wrote:
>> and if it needs thread-per-cpu, why not use 
>> the system threads/softirqs as does the TCP/IP stack connection mgmt code?
> 
> softirqs would be very awkward because things like create qp can't be done from
> that context. Using system threads might be possible, but one needs to be
> careful this might create problems for anyone who wants to e.g. destroy cm id
> from a system thread, which needs a flush.

You have decided to move directly to the and-if part; however, sometimes 
it is worth stopping to explain yourself, you know.

Anyway, grep-ing for "flush" in cm.c yields nothing

Or.


From mst at dev.mellanox.co.il  Mon Jul 16 06:08:16 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 16 Jul 2007 16:08:16 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469B68FF.4070806@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<20070715094145.GA16231@mellanox.co.il>
	<469B3286.3060902@voltaire.com>
	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com>
	<20070716124351.GA23035@mellanox.co.il>
	<469B68FF.4070806@voltaire.com>
Message-ID: <20070716130804.GA4454@mellanox.co.il>

> Anyway, grep-ing for "flush" in cm.c yields nothing

wait_for_completion is an implicit flush.

That's one of the reasons why the comment near the callback says:
"Users may not call ib_destroy_cm_id while in the context of this callback".


-- 
MST
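
The constraint described above can be illustrated with a user-space
analogy. This is not the kernel CM code. The obj type and every function
name are hypothetical, and pthreads stand in for the kernel primitives.
The destroy routine waits until no callback is in flight, so calling it
from inside a callback would mean waiting for oneself. The sketch detects
that case and returns EDEADLK instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct obj {
	pthread_mutex_t lock;
	pthread_cond_t idle;
	int in_flight;		/* callbacks currently running */
};

/* Set while the current thread is inside a callback for an object. */
static _Thread_local struct obj *running_cb_for;

static void obj_init(struct obj *o)
{
	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->idle, NULL);
	o->in_flight = 0;
}

/* Wait for all callbacks to finish (the "implicit flush"). */
static int obj_destroy(struct obj *o)
{
	if (running_cb_for == o)
		return EDEADLK;	/* would wait for our own callback */
	pthread_mutex_lock(&o->lock);
	while (o->in_flight > 0)
		pthread_cond_wait(&o->idle, &o->lock);
	pthread_mutex_unlock(&o->lock);
	return 0;
}

/* Run a callback while holding an in-flight count on the object. */
static void obj_run_callback(struct obj *o, void (*cb)(struct obj *))
{
	pthread_mutex_lock(&o->lock);
	o->in_flight++;
	pthread_mutex_unlock(&o->lock);

	running_cb_for = o;
	cb(o);
	running_cb_for = NULL;

	pthread_mutex_lock(&o->lock);
	if (--o->in_flight == 0)
		pthread_cond_broadcast(&o->idle);
	pthread_mutex_unlock(&o->lock);
}

/* Example callback that breaks the rule; records what destroy returned. */
static int last_destroy_rc;

static void cb_tries_destroy(struct obj *o)
{
	last_destroy_rc = obj_destroy(o);
}
```

Without the running_cb_for check, cb_tries_destroy would block forever on
the condition variable, because the in-flight count it is waiting on is
its own.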


From xma at us.ibm.com  Mon Jul 16 07:41:20 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 16 Jul 2007 07:41:20 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070714175425.GA17597@mellanox.co.il>
Message-ID: <OF72F6B9D1.F60C4EEF-ON8725731A.00506757-8825731A.0024BD1C@us.ibm.com>


Michael,

      I would like to try this patch for one adapter/2 ports scalability
performance for IPoIB. Is this patch applicable to OFED-1.2?

Thanks
Shirley

From tziporet at mellanox.co.il  Mon Jul 16 07:50:49 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 16 Jul 2007 17:50:49 +0300
Subject: [ofa-general] Agenda for OFED meeting today
Message-ID: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>

Hi All,

We have our OFED synch meeting today at 9am PST.

Agenda:
1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August.
2. Agree on OFED 1.3 schedule:
	* Feature freeze - Sep 4
	* Alpha release - Sep 10
	* Beta release - Sep 25
	* RC1 - Oct 16
	* RC2 - Oct 30
	* RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
	* RC4 - Nov 20
	* GA release - Nov 30 (or first week of Dec)
3. Review OFED 1.3 features list:
In the last meeting we decided that the schedule is one of the most
important parameters in OFED 1.3.
Thus I divided the features into two categories:
*	"must have" features - features that must be ready for the
	release (marked with *)
*	"optional" features - features that can be included in the
	release if they are ready according to the schedule

Must have general features:
====================
*	Kernel based on 2.6.23 (all new features that will be part of
	this kernel will be included in OFED 1.3)
*	Install:
	*	Break the packages RPMs (work with Novell and Redhat) to
		minimize integration effort into OS distribution
*	Package:
	*	Sources arrangement for the end user (for the labs)
*	New HCAs & RNICs:
	*	ConnectX support
	*	Any other new HW?
*	QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)

Other features (must have marked with *)
==============================
*	libibverbs: New verbs:
	*	Scalable Reliable Connected Transport (with Mellanox ConnectX)*
	*	Reliable Multicast?

ULPs:
*	IPoIB:
	*	Performance improvements (those that will be stable on time)
	*	NAPI - done
*	SDP:
	*	* Keepalive
	*	* AIO
*	uDAPL:
	*	DAT 2.0 support with IB extensions for immediate data, atomics;
	*	Add extensions for new verbs (SRCT,RM)
*	VNIC:
	*	GA quality. Not a technology preview version anymore.
	*	Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet
		gateway) - in GA
*	RDS: RDMA API (using FMRs); GA quality with Oracle 11
*	NFSoRDMA integration - pending we have a maintainer

*	Management:
	*	* Multiple partitions via libibumad
	*	OpenSM
		*	More routing performance improvements - done
		*	Even more speedups - done
		*	Better packaging/installation - done
		*	"Native" daemon mode - done
		*	* Performance management
		*	* Quality of Service manager: Based on IBTA annex
		*	Enhancements for fat tree routing (non pure tree support) - done
		*	More console commands and telnet access to console - done
	*	More diagnostics
		*	ibidsverify.pl: validate LIDs and GUIDs in subnet - done
		*	Updated ibnetdiscover format with link width and speed, and
			GUIDs - done
		*	ibnetdiscover grouping support for new Voltaire chassis - done
		*	diag updates for IB router support - done
		*	iblinkinfo.pl: Support peer port link width and speed validation
			- done
		*	ibdatacounters: Add script and man page for subnet wide data
			counters saquery enhancements - done

*	iWARP:
	*	* Chelsio: Get to GA level
	*	NetEffect: Get the drivers into OFED
Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380


From mst at dev.mellanox.co.il  Mon Jul 16 07:55:33 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 16 Jul 2007 17:55:33 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <OF72F6B9D1.F60C4EEF-ON8725731A.00506757-8825731A.0024BD1C@us.ibm.com>
References: <20070714175425.GA17597@mellanox.co.il>
	<OF72F6B9D1.F60C4EEF-ON8725731A.00506757-8825731A.0024BD1C@us.ibm.com>
Message-ID: <20070716145533.GC4454@mellanox.co.il>

> Quoting Shirley Ma <xma at us.ibm.com>:
> Subject: Re: [ofa-general] Re: Further 2.6.23 merge plans...
> 
> Michael,
> 
> I would like to try this patch for one adapter/2 ports scalability performance
> for IPoIB. Is this patch applicable to OFED-1.2?

Most likely yes.

-- 
MST


From jsquyres at cisco.com  Mon Jul 16 07:57:26 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 16 Jul 2007 10:57:26 -0400
Subject: [ofa-general] Agenda for OFED meeting today
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
Message-ID: <1A09652C-4BEC-4FAE-A9F1-E5937258235E@cisco.com>

Reminder for all -- here is the dial-in information for the meeting:

Code: 2102061

US/Canada:  +1.866.432.9903
India:      +91.80.4103.3979
Israel:     +972.9.892.7026
Others:     http://cisco.com/en/US/about/doing_business/conferencing/

The Outlook invitation expires today; I'll make a new one after the  
meeting starting 2 weeks from today.


On Jul 16, 2007, at 10:50 AM, Tziporet Koren wrote:

> Hi All,
>
> We have our OFED synch meeting today at 9am PST.
>
> Agenda:
> 1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August.
> 2. Agree on OFED 1.3 schedule:
>         * Feature freeze - Sep 4
>         * Alpha release - Sep 10
>         * Beta release - Sep 25
>         * RC1 - Oct 16
>         * RC2 - Oct 30
>         * RC3 - Nov 8 (assuming many of us are at SC07 on the week  
> of Nov 11)
>         * RC4 - Nov 20
>         * GA release - Nov 30 (or first week of Dec)
> 3. Review OFED 1.3 features list:
> In last meeting we decided that the schedule is one of the most  
> important parameters in OFED 1.3.
> Thus I divided the features for two categories:
>
> "must have" features - features that must be ready for the release  
> (marked with *)
> "optional" features - features that can be included in the release  
> in case they are ready according to the schedule
>
> Must have general features:
> ====================
>
> Kernel base on 2.6.23 (all new features that will be part of this  
> kernel will be included in OFED 1.3)
> Install:
> Break the packages RPMs (work with Novell and Redhat) to minimize  
> integration effort into OS distribution
> Package:
> Sources arrangement for the end user (for the labs)
> New HCAs & RNICs:
> ConnectX support
> Any other new HW?
> QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)
>
> Other features (must have marked with *)
> ==============================
>
> libibverbs: New verbs:
> Scalable Reliable Connected Transport (with Mellanox ConnectX)*
> Reliable Multicast?
>
> ULPs:
>
> IPoIB:
> Performance improvements (those that will be stable on time)
> NAPI - done
> SDP:
> * Keepalive
> * AIO
> uDAPL:
> DAT 2.0 support with IB extensions for immediate data, atomics;
> Add extensions for new verbs (SRCT,RM)
> VNIC:
> GA quality. Not a technology preview version anymore.
> Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet  
> gateway) - in GA
> RDS: RDMA API (using FMRs); GA quality with Oracle 11
> NFSoRDMA integration - pending we have a maintainer
>
> Management:
> * Multiple partitions via libibumad
> OpenSM
> More routing performance improvements - done
> Even more speedups - done
> Better packaging/installation - done
> “Native” daemon mode - done
> * Performance management
> * Quality of Service manager: Based on IBTA annex
> Enhancements for fat tree routing (non pure tree support) - done
> More console commands and telnet access to console - done
> More diagnostics
> ibidsverify.pl: validate LIDs and GUIDs in subnet - done
> Updated ibnetdiscover format with link width and speed, and GUIDs -  
> done
> ibnetdiscover grouping support for new Voltaire chassis - done
> diag updates for IB router support - done
> iblinkinfo.pl: Support peer port link width and speed validation -  
> done
> ibdatacounters: Add script and man page for subnet wide data  
> counters saquery enhancements - done
>
> iWARP:
> * Chelsio: Get to GA level
> NetEffect: Get the drivers into OFED
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380


-- 
Jeff Squyres
Cisco Systems


From rdreier at cisco.com  Mon Jul 16 08:25:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 08:25:13 -0700
Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support
In-Reply-To: <1184579123437-git-send-email-jens.axboe@oracle.com> (Jens
	Axboe's message of "Mon, 16 Jul 2007 11:45:17 +0200")
References: <11845791213043-git-send-email-jens.axboe@oracle.com>
	<1184579123437-git-send-email-jens.axboe@oracle.com>
Message-ID: <adar6n8z7za.fsf@cisco.com>

[adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland]

Cc: rolandd at cisco.com
Signed-off-by: Jens Axboe <jens.axboe at oracle.com>
---
 drivers/infiniband/hw/ipath/ipath_dma.c   |    9 ++--
 drivers/infiniband/ulp/iser/iser_memory.c |   75 +++++++++++++++-------------
 2 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_dma.c b/drivers/infiniband/hw/ipath/ipath_dma.c
index f87f003..62c87e6 100644
--- a/drivers/infiniband/hw/ipath/ipath_dma.c
+++ b/drivers/infiniband/hw/ipath/ipath_dma.c
@@ -96,17 +96,18 @@ static void ipath_dma_unmap_page(struct ib_device *dev,
 	BUG_ON(!valid_dma_direction(direction));
 }
 
-static int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
-			enum dma_data_direction direction)
+static int ipath_map_sg(struct ib_device *dev, struct scatterlist *sgl,
+			int nents, enum dma_data_direction direction)
 {
+	struct scatterlist *sg;
 	u64 addr;
 	int i;
 	int ret = nents;
 
 	BUG_ON(!valid_dma_direction(direction));
 
-	for (i = 0; i < nents; i++) {
-		addr = (u64) page_address(sg[i].page);
+	for_each_sg(sgl, sg, nents, i) {
+		addr = (u64) page_address(sg->page);
 		/* TODO: handle highmem pages */
 		if (!addr) {
 			ret = 0;
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index fc9f1fd..ff0c701 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -37,7 +37,6 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <asm/io.h>
-#include <asm/scatterlist.h>
 #include <linux/scatterlist.h>
 
 #include "iscsi_iser.h"
@@ -126,17 +125,19 @@ int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task  *iser_ctask,
 
 	if (cmd_dir == ISER_DIR_OUT) {
 		/* copy the unaligned sg the buffer which is used for RDMA */
-		struct scatterlist *sg = (struct scatterlist *)data->buf;
+		struct scatterlist *sgl = (struct scatterlist *)data->buf;
+		struct scatterlist *sg;
 		int i;
 		char *p, *from;
 
-		for (p = mem, i = 0; i < data->size; i++) {
-			from = kmap_atomic(sg[i].page, KM_USER0);
+		p = mem;
+		for_each_sg(sgl, sg, data->size, i) {
+			from = kmap_atomic(sg->page, KM_USER0);
 			memcpy(p,
-			       from + sg[i].offset,
-			       sg[i].length);
+			       from + sg->offset,
+			       sg->length);
 			kunmap_atomic(from, KM_USER0);
-			p += sg[i].length;
+			p += sg->length;
 		}
 	}
 
@@ -178,7 +179,7 @@ void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask,
 
 	if (cmd_dir == ISER_DIR_IN) {
 		char *mem;
-		struct scatterlist *sg;
+		struct scatterlist *sgl, *sg;
 		unsigned char *p, *to;
 		unsigned int sg_size;
 		int i;
@@ -186,16 +187,17 @@ void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask,
 		/* copy back read RDMA to unaligned sg */
 		mem	= mem_copy->copy_buf;
 
-		sg	= (struct scatterlist *)iser_ctask->data[ISER_DIR_IN].buf;
+		sgl	= (struct scatterlist *)iser_ctask->data[ISER_DIR_IN].buf;
 		sg_size = iser_ctask->data[ISER_DIR_IN].size;
 
-		for (p = mem, i = 0; i < sg_size; i++){
-			to = kmap_atomic(sg[i].page, KM_SOFTIRQ0);
-			memcpy(to + sg[i].offset,
+		p = mem;
+		for_each_sg(sgl, sg, sg_size, i) {
+			to = kmap_atomic(sg->page, KM_SOFTIRQ0);
+			memcpy(to + sg->offset,
 			       p,
-			       sg[i].length);
+			       sg->length);
 			kunmap_atomic(to, KM_SOFTIRQ0);
-			p += sg[i].length;
+			p += sg->length;
 		}
 	}
 
@@ -226,7 +228,8 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data,
 			       struct iser_page_vec *page_vec,
 			       struct ib_device *ibdev)
 {
-	struct scatterlist *sg = (struct scatterlist *)data->buf;
+	struct scatterlist *sgl = (struct scatterlist *)data->buf;
+	struct scatterlist *sg;
 	u64 first_addr, last_addr, page;
 	int end_aligned;
 	unsigned int cur_page = 0;
@@ -234,14 +237,14 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data,
 	int i;
 
 	/* compute the offset of first element */
-	page_vec->offset = (u64) sg[0].offset & ~MASK_4K;
+	page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;
 
-	for (i = 0; i < data->dma_nents; i++) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, &sg[i]);
+	for_each_sg(sgl, sg, data->dma_nents, i) {
+		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
 
 		total_sz += dma_len;
 
-		first_addr = ib_sg_dma_address(ibdev, &sg[i]);
+		first_addr = ib_sg_dma_address(ibdev, sg);
 		last_addr  = first_addr + dma_len;
 
 		end_aligned   = !(last_addr  & ~MASK_4K);
@@ -249,9 +252,10 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data,
 		/* continue to collect page fragments till aligned or SG ends */
 		while (!end_aligned && (i + 1 < data->dma_nents)) {
 			i++;
+			sg = sg_next(sg);
-			dma_len = ib_sg_dma_len(ibdev, &sg[i]);
+			dma_len = ib_sg_dma_len(ibdev, sg);
 			total_sz += dma_len;
-			last_addr = ib_sg_dma_address(ibdev, &sg[i]) + dma_len;
+			last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
 			end_aligned = !(last_addr  & ~MASK_4K);
 		}
 
@@ -286,25 +289,26 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data,
 static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data,
 					      struct ib_device *ibdev)
 {
-	struct scatterlist *sg;
+	struct scatterlist *sgl, *sg;
 	u64 end_addr, next_addr;
 	int i, cnt;
 	unsigned int ret_len = 0;
 
-	sg = (struct scatterlist *)data->buf;
+	sgl = (struct scatterlist *)data->buf;
 
-	for (cnt = 0, i = 0; i < data->dma_nents; i++, cnt++) {
+	cnt = 0;
+	for_each_sg(sgl, sg, data->dma_nents, i) {
 		/* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX "
 		   "offset: %ld sz: %ld\n", i,
-		   (unsigned long)page_to_phys(sg[i].page),
-		   (unsigned long)sg[i].offset,
-		   (unsigned long)sg[i].length); */
-		end_addr = ib_sg_dma_address(ibdev, &sg[i]) +
-			   ib_sg_dma_len(ibdev, &sg[i]);
+		   (unsigned long)page_to_phys(sg->page),
+		   (unsigned long)sg->offset,
+		   (unsigned long)sg->length); */
+		end_addr = ib_sg_dma_address(ibdev, sg) +
+			   ib_sg_dma_len(ibdev, sg);
 		/* iser_dbg("Checking sg iobuf end address "
 		       "0x%08lX\n", end_addr); */
 		if (i + 1 < data->dma_nents) {
-			next_addr = ib_sg_dma_address(ibdev, &sg[i+1]);
+			next_addr = ib_sg_dma_address(ibdev, sg_next(sg));
 			/* are i, i+1 fragments of the same page? */
 			if (end_addr == next_addr)
 				continue;
@@ -324,15 +328,16 @@ static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data,
 static void iser_data_buf_dump(struct iser_data_buf *data,
 			       struct ib_device *ibdev)
 {
-	struct scatterlist *sg = (struct scatterlist *)data->buf;
+	struct scatterlist *sgl = (struct scatterlist *)data->buf;
+	struct scatterlist *sg;
 	int i;
 
-	for (i = 0; i < data->dma_nents; i++)
+	for_each_sg(sgl, sg, data->dma_nents, i)
 		iser_err("sg[%d] dma_addr:0x%lX page:0x%p "
 			 "off:0x%x sz:0x%x dma_len:0x%x\n",
-			 i, (unsigned long)ib_sg_dma_address(ibdev, &sg[i]),
-			 sg[i].page, sg[i].offset,
-			 sg[i].length, ib_sg_dma_len(ibdev, &sg[i]));
+			 i, (unsigned long)ib_sg_dma_address(ibdev, sg),
+			 sg->page, sg->offset,
+			 sg->length, ib_sg_dma_len(ibdev, sg));
 }
 
 static void iser_dump_page_vec(struct iser_page_vec *page_vec)
-- 
1.5.3.rc0.90.gbaa79
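
For readers following the conversion above, here is a minimal userspace
sketch of why indexing with sg[i] stops working once lists can be
chained, and what a for_each_sg()-style walk does instead.  The
sketch_* names and the struct layout are made up for illustration and
are not the kernel's struct scatterlist; only the idea (step through a
helper that follows chain links) is taken from the patch.  One
consequence worth keeping in mind: a loop body that bumps the index by
hand also has to step the cursor with the "next" helper, or the two
drift apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scatterlist (not the kernel layout). */
struct sketch_sg {
	unsigned int length;      /* bytes described by this entry */
	int is_last;              /* set on the final data entry */
	struct sketch_sg *chain;  /* non-NULL: this slot links to another array */
};

/* Step to the next data entry, following a chain link if the next slot is one. */
static struct sketch_sg *sketch_sg_next(struct sketch_sg *sg)
{
	if (sg->is_last)
		return NULL;
	sg++;
	if (sg->chain)
		sg = sg->chain;
	return sg;
}

/* Walk nents data entries starting at sgl; sg is the cursor, i the count. */
#define sketch_for_each_sg(sgl, sg, nents, i) \
	for ((i) = 0, (sg) = (sgl); (i) < (nents); \
	     (i)++, (sg) = sketch_sg_next(sg))

/* Sum the lengths of all data entries, crossing chain links as needed. */
static unsigned int sketch_total_len(struct sketch_sg *sgl, int nents)
{
	struct sketch_sg *sg;
	unsigned int total = 0;
	int i;

	sketch_for_each_sg(sgl, sg, nents, i) {
		assert(sg != NULL);	/* nents must not exceed the real count */
		total += sg->length;
	}
	return total;
}
```

With two arrays linked through a chain slot, sgl[2] would be the link
entry rather than data, which is exactly the case the cursor-based walk
handles and plain indexing does not.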


From swise at opengridcomputing.com  Mon Jul 16 08:56:14 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 16 Jul 2007 10:56:14 -0500
Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status
In-Reply-To: <20070715094536.2109FE603CA@openfabrics.org>
References: <20070715094536.2109FE603CA@openfabrics.org>
Message-ID: <469B951E.4070006@opengridcomputing.com>

What is the status of fixing the build breaks?  It appears that these 
breaks cause the weekly ofed kits to be broken as well.  I'm trying to 
provide customers with the latest cxgb3 fixes and this makes it difficult.

Steve.


Vladimir Sokolovsky wrote:
> This email was generated automatically, please do not reply
> 
> 
> git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
> git_branch: ofed_1_2_c
> 
> Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod
> 
> Passed:
> Passed on i686 with 2.6.15-23-server
> Passed on i686 with linux-2.6.21.1
> Passed on i686 with linux-2.6.18
> Passed on i686 with linux-2.6.13
> Passed on i686 with linux-2.6.17
> Passed on i686 with linux-2.6.15
> Passed on i686 with linux-2.6.19
> Passed on i686 with linux-2.6.14
> Passed on i686 with linux-2.6.16
> Passed on i686 with linux-2.6.12
> Passed on powerpc with linux-2.6.18
> Passed on ia64 with linux-2.6.19
> Passed on x86_64 with linux-2.6.19
> Passed on ia64 with linux-2.6.18
> Passed on x86_64 with linux-2.6.21.1
> Passed on powerpc with linux-2.6.17
> Passed on x86_64 with linux-2.6.18
> Passed on ppc64 with linux-2.6.18
> Passed on x86_64 with linux-2.6.15
> Passed on x86_64 with linux-2.6.12
> Passed on x86_64 with linux-2.6.20
> Passed on powerpc with linux-2.6.19
> Passed on x86_64 with linux-2.6.16
> Passed on x86_64 with linux-2.6.5-7.244-smp
> Passed on x86_64 with linux-2.6.13
> Passed on ia64 with linux-2.6.15
> Passed on ia64 with linux-2.6.12
> Passed on ia64 with linux-2.6.13
> Passed on x86_64 with linux-2.6.14
> Passed on ppc64 with linux-2.6.12
> Passed on ppc64 with linux-2.6.15
> Passed on x86_64 with linux-2.6.17
> Passed on powerpc with linux-2.6.13
> Passed on ia64 with linux-2.6.16
> Passed on ppc64 with linux-2.6.13
> Passed on powerpc with linux-2.6.16
> Passed on powerpc with linux-2.6.14
> Passed on ia64 with linux-2.6.21.1
> Passed on ppc64 with linux-2.6.19
> Passed on ppc64 with linux-2.6.14
> Passed on ia64 with linux-2.6.14
> Passed on ia64 with linux-2.6.17
> Passed on ppc64 with linux-2.6.16
> Passed on powerpc with linux-2.6.15
> Passed on powerpc with linux-2.6.12
> Passed on ppc64 with linux-2.6.17
> Passed on x86_64 with linux-2.6.16.43-0.3-smp
> Passed on x86_64 with linux-2.6.16.21-0.8-smp
> Passed on ppc64 with linux-2.6.18-8.el5
> Passed on x86_64 with linux-2.6.9-55.ELsmp
> Passed on x86_64 with linux-2.6.9-34.ELsmp
> Passed on x86_64 with linux-2.6.18-8.el5
> Passed on x86_64 with linux-2.6.9-42.ELsmp
> Passed on x86_64 with linux-2.6.9-22.ELsmp
> Passed on ia64 with linux-2.6.16.21-0.8-default
> Passed on x86_64 with linux-2.6.18-1.2798.fc6
> 
> Failed:
> Build failed on i686 with linux-2.6.22-rc7


From swise at opengridcomputing.com  Mon Jul 16 09:01:44 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 16 Jul 2007 11:01:44 -0500
Subject: [ofa-general] Agenda for OFED meeting today
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
Message-ID: <469B9668.6020007@opengridcomputing.com>

Tziporet Koren wrote:

> 
> Must have general features:
> ====================
> 
>     * Kernel based on 2.6.23 (all new features that will be part of this
>       kernel will be included in OFED 1.3)

Note that the cxgb3 drivers are in kernel.org 2.6.23.  They weren't in 
the upstream kernel when we started ofed_1_2 so they were added directly 
into that git tree.  So for ofed_1_3, the cxgb3 drivers should be taken 
directly from 2.6.23 (and discard the ofed_1_2 cxgb3 drivers).

I don't know who will create the initial ofed_1_2 tree, but this will 
require some git surgery I think.

Steve


From rdreier at cisco.com  Mon Jul 16 09:04:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:04:26 -0700
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <200707121746.36763.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 12 Jul 2007 17:46:35 +0200")
References: <200707121745.27592.fenkes@de.ibm.com>
	<200707121746.36763.fenkes@de.ibm.com>
Message-ID: <adalkdgz65x.fsf@cisco.com>

 > The eHCA driver can now handle multiple event queues (read: interrupt
 > sources) instead of one. The number of available EQs is selected via the
 > nr_eqs module parameter.

 > CQs are either assigned to the EQs based on the comp_vector index or, if the
 > dist_eqs module parameter is supplied, using a round-robin scheme.

Do you have any data on how well this round-robin assignment works?
It seems not quite right to me for the driver to advertise nr_eqs
completion vectors, but then if round-robin is turned on to ignore the
consumer's decision about which vector to use.

Maybe if round-robin is turned on you should report 0 as the number of
completion vectors?  Or maybe we should allow well-known values for
the completion vector passed to ib_create_cq to allow consumers to
specify a policy (like round robin) instead of a particular vector?
Maybe the whole interface is broken and we should only be exposing
policies to consumers instead of the specific vector?

I think I would rather hold off on multiple EQs for this merge window
and plan on having something really solid and thought-out for 2.6.24.

 - R.
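
To make the "well-known values" idea concrete, here is a rough
userspace sketch, not a proposal for the actual verbs interface.  The
constant SKETCH_CQ_VECTOR_ROUND_ROBIN, the struct and the function are
all hypothetical names invented for this example; nothing like them
exists in ib_create_cq() today.  The point is only that an explicit
vector is honoured as given, while a reserved value asks the driver to
apply a policy.

```c
#include <assert.h>

/* Hypothetical reserved comp_vector value meaning "driver picks, round robin". */
#define SKETCH_CQ_VECTOR_ROUND_ROBIN	(-1)
/* Hypothetical error return for an out-of-range vector. */
#define SKETCH_EINVAL			(-22)

struct sketch_dev {
	int nr_eqs;	/* number of event queues the device advertises */
	int next_eq;	/* round-robin cursor */
};

/*
 * Map the consumer's comp_vector to an EQ index.  An explicit vector is
 * used unchanged; the reserved policy value hands out EQs in turn.
 */
static int sketch_pick_eq(struct sketch_dev *dev, int comp_vector)
{
	int eq;

	assert(dev->nr_eqs > 0);

	if (comp_vector == SKETCH_CQ_VECTOR_ROUND_ROBIN) {
		eq = dev->next_eq;
		dev->next_eq = (dev->next_eq + 1) % dev->nr_eqs;
		return eq;
	}
	if (comp_vector < 0 || comp_vector >= dev->nr_eqs)
		return SKETCH_EINVAL;
	return comp_vector;
}
```

With this shape the driver never overrides a vector the consumer asked
for explicitly, which is the inconsistency raised above.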


From swise at opengridcomputing.com  Mon Jul 16 09:04:46 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 16 Jul 2007 11:04:46 -0500
Subject: [ofa-general] Agenda for OFED meeting today
In-Reply-To: <469B9668.6020007@opengridcomputing.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
	<469B9668.6020007@opengridcomputing.com>
Message-ID: <469B971E.3040604@opengridcomputing.com>

Steve Wise wrote:
> Tziporet Koren wrote:
> 
>>
>> Must have general features:
>> ====================
>>
>>     * Kernel based on 2.6.23 (all new features that will be part of this
>>       kernel will be included in OFED 1.3)
> 
> Note that the cxgb3 drivers are in kernel.org 2.6.23.  They weren't in 
> the upstream kernel when we started ofed_1_2 so they were added directly 
> into that git tree.  So for ofed_1_3, the cxgb3 drivers should be taken 
> directly from 2.6.23 (and discard the ofed_1_2 cxgb3 drivers).
> 
> I don't know who will create the initial ofed_1_2 tree, but this will 
                                            ^^^^^^^^ I meant ofed_1_3

> require some git surgery I think.
> 
> Steve


From tziporet at dev.mellanox.co.il  Mon Jul 16 09:06:57 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 16 Jul 2007 19:06:57 +0300
Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status
In-Reply-To: <469B951E.4070006@opengridcomputing.com>
References: <20070715094536.2109FE603CA@openfabrics.org>
	<469B951E.4070006@opengridcomputing.com>
Message-ID: <469B97A1.8080306@mellanox.co.il>

Steve Wise wrote:
> What is the status of fixing the build breaks?  It appears that these 
> breaks cause the weekly ofed kits to be broken as well.  I'm trying to 
> provide customers with the latest cxgb3 fixes and this makes it 
> difficult.
>

I think Jim just fixed the SDP issues; SDP was the ULP that broke the build.

Tziporet


From mshefty at ichips.intel.com  Mon Jul 16 09:22:50 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 16 Jul 2007 09:22:50 -0700
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469B6634.1050709@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com>
Message-ID: <469B9B5A.2040707@ichips.intel.com>

> Sorry, but "improve data locality" is not enough information for me to 
> understand why the IB CM --needs-- to spawn n kernel threads on my 
> n-core system. After all, it's the slow path and the data does not move 
> on QP1, so what's the story here? And if it needs a thread per CPU, why 
> not use the system threads/softirqs, as the TCP/IP stack connection 
> management code does?

IMO, if we're going to have multiple cores, then we should create 
multiple threads to use them.  This becomes more important as the number 
of cores increases.  (The overhead of a non-running thread can't be that 
much.)  Stating that connection establishment is a slow path operation 
assumes that all connections are long lived.

The current behavior of the MAD layer is that all callbacks for a given 
registration are serialized.  We either need to preserve this 
functionality or verify that MAD users can handle simultaneous 
callbacks.  (Hopefully MAD users didn't make any assumptions regarding 
the threading model used by the MAD layer, but we need to verify this. 
I'm more worried about code in the MAD layer itself.)

- Sean


From rdreier at cisco.com  Mon Jul 16 09:32:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:32:52 -0700
Subject: [ofa-general] [PATCH 2.6.23] iw_cxgb3: remove the cm_id reference
	on listen failures.
References: <20070711180435.11665.71117.stgit@dell3.ogc.int>
Message-ID: <aday7hgxqa3.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Mon Jul 16 09:37:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:37:52 -0700
Subject: [ofa-general] Re: [PATCH 1 of 2]  mlx4: implement query-qp
References: <200706211227.47794.jackm@dev.mellanox.co.il>
	<ada1wfd5jr8.fsf@cisco.com>
	<200707151028.24013.jackm@dev.mellanox.co.il>
Message-ID: <adasl7oxq1r.fsf@cisco.com>

this was a patch to a patch, which is not very useful (especially
since the original patch is upstream in Linus's tree).

anyway I applied this as two patches...


From rdreier at cisco.com  Mon Jul 16 09:42:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:42:53 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<20070713054711.GA21709@mellanox.co.il> <adar6nc2mt4.fsf@cisco.com>
	<20070714175425.GA17597@mellanox.co.il>
Message-ID: <adabqecxpte.fsf@cisco.com>

 > > I haven't done any work on it or seen anything from anyone else, so I
 > > expect this will have to wait for 2.6.24.

 > I'm surprised to hear this. How about this:
 > http://lists.openfabrics.org/pipermail/general/2007-May/035757.html

Sure, I remember that.  But I haven't seen anything to suggest that
anyone has given any further thought to the issues that were raised in
that thread.

 - R.


From rdreier at cisco.com  Mon Jul 16 09:42:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:42:52 -0700
Subject: [ofa-general] Re: [PATCH] mlx4/IB: Take sizeof the correct pointer
	when calling to memset
References: <200707151500.09578.dotanb@dev.mellanox.co.il>
Message-ID: <adamyxwxptf.fsf@cisco.com>

thanks, applied.

unfortunately I copied the buggy mthca code before the fix was merged there...


From rdreier at cisco.com  Mon Jul 16 09:42:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:42:53 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<469A1293.6020902@mellanox.co.il>
Message-ID: <adahco4xpte.fsf@cisco.com>

 > Till when can we insert mlx4 with FMRs?

2.6.22 came out on July 8, so I would expect 2.6.23-rc1 (the end of
the merge window) to be July 22.


From rdreier at cisco.com  Mon Jul 16 09:47:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:47:53 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
References: <20070715212146.GF6921@sgi.com>
Message-ID: <adazm1wwb0m.fsf@cisco.com>

 >  dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 > -	   enum dma_data_direction direction)
 > +	   enum dma_data_direction direction, int coherent)

"coherent" seems like the wrong name here... really the property being
asked for is "flush other in-flight DMAs" or something like that (I
don't know precisely what setting the magic bit in the DMA address
does on Altix).

Also maybe it would make more sense to fold this into the existing
direction parameter somehow, so that most of the kernel can stay
unchanged (because as far as I know, Altix is the only platform that
has this extra quirk of allowing DMAs to pass each other).

 - R.
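
One way to picture folding the request into the direction argument is
sketched below in plain userspace C.  This is only an illustration
under assumptions: the sketch_* enum copies the usual four direction
values, and SKETCH_DMA_FLUSH_OTHERS is a hypothetical flag bit that
does not exist in the kernel.  Callers that never set the bit would
pass the same values they pass today.

```c
#include <assert.h>

/* Local copy of the four DMA direction values, for illustration only. */
enum sketch_dma_dir {
	SKETCH_DMA_BIDIRECTIONAL = 0,
	SKETCH_DMA_TO_DEVICE     = 1,
	SKETCH_DMA_FROM_DEVICE   = 2,
	SKETCH_DMA_NONE          = 3,
};

#define SKETCH_DMA_DIR_MASK	0x3
/* Hypothetical flag: DMAs to this mapping flush other in-flight DMAs. */
#define SKETCH_DMA_FLUSH_OTHERS	0x4

/* Extract the plain direction, ignoring any flag bits. */
static enum sketch_dma_dir sketch_dma_dir_of(int direction)
{
	return (enum sketch_dma_dir)(direction & SKETCH_DMA_DIR_MASK);
}

/* Nonzero if the caller asked for the flush behaviour. */
static int sketch_dma_wants_flush(int direction)
{
	return (direction & SKETCH_DMA_FLUSH_OTHERS) != 0;
}
```

A platform without the quirk would simply mask the flag off and carry
on, so only the Altix mapping code would need to look at it.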


From rdreier at cisco.com  Mon Jul 16 09:47:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:47:52 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
References: <OF419657E7.3098466C-ON87257317.0067BBE6-88257317.0067EA65@us.ibm.com>
Message-ID: <ada644kxpl3.fsf@cisco.com>

 >         FYI, we are working on several IPoIB performance improvement 
 > patches which are not on the list. Some of the patches are under test, 
 > some of the patches are going to be submitted soon. They are:

There is less than a week left in the merge window, and none of these
changes has been reviewed yet.  So being realistic, I don't think we
can expect to get any of this into 2.6.23.

 - R.


From rdreier at cisco.com  Mon Jul 16 09:52:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:52:53 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com>
Message-ID: <adaodicwasa.fsf@cisco.com>

 > This will be very painful and frankly I don't think the pain is
 > justified. Can't you confine the changes to the IB layer so that the
 > mapping happens through dma_alloc_coherent if you need
 > coherent/consistent memory rather than through dma_map_sg?

The memory being dealt with here is buffers that are only used by the
device and userspace.  And the problem being solved is not really that
the memory needs to be coherent -- it is just that on Altix, using
coherent memory turns on another side effect that DMAs to that memory
flush other in-flight DMAs to other memory.

So there are several reasons I don't like using dma_alloc_coherent()
to allocate this memory, and then mapping it into userspace (rather
than having userspace allocate it and then map it to the device, as
these patches do):

 - dma_alloc_coherent() has to allocate kernel address space for
   memory, and in this case the kernel will never touch the memory.
   So this is pure waste, and on 32-bit systems, these allocations
   could easily fail since kernel address space is scarce.

 - The property being asked for is not really coherent memory but
   rather "set the magic bit in the bus address so the Altix chipset
   flushes other DMAs", and I think it would be cleaner to ask for
   that explicitly rather than relying on the side effect of coherent
   memory.

 > Also, this kind of thing should definitely be CC'd to lkml.

I agree on that.

 - R.


From rdreier at cisco.com  Mon Jul 16 09:52:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:52:52 -0700
Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
Message-ID: <adatzs4wasb.fsf@cisco.com>

 > The mad module creates a thread per active port, where the thread name is
 > derived from the port name. This causes different threads to have the same
 > name when there are multiple devices. Fix that by using both the device
 > and the port numbers to derive the name.

What problem does the duplicate name cause in the first place?

 - R.


From rdreier at cisco.com  Mon Jul 16 09:57:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 09:57:52 -0700
Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
References: <20070715212445.GG6921@sgi.com>
Message-ID: <adair8kwajz.fsf@cisco.com>

Looks reasonable but I would prefer to see explicit tests of the abi
version so that we use the old register MR ABI for old kernels rather
than unconditionally passing the extra parameter.
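
Roughly the kind of test meant here, as a userspace sketch.  The
struct, the field names and the cutoff SKETCH_ABI_HAS_MR_ATTRS are
hypothetical and are not the real libmthca or uverbs definitions; the
only point is that the extra parameter is filled in when the kernel's
ABI version is new enough and left out otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical first ABI version that understands the extra parameter. */
#define SKETCH_ABI_HAS_MR_ATTRS	2

/* Hypothetical register-MR command, not the real uverbs layout. */
struct sketch_reg_mr_cmd {
	uint64_t start;
	uint64_t length;
	uint32_t access;
	int      has_attrs;	/* 1 if attrs is to be sent to the kernel */
	uint32_t attrs;		/* extra parameter, new ABI only */
};

/* Build the command, using the old layout for kernels with an old ABI. */
static void sketch_build_reg_mr(struct sketch_reg_mr_cmd *cmd, int abi_version,
				uint64_t start, uint64_t length,
				uint32_t access, uint32_t attrs)
{
	assert(cmd);

	cmd->start  = start;
	cmd->length = length;
	cmd->access = access;

	if (abi_version >= SKETCH_ABI_HAS_MR_ATTRS) {
		cmd->has_attrs = 1;
		cmd->attrs = attrs;
	} else {
		/* old kernel: do not pass the extra parameter at all */
		cmd->has_attrs = 0;
		cmd->attrs = 0;
	}
}
```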


From rdreier at cisco.com  Mon Jul 16 10:14:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:14:03 -0700
Subject: [ofa-general] Re: [PATCH 04/10] IB/ehca: use common error code
	mapping instead of specific ones
In-Reply-To: <200707121749.03556.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 12 Jul 2007 17:49:02 +0200")
References: <200707121745.27592.fenkes@de.ibm.com>
	<200707121749.03556.fenkes@de.ibm.com>
Message-ID: <adaejj8w9t0.fsf@cisco.com>

 > @@ -161,8 +161,11 @@ static inline int ehca2ib_return_code(u64 ehca_rc)

applied, but as a further cleanup it seems that ehca2ib_return_code()
should be moved into a .c file and moved out of line -- I think it
would probably shrink the compiled code quite a bit, and as far as I
can see it is never used in the data path where the function call
overhead would matter at all.


From rdreier at cisco.com  Mon Jul 16 10:37:09 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:37:09 -0700
Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs
In-Reply-To: <200707121754.20293.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Thu, 12 Jul 2007 17:54:19 +0200")
References: <200707121745.27592.fenkes@de.ibm.com>
	<200707121754.20293.fenkes@de.ibm.com>
Message-ID: <adaabtww8qi.fsf@cisco.com>

 > Add support for MR pages larger than 4K on eHCA2. This reduces firmware
 > memory consumption. If enabled via the mr_largepage module parameter, the MR
 > page size will be determined based on the MR length and the hardware
 > capabilities - if the MR is >= 16M, 16M pages are used, for example.

Why the module parameter?  Is there any reason a user would want to
turn this off?  Or conversely, why is it off by default?

Also this patch seems to depend heavily on the multiple EQ patch,
which I am holding off on now.  So you may want to rebase to my
current tree, which has all the ehca patches except the EQ one.

 >  static ssize_t ehca_show_nr_eqs(struct device *dev,
 >  				struct device_attribute *attr,
 >  				char *buf)
 >  {
 >  	return sprintf(buf, "%d\n", ehca_nr_eqs);
 >  }
 > -
 >  static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL);

Although trivial, this chunk doesn't really belong in this patch --
just fix it up in the multiple EQ patch (which I haven't merged yet).

 - R.
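
For reference, the selection rule described in the changelog ("if the
MR is >= 16M, 16M pages are used") could look roughly like the
userspace sketch below.  The candidate sizes other than 4K and 16M,
the capability bitmask and every sketch_* name are assumptions made for
the example, not taken from the ehca code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PGSZ_4K	(UINT64_C(1) << 12)
#define SKETCH_PGSZ_64K	(UINT64_C(1) << 16)	/* assumed candidate */
#define SKETCH_PGSZ_1M	(UINT64_C(1) << 20)	/* assumed candidate */
#define SKETCH_PGSZ_16M	(UINT64_C(1) << 24)

/*
 * Pick the MR page size: the largest size the hardware supports that is
 * not bigger than the MR itself.  hw_pgsz_mask is the OR of supported
 * sizes (hypothetical encoding).  Falls back to 4K.
 */
static uint64_t sketch_mr_page_size(uint64_t mr_len, uint64_t hw_pgsz_mask,
				    int largepage_enabled)
{
	static const uint64_t candidates[] = {
		SKETCH_PGSZ_16M, SKETCH_PGSZ_1M, SKETCH_PGSZ_64K,
	};
	size_t i;

	if (!largepage_enabled)
		return SKETCH_PGSZ_4K;

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		if ((hw_pgsz_mask & candidates[i]) && mr_len >= candidates[i])
			return candidates[i];
	}
	return SKETCH_PGSZ_4K;
}
```

Written this way the rule already falls back to 4K on its own when the
MR is small or the hardware lacks the larger sizes.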


From rdreier at cisco.com  Mon Jul 16 10:38:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:38:52 -0700
Subject: [ofa-general] [PATCH] IB/iser: Make a couple of functions static
Message-ID: <ada644kw8nn.fsf@cisco.com>

Make iser_conn_release() and iser_start_rdma_unaligned_sg() static,
since they are only used in the .c file where they are defined.  In
addition to being a cleanup, this even shrinks the generated code by
allowing the single call of iser_start_rdma_unaligned_sg() to be
inlined into its callsite.  On x86_64:

add/remove: 0/1 grow/shrink: 1/0 up/down: 466/-533 (-67)
function                                     old     new   delta
iser_reg_rdma_mem                           1518    1984    +466
iser_start_rdma_unaligned_sg                 533       -    -533

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
Erez, does this look OK to merge for 2.6.23?

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 8960196..671faff 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -310,8 +310,6 @@ int  iser_conn_init(struct iser_conn **ib_conn);
 
 void iser_conn_terminate(struct iser_conn *ib_conn);
 
-void iser_conn_release(struct iser_conn *ib_conn);
-
 void iser_rcv_completion(struct iser_desc *desc,
 			 unsigned long    dto_xfer_len);
 
@@ -329,9 +327,6 @@ void iser_reg_single(struct iser_device      *device,
 		     struct iser_regd_buf    *regd_buf,
 		     enum dma_data_direction direction);
 
-int  iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task    *ctask,
-				  enum iser_data_dir            cmd_dir);
-
 void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask,
 				     enum iser_data_dir         cmd_dir);
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index fc9f1fd..36cdf77 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -103,8 +103,8 @@ void iser_reg_single(struct iser_device *device,
 /**
  * iser_start_rdma_unaligned_sg
  */
-int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task  *iser_ctask,
-				 enum iser_data_dir cmd_dir)
+static int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask,
+					enum iser_data_dir cmd_dir)
 {
 	int dma_nents;
 	struct ib_device *dev;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3702e23..132edc6 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -311,6 +311,29 @@ static int iser_conn_state_comp_exch(struct iser_conn *ib_conn,
 }
 
 /**
+ * Frees all conn objects and deallocs conn descriptor
+ */
+static void iser_conn_release(struct iser_conn *ib_conn)
+{
+	struct iser_device  *device = ib_conn->device;
+
+	BUG_ON(ib_conn->state != ISER_CONN_DOWN);
+
+	mutex_lock(&ig.connlist_mutex);
+	list_del(&ib_conn->conn_list);
+	mutex_unlock(&ig.connlist_mutex);
+
+	iser_free_ib_conn_res(ib_conn);
+	ib_conn->device = NULL;
+	/* on EVENT_ADDR_ERROR there's no device yet for this conn */
+	if (device != NULL)
+		iser_device_try_release(device);
+	if (ib_conn->iser_conn)
+		ib_conn->iser_conn->ib_conn = NULL;
+	kfree(ib_conn);
+}
+
+/**
  * triggers start of the disconnect procedures and wait for them to be done
  */
 void iser_conn_terminate(struct iser_conn *ib_conn)
@@ -550,30 +573,6 @@ connect_failure:
 }
 
 /**
- * Frees all conn objects and deallocs conn descriptor
- */
-void iser_conn_release(struct iser_conn *ib_conn)
-{
-	struct iser_device  *device = ib_conn->device;
-
-	BUG_ON(ib_conn->state != ISER_CONN_DOWN);
-
-	mutex_lock(&ig.connlist_mutex);
-	list_del(&ib_conn->conn_list);
-	mutex_unlock(&ig.connlist_mutex);
-
-	iser_free_ib_conn_res(ib_conn);
-	ib_conn->device = NULL;
-	/* on EVENT_ADDR_ERROR there's no device yet for this conn */
-	if (device != NULL)
-		iser_device_try_release(device);
-	if (ib_conn->iser_conn)
-		ib_conn->iser_conn->ib_conn = NULL;
-	kfree(ib_conn);
-}
-
-
-/**
  * iser_reg_page_vec - Register physical memory
  *
  * returns: 0 on success, errno code on failure


From rdreier at cisco.com  Mon Jul 16 10:43:02 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:43:02 -0700
Subject: [ofa-general] is ipath_layer.c dead code?
Message-ID: <ada1wf8w8gp.fsf@cisco.com>

My kernel seems to build and link fine with the patch below.  Is
ipath_layer.c being used for anything, or can we just kill it?

 - R.

diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile
index ec2e603..fe67388 100644
--- a/drivers/infiniband/hw/ipath/Makefile
+++ b/drivers/infiniband/hw/ipath/Makefile
@@ -14,7 +14,6 @@ ib_ipath-y := \
 	ipath_init_chip.o \
 	ipath_intr.o \
 	ipath_keys.o \
-	ipath_layer.o \
 	ipath_mad.o \
 	ipath_mmap.o \
 	ipath_mr.o \
diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c
deleted file mode 100644
index 82616b7..0000000
--- a/drivers/infiniband/hw/ipath/ipath_layer.c
+++ /dev/null
@@ -1,365 +0,0 @@
-/*
- * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
- * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-/*
- * These are the routines used by layered drivers, currently just the
- * layered ethernet driver and verbs layer.
- */
-
-#include <linux/io.h>
-#include <asm/byteorder.h>
-
-#include "ipath_kernel.h"
-#include "ipath_layer.h"
-#include "ipath_verbs.h"
-#include "ipath_common.h"
-
-/* Acquire before ipath_devs_lock. */
-static DEFINE_MUTEX(ipath_layer_mutex);
-
-u16 ipath_layer_rcv_opcode;
-
-static int (*layer_intr)(void *, u32);
-static int (*layer_rcv)(void *, void *, struct sk_buff *);
-static int (*layer_rcv_lid)(void *, void *);
-
-static void *(*layer_add_one)(int, struct ipath_devdata *);
-static void (*layer_remove_one)(void *);
-
-int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_intr)
-		ret = layer_intr(dd->ipath_layer.l_arg, arg);
-
-	return ret;
-}
-
-int ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
-{
-	int ret;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	ret = __ipath_layer_intr(dd, arg);
-
-	mutex_unlock(&ipath_layer_mutex);
-
-	return ret;
-}
-
-int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr,
-		      struct sk_buff *skb)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_rcv)
-		ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb);
-
-	return ret;
-}
-
-int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_rcv_lid)
-		ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr);
-
-	return ret;
-}
-
-void ipath_layer_lid_changed(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (dd->ipath_layer.l_arg && layer_intr)
-		layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID);
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-void ipath_layer_add(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (layer_add_one)
-		dd->ipath_layer.l_arg =
-			layer_add_one(dd->ipath_unit, dd);
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-void ipath_layer_remove(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (dd->ipath_layer.l_arg && layer_remove_one) {
-		layer_remove_one(dd->ipath_layer.l_arg);
-		dd->ipath_layer.l_arg = NULL;
-	}
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *),
-			 void (*l_remove)(void *),
-			 int (*l_intr)(void *, u32),
-			 int (*l_rcv)(void *, void *, struct sk_buff *),
-			 u16 l_rcv_opcode,
-			 int (*l_rcv_lid)(void *, void *))
-{
-	struct ipath_devdata *dd, *tmp;
-	unsigned long flags;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	layer_add_one = l_add;
-	layer_remove_one = l_remove;
-	layer_intr = l_intr;
-	layer_rcv = l_rcv;
-	layer_rcv_lid = l_rcv_lid;
-	ipath_layer_rcv_opcode = l_rcv_opcode;
-
-	spin_lock_irqsave(&ipath_devs_lock, flags);
-
-	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
-		if (!(dd->ipath_flags & IPATH_INITTED))
-			continue;
-
-		if (dd->ipath_layer.l_arg)
-			continue;
-
-		spin_unlock_irqrestore(&ipath_devs_lock, flags);
-		dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd);
-		spin_lock_irqsave(&ipath_devs_lock, flags);
-	}
-
-	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-	mutex_unlock(&ipath_layer_mutex);
-
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_register);
-
-void ipath_layer_unregister(void)
-{
-	struct ipath_devdata *dd, *tmp;
-	unsigned long flags;
-
-	mutex_lock(&ipath_layer_mutex);
-	spin_lock_irqsave(&ipath_devs_lock, flags);
-
-	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
-		if (dd->ipath_layer.l_arg && layer_remove_one) {
-			spin_unlock_irqrestore(&ipath_devs_lock, flags);
-			layer_remove_one(dd->ipath_layer.l_arg);
-			spin_lock_irqsave(&ipath_devs_lock, flags);
-			dd->ipath_layer.l_arg = NULL;
-		}
-	}
-
-	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
-	layer_add_one = NULL;
-	layer_remove_one = NULL;
-	layer_intr = NULL;
-	layer_rcv = NULL;
-	layer_rcv_lid = NULL;
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_unregister);
-
-int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax)
-{
-	int ret;
-	u32 intval = 0;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	if (!dd->ipath_layer.l_arg) {
-		ret = -EINVAL;
-		goto bail;
-	}
-
-	ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS);
-
-	if (ret < 0)
-		goto bail;
-
-	*pktmax = dd->ipath_ibmaxlen;
-
-	if (*dd->ipath_statusp & IPATH_STATUS_IB_READY)
-		intval |= IPATH_LAYER_INT_IF_UP;
-	if (dd->ipath_lid)
-		intval |= IPATH_LAYER_INT_LID;
-	if (dd->ipath_mlid)
-		intval |= IPATH_LAYER_INT_BCAST;
-	/*
-	 * do this on open, in case low level is already up and
-	 * just layered driver was reloaded, etc.
-	 */
-	if (intval)
-		layer_intr(dd->ipath_layer.l_arg, intval);
-
-	ret = 0;
-bail:
-	mutex_unlock(&ipath_layer_mutex);
-
-	return ret;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_open);
-
-u16 ipath_layer_get_lid(struct ipath_devdata *dd)
-{
-	return dd->ipath_lid;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_lid);
-
-/**
- * ipath_layer_get_mac - get the MAC address
- * @dd: the infinipath device
- * @mac: the MAC is put here
- *
- * This is the EUID-64 OUI octets (top 3), then
- * skip the next 2 (which should both be zero or 0xff).
- * The returned MAC is in network order
- * mac points to at least 6 bytes of buffer
- * We assume that by the time the LID is set, that the GUID is as valid
- * as it's ever going to be, rather than adding yet another status bit.
- */
-
-int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac)
-{
-	u8 *guid;
-
-	guid = (u8 *) &dd->ipath_guid;
-
-	mac[0] = guid[0];
-	mac[1] = guid[1];
-	mac[2] = guid[2];
-	mac[3] = guid[5];
-	mac[4] = guid[6];
-	mac[5] = guid[7];
-	if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff))
-		ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: "
-			  "%x %x\n", guid[3], guid[4]);
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_mac);
-
-u16 ipath_layer_get_bcast(struct ipath_devdata *dd)
-{
-	return dd->ipath_mlid;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_bcast);
-
-int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr)
-{
-	int ret = 0;
-	u32 __iomem *piobuf;
-	u32 plen, *uhdr;
-	size_t count;
-	__be16 vlsllnh;
-
-	if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) {
-		ipath_dbg("send while not open\n");
-		ret = -EINVAL;
-	} else
-		if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) ||
-		    dd->ipath_lid == 0) {
-			/*
-			 * lid check is for when sma hasn't yet configured
-			 */
-			ret = -ENETDOWN;
-			ipath_cdbg(VERBOSE, "send while not ready, "
-				   "mylid=%u, flags=0x%x\n",
-				   dd->ipath_lid, dd->ipath_flags);
-		}
-
-	vlsllnh = *((__be16 *) hdr);
-	if (vlsllnh != htons(IPATH_LRH_BTH)) {
-		ipath_dbg("Warning: lrh[0] wrong (%x, not %x); "
-			  "not sending\n", be16_to_cpu(vlsllnh),
-			  IPATH_LRH_BTH);
-		ret = -EINVAL;
-	}
-	if (ret)
-		goto done;
-
-	/* Get a PIO buffer to use. */
-	piobuf = ipath_getpiobuf(dd, NULL);
-	if (piobuf == NULL) {
-		ret = -EBUSY;
-		goto done;
-	}
-
-	plen = (sizeof(*hdr) >> 2); /* actual length */
-	ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf);
-
-	writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */
-	ipath_flush_wc();
-	piobuf += 2;
-	uhdr = (u32 *)hdr;
-	count = plen-1; /* amount we can copy before trigger word */
-	__iowrite32_copy(piobuf, uhdr, count);
-	ipath_flush_wc();
-	__raw_writel(uhdr[count], piobuf + count);
-	ipath_flush_wc(); /* ensure it's sent, now */
-
-	ipath_stats.sps_ether_spkts++;	/* ether packet sent */
-
-done:
-	return ret;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_send_hdr);
-
-int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd)
-{
-	set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl);
-
-	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-			 dd->ipath_sendctrl);
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int);
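
The MAC derivation described in the ipath_layer_get_mac() comment in the
removed file can be shown in isolation.  What follows is a minimal
user-space sketch in C, assuming a hypothetical guid_to_mac() helper that
takes the 8 GUID bytes in the order the driver stores them.  It mirrors the
byte selection of the removed function but is not kernel code, and the
return-value convention is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 6-byte MAC from an 8-byte GUID by keeping
 * the three OUI octets, skipping bytes 3 and 4, and keeping the last
 * three octets.
 *
 * Returns 0 if the skipped bytes are both 0x00 or both 0xff (the layout
 * the removed code expected), and 1 otherwise.
 */
static int guid_to_mac(const uint8_t guid[8], uint8_t mac[6])
{
	assert(guid != NULL && mac != NULL);

	mac[0] = guid[0];
	mac[1] = guid[1];
	mac[2] = guid[2];
	mac[3] = guid[5];
	mac[4] = guid[6];
	mac[5] = guid[7];

	/* Both skipped bytes zero, or both 0xff, is the expected layout. */
	if (guid[3] == 0x00 && guid[4] == 0x00)
		return 0;
	if (guid[3] == 0xff && guid[4] == 0xff)
		return 0;
	return 1;
}
```

The removed function only logged a debug message in the unexpected case and
still returned 0; the sketch reports it through the return value instead so
that a caller can check it.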


From rdreier at cisco.com  Mon Jul 16 10:49:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:49:12 -0700
Subject: [ofa-general] [PATCH] IB/ipath: Make a few functions static
In-Reply-To: <ada1wf8w8gp.fsf@cisco.com> (Roland Dreier's message of "Mon,
	16 Jul 2007 10:43:02 -0700")
References: <ada1wf8w8gp.fsf@cisco.com>
Message-ID: <adawsx0utlz.fsf@cisco.com>

Make some functions that are only used in a single .c file static.  In
addition to being a cleanup, this shrinks the generated code.  On x86_64:

add/remove: 1/3 grow/shrink: 2/1 up/down: 4777/-4956 (-179)
function                                     old     new   delta
handle_errors                                  -    3994   +3994
__verbs_timer                                 42     710    +668
ipath_do_ruc_send                           2131    2246    +115
ipath_no_bufs_available                      136       -    -136
ipath_disarm_senderrbufs                     639       -    -639
ipath_ib_timer                               658       -    -658
ipath_intr                                  5878    2355   -3523

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
Does this look OK to merge for 2.6.23?
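
As a cross-check, the figures in the summary line of the size table follow
directly from the per-function old and new sizes.  A minimal sketch in C,
assuming a hypothetical struct size_row and summarize() helper (neither is
part of the kernel tree or of any size-reporting tool); a size of 0 stands
for the "-" entries in the table:

```c
#include <assert.h>
#include <stddef.h>

/* One row of the size table; 0 means the symbol is absent ("-"). */
struct size_row {
	const char *name;
	long old_size;
	long new_size;
};

struct size_summary {
	int added, removed;	/* symbols that appeared / disappeared */
	int grew, shrank;	/* symbols present on both sides */
	long up, down;		/* total bytes gained / lost (down <= 0) */
};

/* Rows copied from the x86_64 table in this mail. */
static const struct size_row ipath_rows[] = {
	{ "handle_errors",               0, 3994 },
	{ "__verbs_timer",              42,  710 },
	{ "ipath_do_ruc_send",        2131, 2246 },
	{ "ipath_no_bufs_available",   136,    0 },
	{ "ipath_disarm_senderrbufs",  639,    0 },
	{ "ipath_ib_timer",            658,    0 },
	{ "ipath_intr",               5878, 2355 },
};

/* Hypothetical helper: recompute the summary line from the rows. */
static struct size_summary summarize(const struct size_row *rows, size_t n)
{
	struct size_summary s = { 0, 0, 0, 0, 0, 0 };
	size_t i;

	assert(rows != NULL || n == 0);

	for (i = 0; i < n; i++) {
		long delta = rows[i].new_size - rows[i].old_size;

		if (rows[i].old_size == 0 && rows[i].new_size != 0)
			s.added++;
		else if (rows[i].new_size == 0 && rows[i].old_size != 0)
			s.removed++;
		else if (delta > 0)
			s.grew++;
		else if (delta < 0)
			s.shrank++;

		if (delta > 0)
			s.up += delta;
		else
			s.down += delta;
	}
	return s;
}
```

Calling summarize(ipath_rows, 7) reproduces the summary: one symbol added,
three removed, two grown, one shrunk, 4777 bytes up and 4956 bytes down,
for a net change of -179 bytes.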

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 9361f5a..09c5fd8 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1889,7 +1889,7 @@ void ipath_write_kreg_port(const struct ipath_devdata *dd, ipath_kreg regno,
 /* Below is "non-zero" to force override, but both actual LEDs are off */
 #define LED_OVER_BOTH_OFF (8)
 
-void ipath_run_led_override(unsigned long opaque)
+static void ipath_run_led_override(unsigned long opaque)
 {
 	struct ipath_devdata *dd = (struct ipath_devdata *)opaque;
 	int timeoff;
diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c
index 6b91479..b4503e9 100644
--- a/drivers/infiniband/hw/ipath/ipath_eeprom.c
+++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c
@@ -426,8 +426,8 @@ bail:
  * @buffer: data to write
  * @len: number of bytes to write
  */
-int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset,
-				const void *buffer, int len)
+static int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset,
+				       const void *buffer, int len)
 {
 	u8 single_byte;
 	int sub_len;
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 47aa434..1fd91c5 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -70,7 +70,7 @@ static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum)
  * If rewrite is true, and bits are set in the sendbufferror registers,
  * we'll write to the buffer, for error recovery on parity errors.
  */
-void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite)
+static void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite)
 {
 	u32 piobcnt;
 	unsigned long sbuf[4];
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index 3105005..b6ccd04 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -776,7 +776,6 @@ void ipath_get_eeprom_info(struct ipath_devdata *);
 int ipath_update_eeprom_log(struct ipath_devdata *dd);
 void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr);
 u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg);
-void ipath_disarm_senderrbufs(struct ipath_devdata *, int);
 
 /*
  * Set LED override, only the two LSBs have "public" meaning, but
diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index 8525674..c69c252 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -507,7 +507,7 @@ static int want_buffer(struct ipath_devdata *dd)
  *
  * Called when we run out of PIO buffers.
  */
-void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev)
+static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev)
 {
 	unsigned long flags;
 
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 65f7181..16aa61f 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -488,7 +488,7 @@ bail:;
  * This is called from ipath_do_rcv_timer() at interrupt level to check for
  * QPs which need retransmits and to collect performance numbers.
  */
-void ipath_ib_timer(struct ipath_ibdev *dev)
+static void ipath_ib_timer(struct ipath_ibdev *dev)
 {
 	struct ipath_qp *resend = NULL;
 	struct list_head *last;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index f3d1f2c..9bbe819 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -782,8 +782,6 @@ void ipath_update_mmap_info(struct ipath_ibdev *dev,
 
 int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
 
-void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev);
-
 void ipath_insert_rnr_queue(struct ipath_qp *qp);
 
 int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only);
@@ -807,8 +805,6 @@ void ipath_ib_rcv(struct ipath_ibdev *, void *, void *, u32);
 
 int ipath_ib_piobufavail(struct ipath_ibdev *);
 
-void ipath_ib_timer(struct ipath_ibdev *);
-
 unsigned ipath_get_npkeys(struct ipath_devdata *);
 
 u32 ipath_get_cr_errpkey(struct ipath_devdata *);


From rdreier at cisco.com  Mon Jul 16 10:49:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 10:49:51 -0700
Subject: [ofa-general] is ipath_get_user_pages_nocopy() dead code?
In-Reply-To: <ada1wf8w8gp.fsf@cisco.com> (Roland Dreier's message of "Mon,
	16 Jul 2007 10:43:02 -0700")
References: <ada1wf8w8gp.fsf@cisco.com>
Message-ID: <adasl7outkw.fsf@cisco.com>

I don't see any callers of ipath_get_user_pages_nocopy().  Should we
just delete it?

 - R.


From pradeeps at linux.vnet.ibm.com  Mon Jul 16 11:00:49 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 16 Jul 2007 11:00:49 -0700
Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation
	(for IPoIB CM)
In-Reply-To: <4697A9A3.2020706@linux.vnet.ibm.com>
References: <OFE2D9DB0E.0AD8F1B0-ON85257300.007854EA-85257300.0079CE19@us.ibm.com>
	<adamyyt2fp5.fsf@cisco.com> <469680DB.6000602@linux.vnet.ibm.com>
	<adad4yx43cc.fsf@cisco.com> <4697A9A3.2020706@linux.vnet.ibm.com>
Message-ID: <469BB251.5050808@linux.vnet.ibm.com>

Pradeep Satyanarayana wrote:
> Roland Dreier wrote:
>>  > In the absence of any further discussions about the IPoIB CM without SRQ
>>  > patches, I will incorporate Sean Hefty's comments and plan to resubmit
>>  > the patches, unless I hear something soon.
>>
>> Sorry for not devoting enough time to this, but something always seems
>> to come up, and I really want to be able to focus a concentrated chunk
>> of time on this, and I never seem to be able to.  Anyway, I would
>> prefer to find a solution that everyone can agree on, without me
>> having to rule by decree.
>>
>> I think updating the patch is a good idea.  Although I didn't get a
>> chance to review it carefully there were a number of obvious messy
>> parts that should be cleaned up.
>>
>> I am beginning to think that your basic approach is probably right,
>> but I also still think it should be possible to handle both SRQ and
>> non-SRQ without any overhead on the fast path.  I don't understand the
>> "maintainability" argument against doing this.  Can you expand on your
>> position a little?
>>
> 
> I will try to illustrate with an example:
> 
> One of the ways to do this is to completely split SRQ and non-SRQ
> processing starting in ipoib_poll(). This would eliminate most of
> the if (srq) kind of branches. However, there would be a lot of code
> duplication. If a bug is discovered in one path, then one needs to
> fix that in the other path too.
> 
> One way to mitigate this situation is to alter the current SRQ code
> to use common code (between SRQ and non-SRQ). However, one might not
> want to factor out a few lines of common code into a new function. There
> may be several such occurrences of this, resulting in code bloat.
> 
> If you look back, several weeks ago ipoib_drain_cq() did not exist. This
> is another function that calls ipoib_cm_handle_rx_wc(). We would need
> to alter this function too to accommodate SRQ and non-SRQ split. In
> effect, we have propagated the SRQ and non-SRQ code to functions
> outside ipoib_cm.c. In the future, if IPoIB CM were to support UC mode
> this might mean additional functions handling the split.
> 
> On the other hand, in V6 (and previous versions) of the patch
> ipoib_cm_handle_rx_wc() handles the SRQ and non-SRQ paths. Both SRQ and
> non-SRQ functionality is contained within ipoib_cm.c. What we now have
> is probably one more branch in the packet handling path than the
> minimum (desired), with a lot of common code.
> 
> Pradeep
> 

Roland,

Since the merge window will close in the next few days, do you have a
few suggestions that you would like me to incorporate into the patch?

Pradeep
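
The trade-off in the quoted message can be sketched in C.  All names below
(struct rx_ctx, deliver_packet(), handle_rx_common() and the counters) are
hypothetical and only illustrate the shape of the single-handler approach;
they are not taken from ipoib_cm.c.  The shared work lives in one function,
and the handler pays one "has_srq" test per packet:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive context; the counters stand in for real work. */
struct rx_ctx {
	bool has_srq;
	int delivered;		/* packets passed up the stack */
	int srq_reposts;	/* buffers reposted to a shared queue */
	int qp_reposts;		/* buffers reposted to a per-connection queue */
};

/* Work that is identical for both modes lives in one function. */
static void deliver_packet(struct rx_ctx *ctx)
{
	ctx->delivered++;
}

/* One handler for both modes: a single extra branch per packet. */
static void handle_rx_common(struct rx_ctx *ctx)
{
	assert(ctx != NULL);

	deliver_packet(ctx);
	if (ctx->has_srq)
		ctx->srq_reposts++;
	else
		ctx->qp_reposts++;
}
```

Splitting this into two handlers would remove the branch, at the cost of
each handler repeating the call sequence around deliver_packet(), which is
the duplication the quoted message is concerned about.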


From sean.hefty at intel.com  Mon Jul 16 11:03:06 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 16 Jul 2007 11:03:06 -0700
Subject: [ofa-general] [PATCH] for-23 ib/local_sa: adjust data offset by
	attribute offset, not size
Message-ID: <000301c7c7d3$8ff4a3f0$3c98070a@amr.corp.intel.com>

I merged the patch below with the local_sa patch in my for-roland branch.
It's shown below separately for review purposes only.

The fix is based on code review, versus an observed bug.  (Since a
path record is 64 bytes, it's almost guaranteed that the size and
offset will be the same.)

- Sean


We should adjust the data offset by the attribute offset, and
not the size of the attribute.  The attribute offset includes
any necessary padding between the attributes.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 drivers/infiniband/core/local_sa.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)
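
To illustrate why the two values can differ: when attributes are laid out
back to back in a payload, the distance between consecutive attributes (the
attribute offset) may include padding, so it can exceed the attribute size.
A minimal sketch in C, assuming a hypothetical flat buffer and a
hypothetical nth_attr() helper; it is illustrative only and is not the
local_sa code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a pointer to the n-th attribute in a flat
 * buffer where each attribute occupies attr_offset bytes (attr_size bytes
 * of data plus any padding).  Returns NULL if the attribute does not fit.
 */
static const unsigned char *nth_attr(const unsigned char *buf, size_t buf_len,
				     size_t attr_size, size_t attr_offset,
				     size_t n)
{
	size_t start;

	assert(buf != NULL);
	assert(attr_offset > 0 && attr_offset >= attr_size);

	/* Reject out-of-range n; this also keeps the multiply from overflowing. */
	if (n > buf_len / attr_offset)
		return NULL;

	start = n * attr_offset;	/* stride by the offset, not the size */
	if (buf_len - start < attr_size)
		return NULL;

	return buf + start;
}
```

With attr_size 6 and attr_offset 8, attribute 1 starts at byte 8; stepping
by the size would land on byte 6 instead.  When size and offset are equal,
as expected for a 64-byte path record, both strides give the same result.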

diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c
index 6c073a3..75545a5 100644
--- a/drivers/infiniband/core/local_sa.c
+++ b/drivers/infiniband/core/local_sa.c
@@ -369,11 +369,11 @@ static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter)
 				/* copy the second piece of the attribute */
 				memcpy(iter->attr + offset, &mad->data[0],
 				       iter->attr_size - offset);
-				iter->data_offset = iter->attr_size - offset;
+				iter->data_offset = iter->attr_offset - offset;
 				offset = 0;
 			} else {
 				iter->attr = &mad->data[iter->data_offset];
-				iter->data_offset += iter->attr_size;
+				iter->data_offset += iter->attr_offset;
 			}
 
 			iter->data_left -= iter->attr_offset;


From bob.kossey at hp.com  Mon Jul 16 11:54:21 2007
From: bob.kossey at hp.com (Bob Kossey)
Date: Mon, 16 Jul 2007 14:54:21 -0400
Subject: [ofa-general] RFC OFED-1.3 installation
In-Reply-To: <469BBB83.2010100@hp.com>
References: <469BBB83.2010100@hp.com>
Message-ID: <469BBEDD.20806@hp.com>

Hi Vlad,

This looks good, a few comments.  As you are splitting out RPM spec
files for each package, I would like to see the RPM release numbers be
consistently updated whenever changes are made to a package. 
Ideally, this would be coordinated with the release numbers from
the distros, so that we could tell whether a version of an OFED RPM
in a distro was older than, the same as, or more recent than an OFED RPM
from openfabrics.org.  This would also allow us to update them with
rpm -Uvh.  Extra credit would be given for adding dependency information to
the packages. 

I also like the idea of clearly separating the build of the RPMs from
their installation.  I would like to see all target system modifications
be made by RPM files or postinstall scripts, rather than from the
install.pl script, which may not always be run on a target.

Thanks,
Bob


> Hi,
> I am starting to work on the new installation procedure for OFED-1.3.
> Please review and comment.
>
> Main changes from OFED-1.2:
> - Split ofa_user-1.2.src.rpm into separate source RPMs per package.
>   * Requires an RPM spec file for each package.
>     Currently, the following packages lack an RPM spec file:
>         libehca,
>         mstflint,
>         qlvnictools,
>         perftest,
>         sdpnetstat
>
> User space RPM packages list taken from maintainers' RPM spec files:
>
> libibverbs:
>     libibverbs
>     libibverbs-devel
>     libibverbs-devel-static
>     libibverbs-utils
>
> libmthca:
>     libmthca
>     libmthca-devel-static
>
> libehca:
>     No RPM spec file
>
> libipathverbs:
>     libipathverbs
>     libipathverbs-devel
>
> libibcm:
>     libibcm
>     libibcm-devel
>
> libsdp:
>     libsdp
>     libsdp-devel should be created
>
> librdmacm:
>     librdmacm
>     librdmacm-devel
>     librdmacm-utils
>
> libcxgb3:
>     libcxgb3
>     libcxgb3-devel
>
>     Note: libcxgb3 rpmbuild fails:
>     cp: cannot stat `ChangeLog': No such file or directory
>
> management:
>     libibcommon
>     libibcommon-devel
>     libibmad
>     libibmad-devel
>     libibumad
>     libibumad-devel
>     opensm
>     opensm-libs
>     opensm-devel
>     opensm-static
>     infiniband-diags
>
> dapl:
>     dapl
>     dapl-devel
>     dapl-utils
>
> srptools:
>     srptools
>
> ibutils:
>     ibutils
>
> mpi-selector:
>     mpi-selector
>
> - OFED-1.3 build procedure:
>   OFED-1.3 daily/rc builds will be created on OFA server:
>     userspace and kernel packages will be taken from git trees:
>     git.openfabrics.org/ofed_1_3/package.git ofed_1_3
>
>     Source RPMs will be created for each userspace package in the following way:
>
>     git clone ...
>     autogen.sh
>     configure --disable-libcheck
>     make dist
>     rpmbuild -bs package.spec
>
>     The following packages will be taken from maintainers as src.rpm:
>
>     mvapich     http://www.openfabrics.org/~pasha/ofed_1_3/mvapich
>     mvapich2    http://www.openfabrics.org/~rowland/ofed_1_3
>     openmpi     http://www.openfabrics.org/~jsquyres/ofed_1_3
>     mpitests    http://www.openfabrics.org/~pasha/ofed_1_3/mpitests
>     rds-tools   http://www.openfabrics.org/~vlad/ofed_1_3/rds-tools
>     ib-bonding  http://www.openfabrics.org/~monis/ofed_1_3
>
>
>
> - OFED-1.3 Installation
>   install.pl script
>   Flow:
>     make list of packages following selection and dependencies.
>     for package in the list:
>         build RPM from package.src.rpm
>         install package RPM
>     go to the next package in the list
>
>     configuration if required
>
>
> Regards,
> Vladimir
>
>
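
The first step of the install flow in the quoted proposal ("make list of
packages following selection and dependencies") amounts to ordering
packages so that each one comes after everything it needs.  A minimal
sketch in C under stated assumptions: struct pkg, visit() and
install_order() are hypothetical, the dependency graph is assumed to have
no cycles, and nothing here reflects how install.pl is or will be written.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKGS 8

/* Hypothetical package record: deps[] holds indices into the same array. */
struct pkg {
	const char *name;
	int deps[MAX_PKGS];
	int ndeps;
};

/* Depth-first visit: emit every dependency before the package itself. */
static void visit(const struct pkg *pkgs, int i, int *done, int *order,
		  int *count)
{
	int d;

	if (done[i])
		return;
	done[i] = 1;
	for (d = 0; d < pkgs[i].ndeps; d++)
		visit(pkgs, pkgs[i].deps[d], done, order, count);
	order[(*count)++] = i;
}

/*
 * Fill order[] with the selected package and everything it needs,
 * dependencies first.  Assumes the dependency graph has no cycles.
 * Returns the number of entries written.
 */
static int install_order(const struct pkg *pkgs, int npkgs, int selected,
			 int *order)
{
	int done[MAX_PKGS] = { 0 };
	int count = 0;

	assert(pkgs != NULL && order != NULL);
	assert(npkgs > 0 && npkgs <= MAX_PKGS);
	assert(selected >= 0 && selected < npkgs);

	visit(pkgs, selected, done, order, &count);
	return count;
}
```

The rest of the flow would then walk order[] from the front, building and
installing one package at a time; packages that the selection does not
depend on never enter the list.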


From xma at us.ibm.com  Mon Jul 16 12:32:59 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 16 Jul 2007 12:32:59 -0700
Subject: [ofa-general] RFC OFED-1.3 installation
In-Reply-To: <469BBEDD.20806@hp.com>
Message-ID: <OF81499DC8.2B3ABFBB-ON8725731A.006B0533-8825731A.003F70B0@us.ibm.com>


Does ib-utils depend on opensm-libs? If so, I would suggest changing
opensm-libs to libsmutils. Otherwise ib-utils won't work without installing
the opensm package. Does this make sense?

Thanks
Shirley


    From:     Bob Kossey <bob.kossey at hp.com>
    Sent by:  general-bounces at lists.openfabrics.org
    Date:     07/16/07 11:54 AM
    To:       general at lists.openfabrics.org
    Subject:  Re: [ofa-general] RFC OFED-1.3 installation

Hi Vlad,

This looks good, a few comments.  As you are splitting out RPM spec
files for each package, I would like to see the RPM release numbers be
consistently updated whenever changes are made to a package.
Ideally, this would be coordinated with the release numbers from
the distros, so that we could tell whether a version of an OFED RPM
in a distro was older than, the same as, or more recent than an OFED RPM
from openfabrics.org.  This would also allow us to update them with
rpm -Uvh.  Extra credit would be given for adding dependency information to
the packages.

I also like the idea of clearly separating the build of the RPMs from
their installation.  I would like to see all target system modifications
be made by RPM files or postinstall scripts, rather than from the
install.pl script, which may not always be run on a target.

Thanks,
Bob


> Hi,
> I am starting to work on the new installation procedure for OFED-1.3.
> Please review and comment.
>
> Main changes from OFED-1.2:
> - Split ofa_user-1.2.src.rpm into separate source RPMs per package.
>   * Requires an RPM spec file for each package.
>     Currently, the following packages lack an RPM spec file:
>         libehca,
>         mstflint,
>         qlvnictools,
>         perftest,
>         sdpnetstat
>
> User space RPM packages list taken from maintainers' RPM spec files:
>
> libibverbs:
>     libibverbs
>     libibverbs-devel
>     libibverbs-devel-static
>     libibverbs-utils
>
> libmthca:
>     libmthca
>     libmthca-devel-static
>
> libehca:
>     No RPM spec file
>
> libipathverbs:
>     libipathverbs
>     libipathverbs-devel
>
> libibcm:
>     libibcm
>     libibcm-devel
>
> libsdp:
>     libsdp
>     libsdp-devel should be created
>
> librdmacm:
>     librdmacm
>     librdmacm-devel
>     librdmacm-utils
>
> libcxgb3:
>     libcxgb3
>     libcxgb3-devel
>
>     Note: libcxgb3 rpmbuild fails:
>     cp: cannot stat `ChangeLog': No such file or directory
>
> management:
>     libibcommon
>     libibcommon-devel
>     libibmad
>     libibmad-devel
>     libibumad
>     libibumad-devel
>     opensm
>     opensm-libs
>     opensm-devel
>     opensm-static
>     infiniband-diags
>
> dapl:
>     dapl
>     dapl-devel
>     dapl-utils
>
> srptools:
>     srptools
>
> ibutils:
>     ibutils
>
> mpi-selector:
>     mpi-selector
>
> - OFED-1.3 build procedure:
>   OFED-1.3 daily/rc builds will be created on OFA server:
>     userspace and kernel packages will be taken from git trees:
>     git.openfabrics.org/ofed_1_3/package.git ofed_1_3
>
>     Source RPMs will be created for each userspace package in the following way:
>
>     git clone ...
>     autogen.sh
>     configure --disable-libcheck
>     make dist
>     rpmbuild -bs package.spec
>
>     The following packages will be taken from maintainers as src.rpm:
>
>     mvapich     http://www.openfabrics.org/~pasha/ofed_1_3/mvapich
>     mvapich2    http://www.openfabrics.org/~rowland/ofed_1_3
>     openmpi     http://www.openfabrics.org/~jsquyres/ofed_1_3
>     mpitests    http://www.openfabrics.org/~pasha/ofed_1_3/mpitests
>     rds-tools   http://www.openfabrics.org/~vlad/ofed_1_3/rds-tools
>     ib-bonding  http://www.openfabrics.org/~monis/ofed_1_3
>
>
>
> - OFED-1.3 Installation
>   install.pl script
>   Flow:
>     make list of packages following selection and dependencies.
>     for package in the list:
>         build RPM from package.src.rpm
>         install package RPM
>     go to the next package in the list
>
>     configuration if required
>
>
> Regards,
> Vladimir
>
>



From hal.rosenstock at gmail.com  Mon Jul 16 12:52:51 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 16 Jul 2007 12:52:51 -0700
Subject: [ofa-general] OpenFabrics Bugzilla change
Message-ID: <f0e08f230707161252x38de2d2cw93650f2d32d237f3@mail.gmail.com>

Hi Scott,

Would you change anything I'm a maintainer for (OpenSM and diags) over to
Sasha?

Thanks.

-- Hal

From sweitzen at cisco.com  Mon Jul 16 12:55:01 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 16 Jul 2007 12:55:01 -0700
Subject: [ofa-general] RE: OpenFabrics Bugzilla change
In-Reply-To: <f0e08f230707161252x38de2d2cw93650f2d32d237f3@mail.gmail.com>
References: <f0e08f230707161252x38de2d2cw93650f2d32d237f3@mail.gmail.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCF7FA@xmb-sjc-216.amer.cisco.com>

For just new bugs, or for existing bugs, too?
 
Scott


________________________________

	From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
	Sent: Monday, July 16, 2007 12:53 PM
	To: Scott Weitzenkamp (sweitzen)
	Cc: sashak at voltaire.com; general at lists.openfabrics.org
	Subject: OpenFabrics Bugzilla change
	
	
	Hi Scott,
	
	Would you change anything I'm a maintainer for (OpenSM and diags) over to
	Sasha ?
	
	Thanks.
	
	-- Hal
	


From mst at dev.mellanox.co.il  Mon Jul 16 13:05:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 16 Jul 2007 23:05:40 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <adabqecxpte.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<20070713054711.GA21709@mellanox.co.il> <adar6nc2mt4.fsf@cisco.com>
	<20070714175425.GA17597@mellanox.co.il> <adabqecxpte.fsf@cisco.com>
Message-ID: <20070716200540.GA8527@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: Further 2.6.23 merge plans...
> 
>  > > I haven't done any work on it or seen anything from anyone else, so I
>  > > expect this will have to wait for 2.6.24.
> 
>  > I'm surprised to hear this. How about this:
>  > http://lists.openfabrics.org/pipermail/general/2007-May/035757.html
> 
> Sure, I remember that.  But I haven't seen anything to suggest that
> anyone has given any further thought to the issues that were raised in
> that thread.

Well, the only issue I recall is about the # of EQs we want to allocate.
Was there something else?

Maybe code can be merged as-is (2 EQs) and the number be tuned
later as applications start using vectors?

-- 
MST


From FENKES at de.ibm.com  Mon Jul 16 13:34:19 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Mon, 16 Jul 2007 22:34:19 +0200
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <adalkdgz65x.fsf@cisco.com>
Message-ID: <OF82B8C559.721FE4D8-ONC125731A.00700205-C125731A.00711653@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 16.07.2007 18:04:26:

> It seems not quite right to me for the driver to advertise nr_eqs
> completion vectors, but then if round-robin is turned on to ignore the
> consumer's decision about which vector to use.

The round-robin feature was primarily meant as a debug/evaluation feature; 
it is not supposed to be active by default. ULP programmers can, for 
example, quickly evaluate the performance increase that comp_vectors could 
give them, without changing their code. Without this debug option, the 
comp_vector policy is still up to the ULPs.
 
> Maybe if round-robin is turned on you should report 0 as the number of
> completion vectors?

That sounds like a reasonable idea -- I'll change that right away.
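
A minimal sketch of the behaviour discussed above, not the ehca code:
struct eq_policy, advertised_vectors() and pick_eq() are hypothetical
names. It assumes nr_eqs > 0, that the debug round-robin switch
overrides the consumer's comp_vector, and that the driver then
advertises 0 completion vectors as suggested.

```c
/* Hypothetical stand-in for per-device EQ state (not the ehca structs). */
struct eq_policy {
    int nr_eqs;        /* number of event queues, assumed > 0 */
    int dist_eqs;      /* nonzero: debug round-robin distribution */
    unsigned next;     /* round-robin cursor */
};

/* Number of completion vectors advertised to consumers. */
static int advertised_vectors(const struct eq_policy *p)
{
    /* With round-robin on, the consumer's choice is ignored anyway,
     * so do not pretend there is anything to choose from. */
    return p->dist_eqs ? 0 : p->nr_eqs;
}

/* Pick the EQ for a new CQ; returns -1 for an out-of-range vector. */
static int pick_eq(struct eq_policy *p, int comp_vector)
{
    if (p->dist_eqs)
        return (int)(p->next++ % (unsigned)p->nr_eqs);
    if (comp_vector < 0 || comp_vector >= p->nr_eqs)
        return -1;
    return comp_vector;
}
```

With dist_eqs clear, a consumer asking for vector 2 gets EQ 2; with it
set, the request is ignored and EQs are handed out in turn.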

> Maybe the whole interface is broken and we should only be exposing
> policies to consumers instead of the specific vector?

If so, I think the policies should be handled by the IB core code instead 
of being re-invented by each driver. The IB core would then again pass 
actual comp_vector values to the driver.
 
> I think I would rather hold off on multiple EQs for this merge window
> and plan on having something really solid and thought-out for 2.6.24.

It's your call, but the code is there and I don't expect it to change a 
lot later, so it could be used by others to get a first impression of 
what's possible using comp_vectors and to gather some experience with 
them.

Regards,
  Joachim


From FENKES at de.ibm.com  Mon Jul 16 13:35:25 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Mon, 16 Jul 2007 22:35:25 +0200
Subject: [ofa-general] Re: [PATCH 04/10] IB/ehca: use common error code
 mapping instead of specific ones
In-Reply-To: <adaejj8w9t0.fsf@cisco.com>
Message-ID: <OFA7C8E574.345A8A38-ONC125731A.0071195A-C125731A.00713074@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 16.07.2007 19:14:03:

> applied, but as a further cleanup it seems that ehca2ib_return_code()
> should be moved into a .c file and moved out of line -- I think it
> would probably shrink the compiled code quite a bit, and as far as I
> can see it is never used in the data path where the function call
> overhead would matter at all.

Sounds reasonable; I'll put it in the next patch series.
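
As a rough illustration of that cleanup (the status codes and the name
hw_status_to_errno() are made up here and are not the real
ehca2ib_return_code() mapping): the header keeps only a prototype, and
the switch lives in one .c file, so its body is compiled once instead
of being expanded at every call site.

```c
#include <errno.h>

/* What would stay in the header: a prototype only. */
int hw_status_to_errno(long status);

/* What would move into a single .c file. The status values below are
 * hypothetical placeholders, not real hypervisor return codes. */
enum { HW_OK = 0, HW_BUSY = 1, HW_NO_MEM = 2, HW_BAD_PARAM = 3 };

int hw_status_to_errno(long status)
{
    switch (status) {
    case HW_OK:        return 0;
    case HW_BUSY:      return -EBUSY;
    case HW_NO_MEM:    return -ENOMEM;
    case HW_BAD_PARAM: return -EINVAL;
    default:           return -EINVAL;  /* unknown codes map to a generic error */
    }
}
```

The cost is one function call per use, which only matters on a hot
path.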

Joachim


From HNGUYEN at de.ibm.com  Mon Jul 16 13:37:44 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 16 Jul 2007 22:37:44 +0200
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <adalkdgz65x.fsf@cisco.com>
Message-ID: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 16.07.2007 18:04:26:
> Do you have any data on how well this round-robin assignment works?
> It seems not quite right to me for the driver to advertise nr_eqs
> completion vectors, but then if round-robin is turned on to ignore the
> consumer's decision about which vector to use.
No, I've no figures to provide here. The background of this dist_eqs
option is actually to allow us testing across all event queues
without having to change the test cases or consumers to use a particular
event queue number. So should I mark it as EXPERIMENTAL?
> Maybe if round-robin is turned on you should report 0 as the number of
> completion vectors?  Or maybe we should allow well-known values for
> the completion vector passed to ib_create_cq to allow consumers to
> specify a policy (like round robin) instead of a particular vector?
> Maybe the whole interface is broken and we should only be exposing
> policies to consumers instead of the specific vector?
I agree that a device driver should not override the consumer's policy
of event queue assignment. Since dist_eqs is disabled by default,
there's no issue, is there?
Regarding ib_verbs: perhaps we should provide create/destroy_eq()
and let upper level protocols or consumers dictate the assignment
to cq by passing an event queue pointer to create_cq()...
> I think I would rather hold off on multiple EQs for this merge window
> and plan on having something really solid and thought-out for 2.6.24.
Fair enough. However, why not let us gather experience with this
feature now? Should we remove the dist_eqs option for more consistency?
Thanks
Nam


From rdreier at cisco.com  Mon Jul 16 13:39:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 13:39:13 -0700
Subject: [ofa-general] Re: [PATCH draft,
	untested] ehca srq emulation (for IPoIB CM)
In-Reply-To: <469BB251.5050808@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Mon, 16 Jul 2007 11:00:49 -0700")
References: <OFE2D9DB0E.0AD8F1B0-ON85257300.007854EA-85257300.0079CE19@us.ibm.com>
	<adamyyt2fp5.fsf@cisco.com> <469680DB.6000602@linux.vnet.ibm.com>
	<adad4yx43cc.fsf@cisco.com> <4697A9A3.2020706@linux.vnet.ibm.com>
	<469BB251.5050808@linux.vnet.ibm.com>
Message-ID: <adaodiculqm.fsf@cisco.com>

 > Since the merge window will close in the next few days, do you have a
 > few suggestions that you would like me to incorporate into the patch?

The only thing I can remember from the quick look I took at your last
posting was that it used an atomic variable in a silly way to keep
track of how many connections were already established, since the way
the value was used was racy anyway.
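
To illustrate the general point (this is not the patch's code; the
limit, the counter and the function names are invented, and C11
atomics stand in for the kernel's atomic_t): an atomic counter does
not help if the check and the update are separate steps.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CONNS 2            /* hypothetical connection limit */

static atomic_int nr_conns;    /* zero-initialized at file scope */

/* Racy: two threads can both see nr_conns < MAX_CONNS and both
 * increment, so the limit is exceeded although every access is atomic. */
static bool conn_get_racy(void)
{
    if (atomic_load(&nr_conns) >= MAX_CONNS)
        return false;
    atomic_fetch_add(&nr_conns, 1);
    return true;
}

/* Check and update as one step: retry the compare-and-swap until it
 * succeeds or the limit is reached. */
static bool conn_get(void)
{
    int old = atomic_load(&nr_conns);

    do {
        if (old >= MAX_CONNS)
            return false;
        /* on failure, old is reloaded with the current value */
    } while (!atomic_compare_exchange_weak(&nr_conns, &old, old + 1));
    return true;
}

static void conn_put(void)
{
    atomic_fetch_sub(&nr_conns, 1);
}
```

If the count is only ever touched under a lock that is already held,
a plain int is enough and the atomic type buys nothing.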

 - R.


From hal.rosenstock at gmail.com  Mon Jul 16 14:05:55 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 16 Jul 2007 14:05:55 -0700
Subject: [ofa-general] Re: OpenFabrics Bugzilla change
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCF7FA@xmb-sjc-216.amer.cisco.com>
References: <f0e08f230707161252x38de2d2cw93650f2d32d237f3@mail.gmail.com>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCF7FA@xmb-sjc-216.amer.cisco.com>
Message-ID: <f0e08f230707161405m31ddf479te3c2ca740bfbd503@mail.gmail.com>

On 7/16/07, Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com> wrote:
>
>  For just new bugs, or for existing bugs, too?
>

Both. Thanks. If you want I will go over all the existing ones and reassign
them. Let me know.

-- Hal

> Scott
>
>  ------------------------------
> *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> *Sent:* Monday, July 16, 2007 12:53 PM
> *To:* Scott Weitzenkamp (sweitzen)
> *Cc:* sashak at voltaire.com; general at lists.openfabrics.org
> *Subject:* OpenFabrics Bugzilla change
>
> Hi Scott,
>
> Would you change anything I'm a maintainer for (OpenSM and diags) over to
> Sasha ?
>
> Thanks.
>
> -- Hal
>
>

From FENKES at de.ibm.com  Mon Jul 16 14:11:47 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Mon, 16 Jul 2007 23:11:47 +0200
Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs
In-Reply-To: <adaabtww8qi.fsf@cisco.com>
Message-ID: <OFD7499E67.C6B5EA9B-ONC125731A.00715FED-C125731A.007484A7@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 16.07.2007 19:37:09:

>  > If enabled via the mr_largepage module parameter, 
> 
> Why the module parameter?  Is there any reason a user would want to
> turn this off?  Or conversely, why is it off by default?

We're pretty confident this new feature works, but as with all new and 
possibly experimental features, there are chances it might explode your 
machine when activated. So, like with the scaling code, we want the user 
to make the conscious decision of using this code instead of activating it 
by default.
 
>  >  static ssize_t ehca_show_nr_eqs(struct device *dev,
>  >              struct device_attribute *attr,
>  >              char *buf)
>  >  {
>  >     return sprintf(buf, "%d\n", ehca_nr_eqs);
>  >  }
>  > -
>  >  static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL);
> 
> Although trivial, this chunk doesn't really belong in this patch --
> just fix it up in the multiple EQ patch (which I haven't merged yet).

Sure thing.

Regards,
  Joachim


From muli at il.ibm.com  Mon Jul 16 14:40:25 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Tue, 17 Jul 2007 00:40:25 +0300
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <adaodicwasa.fsf@cisco.com>
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com>
	<adaodicwasa.fsf@cisco.com>
Message-ID: <20070716214025.GI4902@rhun.haifa.ibm.com>

On Mon, Jul 16, 2007 at 09:52:53AM -0700, Roland Dreier wrote:

>  > This will be very painful and frankly I don't think the pain is
>  > justified. Can't you confine the changes to the IB layer so that
>  > the mapping happens through dma_alloc_coherent if you need
>  > coherent/consistent memory rather than through dma_map_sg?
> 
> The memory being dealt with here is buffers that are only used by
> the device and userspace.  And the problem being solved is not
> really that the memory needs to be coherent -- it is just that on
> Altix, using coherent memory turns on another side effect that DMAs
> to that memory flush other in-flight DMAs to other memory.
> 
> So there are several reasons I don't like using dma_alloc_coherent()
> to allocate this memory, and then mapping it into userspace (rather
> than having userspace allocate it and then map it to the device, as
> these patches do):
> 
>  - dma_alloc_coherent() has to allocate kernel address space for
>    memory, and in this case the kernel will never touch the memory.
>    So this is pure waste, and on 32-bit systems, these allocations
>    could easily fail since kernel address space is scarce.

But isn't this an Altix specific issue, which makes the 32-bit issue
moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only
implemented for Altix, which is inarguably ugly).

>  - The property being asked for is not really coherent memory but
>    rather "set the magic bit in the bus address so the Altix chipset
>    flushes other DMAs", and I think it would be cleaner to ask for
>    that explicitly rather than relying on the side effect of
>    coherent memory.

That makes sense. However I didn't quite understand if the above means
that you're ok with the patch posted, or prefer a different (third)
approach?

Cheers,
Muli


From rdreier at cisco.com  Mon Jul 16 14:47:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 14:47:28 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070716214025.GI4902@rhun.haifa.ibm.com> (Muli Ben-Yehuda's
	message of "Tue, 17 Jul 2007 00:40:25 +0300")
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com> <adaodicwasa.fsf@cisco.com>
	<20070716214025.GI4902@rhun.haifa.ibm.com>
Message-ID: <adad4ysuikv.fsf@cisco.com>

 > But isn't this an Altix specific issue, which makes the 32-bit issue
 > moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only
 > implemented for Altix, which is inarguably ugly).

Well, I don't want to have one code path just for Altix and another
for all normal systems.  It kind of goes against the spirit of the DMA
API, which is to provide an abstraction so that drivers can be written
without system-specific details.

 > >  - The property being asked for is not really coherent memory but
 > >    rather "set the magic bit in the bus address so the Altix chipset
 > >    flushes other DMAs", and I think it would be cleaner to ask for
 > >    that explicitly rather than relying on the side effect of
 > >    coherent memory.
 > 
 > That makes sense. However I didn't quite understand if the above means
 > that you're ok with the patch posted, or prefer a different (third)
 > approach?

I'm OK with the main idea, but I don't think adding a "coherent" flag
to the mapping API is the right way to ask for this bit to be set on
Altix.  I can think of two approaches that seem somewhat sane:

 - Add a flag that gets passed in the normal "direction" parameter of
   the DMA mapping APIs, which is ignored on most systems and not set
   by most drivers.  Adds some churn to the internal implementation on
   all archs, though.

or

 - Add new functions dma_map_single_flushing(), dma_map_sg_flushing() and
   dma_map_page_flushing() that are defined to be the same as the
   non-flushing variants except on Altix.  Fairly quick and easy to
   implement, but arguably makes the DMA API even more bloated with
   even more functions that are only slightly different.

Dunno...

 - R.


From akepner at sgi.com  Mon Jul 16 14:56:07 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Mon, 16 Jul 2007 14:56:07 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070716214025.GI4902@rhun.haifa.ibm.com>
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com>
	<adaodicwasa.fsf@cisco.com>
	<20070716214025.GI4902@rhun.haifa.ibm.com>
Message-ID: <20070716215607.GB16538@sgi.com>

On Tue, Jul 17, 2007 at 12:40:25AM +0300, Muli Ben-Yehuda wrote:
> On Mon, Jul 16, 2007 at 09:52:53AM -0700, Roland Dreier wrote:
> ....
> > The memory being dealt with here is buffers that are only used by
> > the device and userspace.  And the problem being solved is not
> > really that the memory needs to be coherent -- it is just that on
> > Altix, using coherent memory turns on another side effect that DMAs
> > to that memory flush other in-flight DMAs to other memory.
> > 
> > So there are several reasons I don't like using dma_alloc_coherent()
> > to allocate this memory, and then mapping it into userspace (rather
> > than having userspace allocate it and then map it to the device, as
> > these patches do):
> > 
> >  - dma_alloc_coherent() has to allocate kernel address space for
> >    memory, and in this case the kernel will never touch the memory.
> >    So this is pure waste, and on 32-bit systems, these allocations
> >    could easily fail since kernel address space is scarce.
> 
> But isn't this an Altix specific issue, which makes the 32-bit issue
> moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only
> implemented for Altix, which is inarguably ugly).
> 

I believe Roland was referring to an alternate solution to the
problem, one that I did before and which didn't involve changing
the dma_map_sg() prototype. Instead, it allocated memory with
dma_alloc_coherent() and then mmap()-ed it into user space. See:
http://lists.openfabrics.org/pipermail/general/2007-January/032218.html


> >  - The property being asked for is not really coherent memory but
> >    rather "set the magic bit in the bus address so the Altix chipset
> >    flushes other DMAs", and I think it would be cleaner to ask for
> >    that explicitly rather than relying on the side effect of
> >    coherent memory.
> 
> That makes sense. However I didn't quite understand if the above means
> that you're ok with the patch posted, or prefer a different (third)
> approach?
> 

Another patchset is imminent. This one won't change the dma_map_sg() 
prototype. It'll pass extra flags (only for IA64_SGI_SN2) in the upper 
bits of the direction argument. Otherwise it's similar to what I posted 
at the start of this thread.
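
A sketch of what "flags in the upper bits" could look like, together
with the validity-check concern raised elsewhere in this thread. None
of this is from the patchset: the DIR_* names, the choice of bit 31
and the helpers are assumptions; only the four direction values are
modeled on enum dma_data_direction.

```c
#include <stdbool.h>

/* Direction values modeled on enum dma_data_direction. */
#define DIR_BIDIRECTIONAL 0u
#define DIR_TO_DEVICE     1u
#define DIR_FROM_DEVICE   2u
#define DIR_NONE          3u

#define DIR_MASK        0x3u         /* low bits carry the direction */
#define DIR_FLAG_FLUSH  (1u << 31)   /* hypothetical extra flag bit */

static unsigned dir_with_flush(unsigned dir) { return dir | DIR_FLAG_FLUSH; }
static unsigned dir_only(unsigned dir)       { return dir & DIR_MASK; }
static bool dir_wants_flush(unsigned dir)    { return (dir & DIR_FLAG_FLUSH) != 0; }

/* Check in the style of valid_dma_direction(): any extra bit makes the
 * value look invalid, which is the breakage to watch out for. */
static bool dir_is_valid_naive(unsigned dir)
{
    return dir == DIR_BIDIRECTIONAL || dir == DIR_TO_DEVICE ||
           dir == DIR_FROM_DEVICE;
}

/* Flag-aware check: reject unknown bits, then test the bare direction. */
static bool dir_is_valid(unsigned dir)
{
    if (dir & ~(DIR_MASK | DIR_FLAG_FLUSH))
        return false;
    return dir_is_valid_naive(dir_only(dir));
}
```

Every place that compares the direction against the plain enum values
would need to mask first, as dir_is_valid() does here.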

-- 
Arthur


From muli at il.ibm.com  Mon Jul 16 14:57:35 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Tue, 17 Jul 2007 00:57:35 +0300
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <adad4ysuikv.fsf@cisco.com>
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com>
	<adaodicwasa.fsf@cisco.com>
	<20070716214025.GI4902@rhun.haifa.ibm.com>
	<adad4ysuikv.fsf@cisco.com>
Message-ID: <20070716215735.GJ4902@rhun.haifa.ibm.com>

On Mon, Jul 16, 2007 at 02:47:28PM -0700, Roland Dreier wrote:

>  - Add a flag that gets passed in the normal "direction" parameter
>    of the DMA mapping APIs, which is ignored on most systems and not
>    set by most drivers.  Adds some churn to the internal
>    implementation on all archs, though.

This will potentially break a bunch of things that assume the only
valid values of direction are NONE, TO_DEVICE, FROM_DEVICE, or BOTH
(e.g., include/linux/dma-mapping.h:valid_dma_direction()).

> or
> 
>  - Add new functions dma_map_single_flushing(),
>    dma_map_sg_flushing() and dma_map_page_flushing() that are
>    defined to be the same as the non-flushing variants except on
>    Altix.  Fairly quick and easy to implement, but arguably makes
>    the DMA API even more bloated with even more functions that are
>    only slightly different.
> 
> Dunno...

Looks like we need to harness the collective power of lkml. Better
hope everyone isn't distracted by the flamewar-du-jour.

Cheers,
Muli


From akepner at sgi.com  Mon Jul 16 15:00:54 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Mon, 16 Jul 2007 15:00:54 -0700
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070716215735.GJ4902@rhun.haifa.ibm.com>
References: <20070715212146.GF6921@sgi.com>
	<20070716073435.GE3530@rhun.haifa.ibm.com>
	<adaodicwasa.fsf@cisco.com>
	<20070716214025.GI4902@rhun.haifa.ibm.com>
	<adad4ysuikv.fsf@cisco.com>
	<20070716215735.GJ4902@rhun.haifa.ibm.com>
Message-ID: <20070716220054.GC16538@sgi.com>

On Tue, Jul 17, 2007 at 12:57:35AM +0300, Muli Ben-Yehuda wrote:

> This will potentially break a bunch of things that assume the only
> valid values of direction are NONE, TO_DEVICE, FROM_DEVICE, or BOTH
> (e.g., include/linux/dma-mapping.h:valid_dma_direction()).

I think I've got this covered in a reasonably unobjectionable 
way.

> ....
> Looks like we need to harness the collective power of lkml. Better
> hope everyone isn't distracted by the flamewar-du-jour.
> 

I'm preparing my asbestos suit now.

-- 
Arthur


From xma at us.ibm.com  Mon Jul 16 15:16:00 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 16 Jul 2007 15:16:00 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <ada644kxpl3.fsf@cisco.com>
Message-ID: <OFB0AD8095.2590C954-ON8725731A.0079EF06-8825731A.004E5DEA@us.ibm.com>


Hello Roland,

Roland Dreier <rdreier at cisco.com> wrote on 07/16/2007 09:47:52 AM:

>  >         FYI, we are working on several IPoIB performance improvement
>  > patches which are not on the list. Some of the patches are under test,
>  > some of the patches are going to be submitted soon. They are:
>
> There is less than a week left in the merge window, and none of these
> changes has been reviewed yet.  So being realistic, I don't think we
> can expect to get any of this into 2.6.23.
>
>  - R.

      Yes, most of the patches depend on IPoIB-CM no-SRQ support. We
can't submit them for review without a full performance matrix test
(Mellanox, Galaxy1 ... for both UD/RC modes).

Thanks
Shirley

From swise at opengridcomputing.com  Mon Jul 16 16:06:56 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 16 Jul 2007 18:06:56 -0500
Subject: [ofa-general] problem with daily builds
Message-ID: <469BFA10.7070209@opengridcomputing.com>

Vlad,

It appears the daily ofa_1_2_kernel builds are not building the latest 
code from the ofed_1_2 git tree.  For example, I pulled down the 
ofa_1_2_kernel-20070716-0200 tree and the file 
drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 git 
repository.

Here's the BUILD_ID from that tree.  Note it's the wrong git repository...

# cat BUILD_ID
Git:
git://git.openfabrics.org/ofed_1_2/linux-2.6.git
commit 556f7870719506619990a58fddb3fd9eab4b9990

What's up?


Steve.


From dledford at redhat.com  Mon Jul 16 20:29:28 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 03:29:28 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <469B639A.1090804@dev.mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
Message-ID: <1184642968.5165.414.camel@firewall.xsintricity.com>

On Mon, 2007-07-16 at 15:24 +0300, Vladimir Sokolovsky wrote:

[ snip ]

Most of this proposal was just about splitting the packages up.  That's
good, but it doesn't warrant much comment.  It's not the existence of
different packages that will draw my comments, it's the *content* of the
packages that will, the actual spec files themselves.

However, there is this one tidbit:

>      Source RPMs will be created for each userspace package in the following way:
> 
>      git clone ...
>      autogen.sh
>      configure --disable-libcheck
>      make dist

This is so fundamentally broken as to be brain dead.  Yet it's what has
been done since OFED 1.0.  Can you imagine how screwed the open source
world would be if one day Linus released kernel-2.6.24.tar.gz on
kernel.org, only to silently update the file kernel-2.6.24.tar.gz to
something else the next day?  This is the *ONLY* open source software group
I know of that creates new tar.gz files any time they make a change but
keeps the version of the file the same.

Let me copy and paste an email conversation I had with Or that
highlights why this is broken:

------- Begin cut-n-paste
On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote: 
> [sorry for breaking the thread, I am working from home now and unable
> to use normal mailer.]
>
> Does this mean that the OFED 1.3 effort is useless from your point of
> view?

Yes and no.  The effort to get a complete set of working libraries and
stacks pulled together and debugged is good and worthwhile.  The
packaging has been done all wrong though.  Because the ewg has
concentrated on supporting local compile installations, they don't
really have the faintest clue about several important issues that crop
up specifically when you are attempting to support binary distribution
instead of source distribution.  That in turn has led them to make
decisions that have proved to be very counterproductive to my end goal
of a supportable environment for my customers.

Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
have two different versions of dapl, but with exactly the same version
number.  A person can't tell them apart.  Furthermore, unless the person
is compiling locally, they'll never get the OFED 1.1 dapl installed
because RPM/up2date will see that they already have the current version
even when they have the OFED 1.0 version.  So, in our RPMs, I updated
the OFED 1.1 dapl version we built to be 1.2.1.  Without doing that, the
binary upgrade process that we use would have never worked.  Then, in
OFED 1.2, you guys update the dapl code again, and this time you decide
to use...wait for it...that's right, 1.2.1.  Great.  Now we have a
conflict between your 1.2.1 and our 1.2.1.  How do people know which is
which?  They don't.  And, of course, in order for binary upgrades to
work, I once again had to bump our number.  Our OFED 1.2 package now
builds dapl 1.2.1.1 just because I had to do *something* in order to
make upgrades work.
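
The dapl numbers above can be run through a simplified comparison to
see why the upgrade never happens. This is only a sketch: vercmp() and
needs_upgrade() are invented names, and the function handles purely
numeric dotted versions, whereas RPM's real comparison also considers
epoch, release and non-numeric segments.

```c
#include <stdlib.h>

/* Compare two dotted numeric version strings.
 * Returns <0 if a is older, 0 if equal, >0 if a is newer. */
static int vercmp(const char *a, const char *b)
{
    for (;;) {
        char *ea, *eb;
        unsigned long na = strtoul(a, &ea, 10);
        unsigned long nb = strtoul(b, &eb, 10);
        int a_has = (ea != a);   /* did we read a segment from a? */
        int b_has = (eb != b);

        if (!a_has && !b_has)
            return 0;            /* both exhausted: equal */
        if (!a_has)
            return -1;           /* b has an extra segment: b is newer */
        if (!b_has)
            return 1;
        if (na != nb)
            return na < nb ? -1 : 1;
        a = (*ea == '.') ? ea + 1 : ea;
        b = (*eb == '.') ? eb + 1 : eb;
    }
}

/* A binary updater only replaces a package with a strictly newer one. */
static int needs_upgrade(const char *installed, const char *candidate)
{
    return vercmp(candidate, installed) > 0;
}
```

Comparing "1.2" (OFED 1.0) against "1.2" (OFED 1.1) yields equality,
so nothing is upgraded; bumping to "1.2.1" and later "1.2.1.1" is what
makes each step visible to the updater.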

The only reason that the OFED distribution has *ever* reliably installed
the rpms you wanted installed is because you compile things locally and
then *force* the upgrade of rpms over the top of older rpms that have
the same version number.  And even then, you yourselves can't tell the
difference between a customer with the OFED 1.0 or OFED 1.1 dapl
installed by checking the RPM version, you just have to go off what the
end user *tells* you he installed and hope he's right.

So, quite simply, the EWG has *chosen* to support source distribution
and local compiles.  That's fine really.  But they've also chosen to
bury their head in the sand about basic, non-flexible rules associated
with any successful binary distribution and update process, even when
I've brought those rules up multiple times.

It should be no wonder then why I get all up in arms about packaging
issues.  Everything I give my customers has to automatically and
correctly install, upgrade, downgrade, delete, verify, etc using
RPM/up2date/yum.  It can't require any --force options.  And I don't
have a choice about that.

And I have to *know* what software my customer is running in order to
support them.  Because you guys have done things the way you have, I
can't know that.  I might be able to know if I could also guarantee they
didn't download and locally compile your packages, but if they did, then
the same version number of RPM can mean two different things entirely
depending on whether it's your RPM or mine.

------- End cut-n-paste

I posted links to a wealth of valuable information on the topic of
making a proper spec file and creating *good* packages during my talk at
Sonoma.  I gather you haven't read those or you never would have
suggested the above for creating the RPMs.

I've already reached the decision that the next release of the RDMA
stack that Red Hat releases will adhere to much stricter guidelines than
in the past.  From now on, all packages I build based upon software from
the OpenFabrics Alliance will adhere to these guidelines:

1.  All tar.gz files will be imported once and exactly once into our SCM
repo.  At that point they will be MD5 summed and the MD5 sum will be
checked on all subsequent builds to verify the tar.gz file has not
changed.

2.  All fixes to released tar.gz files will be in the form of patches
applied in the spec file during the %prep phase, or they will require a
new tar.gz file.  Under no circumstances will an existing tar.gz file be
updated to include fixes.

3.  All packages will have a version and release number appropriate to
the tar.gz release of the software and the build of the package.

4.  All tar.gz files *must* have a publicly available URL from which
they can be downloaded.

5.  All tar.gz files that have a home site other than openfabrics.org
will be taken from their home site.  Eg. openmpi will come from the
openmpi site.  No special openfabrics versions of already existing
packages will be considered.

6.  All source repos that utilize the autoconf configure capability will
have configure run at build time.  Any configure output produced prior
to build will not be considered usable.  On the other hand, we expect
that autogen.sh *will* be run prior to making the tarball.

If the software does not meet the above minimal guidelines, then it
won't be considered for inclusion in our product.

In addition to these rules, if you want me to consider using your spec
files to build the packages, then these additional rules apply:

7.  All spec changes will be accompanied by a changelog entry.

8.  The spec file will be clean and readable.  Spec files cluttered up
with multitudes of options that have no impact on a standardized
distribution will not be included.

9.  All spec files must be compliant with the Filesystem Hierarchy
Standard (FHS).

10. All spec files must pass rpmlint tests.

11. All code must be built using the %build section of the spec file.
The %install section is for installation *ONLY*.

12. Spec files must build debug packages.

13. Spec files must leave the default build scripts enabled.

14. Spec files must list appropriate BuildRequires entries.

15. Spec files must not list Provides entries unless the build scripts
are unable to determine that they provide a particular item, or in cases
like the MPI packages where they can specify that they provide the
generic mpi facility in addition to the specific mpi library provides
that the build system will pick up automatically.

There are probably more things to list, but I really don't feel like
repeating what amounts to our standard build requirements when they are
already all written out in the guides I talked about in my Sonoma talk.

Hopefully, all of this will help you get a clearer picture of what I
expect the EWG's work on cleaning up their packaging to cover.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From rdreier at cisco.com  Mon Jul 16 20:48:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 20:48:40 -0700
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>
	(Hoang-Nam Nguyen's message of "Mon, 16 Jul 2007 22:37:44 +0200")
References: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>
Message-ID: <ada1wf7vgfb.fsf@cisco.com>

 > No, I've no figures to provide here. The background of this dist_eqs
 > option is actually to allow us testing across all event queues
 > without to change the testcases resp consumers to use certain
 > event queue number. Thus, I should comment it as EXPERIMENTAL?

Seems like it's just development/testing code that shouldn't escape
into the wild?

 > > I think I would rather hold off on multiple EQs for this merge window
 > > and plan on having something really solid and thought-out for 2.6.24.

 > Fair enough. However why don't let us gather experience with this
 > feature now? Should we remove dist_eqs option for more consistency?

As I said I definitely think the dist_eqs switch doesn't sound like
something we want to expose to people.

With that said I still am not sure about putting the multiple EQs
feature in this release.  All the infrastructure is there to make
experimenting with it fairly painless (just the low-level driver needs
to change), and I still haven't seen much code using the feature or
even any anecdotal information about the performance impact.


From rdreier at cisco.com  Mon Jul 16 20:50:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 16 Jul 2007 20:50:13 -0700
Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs
In-Reply-To: <OFD7499E67.C6B5EA9B-ONC125731A.00715FED-C125731A.007484A7@de.ibm.com>
	(Joachim Fenkes's message of "Mon, 16 Jul 2007 23:11:47 +0200")
References: <OFD7499E67.C6B5EA9B-ONC125731A.00715FED-C125731A.007484A7@de.ibm.com>
Message-ID: <adatzs3u1sa.fsf@cisco.com>

 > > Why the module parameter?  Is there any reason a user would want to
 > > turn this off?  Or conversely, why is it off by default?
 > 
 > We're pretty confident this new feature works, but as with all new and 
 > possibly experimental features, there are chances it might explode your 
 > machine when activated. So, like with the scaling code, we want the user 
 > to make the conscious decision of using this code instead of activating it 
 > by default.

OK, I guess.  So can we expect to, say, change the default to turning
it on for 2.6.24 and remove the option entirely (so it's always on) in
2.6.25?

 - R.


From kliteyn at mellanox.co.il  Mon Jul 16 21:37:59 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 17 Jul 2007 07:37:59 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-17:normal completion
Message-ID: <MTLEXCH01NQptgFrYIL0000157e@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From mst at dev.mellanox.co.il  Mon Jul 16 21:37:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 07:37:40 +0300
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <ada1wf7vgfb.fsf@cisco.com>
References: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>
	<ada1wf7vgfb.fsf@cisco.com>
Message-ID: <20070717043740.GB8527@mellanox.co.il>

> I still haven't seen much code using the feature or
> even any anecdotal information about the performance impact.

Here's some anecdotal evidence :)
http://lists.openfabrics.org/pipermail/general/2007-May/035758.html

-- 
MST


From erezz at voltaire.com  Mon Jul 16 22:10:46 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 17 Jul 2007 08:10:46 +0300
Subject: [ofa-general] Re: [PATCH] IB/iser: Make a couple of functions static
In-Reply-To: <ada644kw8nn.fsf@cisco.com>
References: <ada644kw8nn.fsf@cisco.com>
Message-ID: <469C4F56.6060404@voltaire.com>

Roland Dreier wrote:

> Make iser_conn_release() and iser_start_rdma_unaligned_sg() static,
> since they are only used in the .c file where they are defined.  In
> addition to being a cleanup, this even shrinks the generated code by
> allowing the single call of iser_start_rdma_unaligned_sg() to be
> inlined into its callsite.  On x86_64:
>
> add/remove: 0/1 grow/shrink: 1/0 up/down: 466/-533 (-67)
> function                                     old     new   delta
> iser_reg_rdma_mem                           1518    1984    +466
> iser_start_rdma_unaligned_sg                 533       -    -533
>
> Signed-off-by: Roland Dreier <rolandd at cisco.com>
> ---
> Erez, does this look OK to merge for 2.6.23?
>   

Yes, thanks.

Erez


From xma at us.ibm.com  Mon Jul 16 22:57:40 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 16 Jul 2007 22:57:40 -0700
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple
	event queues
In-Reply-To: <ada1wf7vgfb.fsf@cisco.com>
Message-ID: <OF19C85704.39162967-ON8725731B.002061DC-8825731A.0078A20B@us.ibm.com>


Hello Roland,

>I still haven't seen much code using the feature or
>even any anecdotal information about the performance impact.

Multiple-link performance improved significantly in the prototype
IPoIB-UD mode test for the eHCA driver, especially for two links
on the same adapter. I haven't tried mthca (PCI-X and PCI-E) yet.

Thanks
Shirley

From mst at dev.mellanox.co.il  Mon Jul 16 23:21:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 09:21:59 +0300
Subject: [ofa-general] Re: [PATCH] IB/mad: fix duplicated kernel thread name
In-Reply-To: <adatzs4wasb.fsf@cisco.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<adatzs4wasb.fsf@cisco.com>
Message-ID: <20070717062159.GA2177@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/mad: fix duplicated kernel thread name
> 
>  > The mad module creates thread per active port where the thread name is
>  > derived from the port name. This cause different threads to have same
>  > names when there are multiple devices. Fix that by using both the device
>  > and the port numbers to derive the name.
> 
> What problem does the duplicate name cause in the first place?

I don't really see any serious problem this would cause.

However, creating a thread per port does seem
somewhat arbitrary, and would mean wasting (a small amount of) resources
apparently for no gain if there are lots of HCA ports in a box.

Further, renicing the mad thread to work around bug 229
<https://bugs.openfabrics.org/show_bug.cgi?id=229>
is easier if there's a fixed number of threads:
as it is, the threads come and go on hotplug,
so the renicing must be repeated on each hotplug event.
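
For illustration, the "renice again after every hotplug" chore could look roughly like the sketch below. It rests on assumptions that are not confirmed in this thread: that the MAD threads appear with names starting with "ib_mad", that the kernel exposes /proc/<pid>/comm, and that the niceness value is chosen by the admin. All function names are hypothetical.

```python
import os


def read_thread_names(proc_root="/proc"):
    """Map pid -> command name by reading <proc_root>/<pid>/comm."""
    names = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names[int(entry)] = handle.read().strip()
        except OSError:
            # The task exited between listdir() and open(); skip it.
            continue
    return names


def select_pids(names_by_pid, prefix="ib_mad"):
    """Return the sorted pids whose name starts with `prefix` (assumed thread name)."""
    return sorted(pid for pid, name in names_by_pid.items() if name.startswith(prefix))


def renice(pids, niceness):
    """Set the nice value of each pid; lowering it needs root privileges."""
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

With per-port threads that come and go, something like renice(select_pids(read_thread_names()), niceness) has to be rerun from a hotplug hook; with a fixed set of threads it would only need to run once after module load.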
 
-- 
MST


From FENKES at de.ibm.com  Mon Jul 16 23:29:54 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 17 Jul 2007 08:29:54 +0200
Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs
In-Reply-To: <adatzs3u1sa.fsf@cisco.com>
Message-ID: <OF38142F5F.0D12E609-ONC125731B.00239A1F-C125731B.0023C7C0@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 17.07.2007 05:50:13:

>  > > Why the module parameter?  Is there any reason a user would want to
>  > > turn this off?  Or conversely, why is it off by default?
>  > 
>  > We're pretty confident this new feature works, but as with all new and
>  > possibly experimental features, there are chances it might explode your
>  > machine when activated. So, like with the scaling code, we want the user
>  > to make the conscious decision of using this code instead of activating it
>  > by default.
> 
> OK, I guess.  So can we expect to, say, change the default to turning
> it on for 2.6.24 and remove the option entirely (so it's always on) in
> 2.6.25?

Deal.

Joachim


From jackm at dev.mellanox.co.il  Mon Jul 16 23:55:10 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Jul 2007 09:55:10 +0300
Subject: [ofa-general] Re: [PATCH 1 of 2]  mlx4: implement query-qp
In-Reply-To: <adasl7oxq1r.fsf@cisco.com>
References: <200706211227.47794.jackm@dev.mellanox.co.il>
	<200707151028.24013.jackm@dev.mellanox.co.il>
	<adasl7oxq1r.fsf@cisco.com>
Message-ID: <200707170955.10933.jackm@dev.mellanox.co.il>

On Monday 16 July 2007 19:37, Roland Dreier wrote:
> this was a patch to a patch, which is not very useful (especially
> since the original patch is upstream in Linus's tree).
> 
> anyway I applied this as two patches...
> 
Thanks for applying it.  I sent it to you as a patch to a patch because
I thought the change would be much more obvious to you this way.

Would you rather next time that I just send you an updated version of the original patch,
or should I send the fix as a patch to the code after the original patch has been applied?

- Jack


From mst at dev.mellanox.co.il  Tue Jul 17 02:24:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 12:24:27 +0300
Subject: [ofa-general] cxgb3: ofed patches vs upstream?
Message-ID: <20070717092427.GA16698@mellanox.co.il>

Steve,
since ofed 1.2 release, the following patches were applied,
at your request, to cxgb3 on ofed_1_2 support branch:

git log --pretty=short vofed-1.2.. drivers/infiniband/hw/cxgb3/
commit 1b7184a542c709b2c54a9cd4cab06953481991fd
Author: Steve Wise <swise at opengridcomputing.com>

    Don't allow interrupts while obtaining the ctrl-qp mutex.

commit 7aaef231e8ba8c6f7b021f495f9769afc4cf46ff
Author: Steve Wise <swise at opengridcomputing.com>

    iw_cxgb3: Don't abort after failures sending the mpa reply.

commit 12ed1ec920e4cc3d2c1e32afa49f1dc611d8f1f1
Author: Steve Wise <swise at opengridcomputing.com>

    iw_cxgb3: Don't post TID_RELEASE message.

commit 1c3d43ff4f544fa202f4fe53962130a2a21e1a58
Author: Steve Wise <swise at opengridcomputing.com>

    iw_cxgb3: ctrl-qp init/clear shouldn't set the gen bit.

Could you please comment on where these patches are wrt upstream
submission?
Are these patches already in 2.6.22, or are they queued for 2.6.23?
If neither, could you post the missing patches on list please?

-- 
MST


From vlad at lists.openfabrics.org  Tue Jul 17 02:45:36 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 17 Jul 2007 02:45:36 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070717-0200 daily build status
Message-ID: <20070717094536.75649E6085C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22-rc7
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From ogerlitz at voltaire.com  Tue Jul 17 03:05:07 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 17 Jul 2007 13:05:07 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469B9B5A.2040707@ichips.intel.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com>
Message-ID: <469C9453.80905@voltaire.com>

Sean Hefty wrote:
>> Sorry but "improve data locality" is not enough information for me to 
>> understand why the IB CM --needs-- to spawn n kernel threads on my 
>> n-core system, after all it's slow path and the data does not move on 
>> QP1, what's the story here? and if it needs thread-per-cpu, why not 
>> use the system threads/softirqs as does the TCP/IP stack connection 
>> mgmt code?
> 
> IMO, if we're going to have multiple cores, then we should create 
> multiple threads to use them.  This becomes more important as the number 
> of cores increases.  (The overhead of a non-running thread can't be that 
> much.)  

Sean,

Can you explain why the IB CM would not use the thread context provided 
by the mad layer?

Second, if the CM needs a different context why not use the system 
threads? I understood from Michael's reply that the CM code relies on 
some thread/queue flushing at the time of CM ID destruction. Is that an 
implementation issue that can change? If not, can't one dedicated thread 
do the job?

Or.


From jackm at dev.mellanox.co.il  Tue Jul 17 03:11:43 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 17 Jul 2007 13:11:43 +0300
Subject: [ofa-general] [PATCH] mlx4: increase max outstanding rdma reads per
	qp
Message-ID: <200707171311.43680.jackm@dev.mellanox.co.il>

Change max outstanding rdma reads per QP from 4 to 16.
This enables an improvement in latency for rdma-read applications.

Pointed out by Dotan Barak and Sagi Rotem.
Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

Index: connectx_kernel/drivers/net/mlx4/main.c
===================================================================
--- connectx_kernel.orig/drivers/net/mlx4/main.c	2007-07-11 11:55:52.000000000 +0300
+++ connectx_kernel/drivers/net/mlx4/main.c	2007-07-11 11:59:54.000000000 +0300
@@ -78,7 +78,7 @@
 static struct mlx4_profile default_profile = {
 	.num_qp		= 1 << 16,
 	.num_srq	= 1 << 16,
-	.rdmarc_per_qp	= 4,
+	.rdmarc_per_qp	= 1 << 4,
 	.num_cq		= 1 << 16,
 	.num_mcg	= 1 << 13,
 	.num_mpt	= 1 << 17,


From jsquyres at cisco.com  Tue Jul 17 04:12:01 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 17 Jul 2007 07:12:01 -0400
Subject: [ofa-general] Re: [ewg] Re: RFC OFED-1.3 installation
In-Reply-To: <1184642968.5165.414.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
Message-ID: <6435E9B8-DB00-4249-A616-2C4ABADBE6AD@cisco.com>

All: You may have skipped this mail because of its length.  Please  
read it; Doug lists specific guidelines in here that SRPMs will need  
to adhere to for RH to include OFED v1.3 (many of the OFED RPMs --  
including the OMPI RPM -- do not adhere to these guidelines; we all  
have work to do).


On Jul 16, 2007, at 11:29 PM, Doug Ledford wrote:

> On Mon, 2007-07-16 at 15:24 +0300, Vladimir Sokolovsky wrote:
>
> [ snip ]
>
> Most of this proposal was just about splitting the packages up.  That's
> good, but it doesn't warrant much comment.  It's not the existence of
> different packages that will draw my comments, it's the *content* of the
> packages that will, the actual spec files themselves.
>
> However, there is this one tidbit:
>
>>      Source RPMs will be created for each userspace package in the following way:
>>
>>      git clone ...
>>      autogen.sh
>>      configure --disable-libcheck
>>      make dist
>
> This is so fundamentally broken as to be brain dead.  Yet it's what has
> been done since OFED 1.0.  Can you imagine how screwed the open source
> world would be if one day Linus released kernel-2.6.24.tar.gz on
> kernel.org, only to silently update the file kernel-2.6.24.tar.gz to
> something else the next?  This is the *ONLY* open source software group
> I know of that creates new tar.gz files any time they make a change but
> keeps the version of the file the same.
>
> Let me copy and paste an email conversation I had with Or that
> highlights why this is broken:
>
> ------- Begin cut-n-paste
> On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote:
>> [sorry for breaking the thread, I am working from home now and unable
>> to use normal mailer.]
>>
>> Does this mean that the OFED 1.3 effort is useless from your point of
>> view?
>
> Yes and no.  The effort to get a complete set of working libraries and
> stacks pulled together and debugged is good and worthwhile.  The
> packaging has been done all wrong though.  Because the ewg has
> concentrated on supporting local compile installations, they don't
> really have the faintest clue about several important issues that crop
> up specifically when you are attempting to support binary distribution
> instead of source distribution.  That in turn has led them to make
> decisions that have proved to be very counterproductive to my end goal
> of a supportable environment for my customers.
>
> Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> have two different versions of dapl, but with exactly the same version
> number.  A person can't tell them apart.  Furthermore, unless the person
> is compiling locally, they'll never get the OFED 1.1 dapl installed
> because RPM/up2date will see that they already have the current version
> even when they have the OFED 1.0 version.  So, in our RPMs, I updated
> the OFED 1.1 dapl version we built to be 1.2.1.  Without doing that, the
> binary upgrade process that we use would have never worked.  Then, in
> OFED 1.2, you guys update the dapl code again, and this time you decide
> to use...wait for it...that's right, 1.2.1.  Great.  Now we have a
> conflict between your 1.2.1 and our 1.2.1.  How do people know which is
> which?  They don't.  And, of course, in order for binary upgrades to
> work, I once again had to bump our number.  Our OFED 1.2 package now
> builds dapl 1.2.1.1 just because I had to do *something* in order to
> make upgrades work.
>
> The only reason that the OFED distribution has *ever* reliably installed
> the rpms you wanted installed is because you compile things locally and
> then *force* the upgrade of rpms over the top of older rpms that have
> the same version number.  And even then, you yourselves can't tell the
> difference between a customer with the OFED 1.0 or OFED 1.1 dapl
> installed by checking the RPM version, you just have to go off what the
> end user *tells* you he installed and hope he's right.
>
> So, quite simply, the EWG has *chosen* to support source distribution
> and local compiles.  That's fine really.  But they've also chosen to
> bury their head in the sand about basic, non-flexible rules associated
> with any successful binary distribution and update process, even when
> I've brought those rules up multiple times.
>
> It should be no wonder then why I get all up in arms about packaging
> issues.  Everything I give my customers has to automatically and
> correctly install, upgrade, downgrade, delete, verify, etc using
> RPM/up2date/yum.  It can't require any --force options.  And I don't
> have a choice about that.
>
> And I have to *know* what software my customer is running in order to
> support them.  Because you guys have done things the way you have, I
> can't know that.  I might be able to know if I could also guarantee they
> didn't download and locally compile your packages, but if they did, then
> the same version number of RPM can mean two different things entirely
> depending on whether it's your RPM or mine.
>
> ------- End cut-n-paste
>
> I posted links to a wealth of valuable information on the topic of
> making a proper spec file and creating *good* packages during my talk at
> Sonoma.  I gather you haven't read those or you never would have
> suggested the above for creating the RPMs.
>
> I've already reached the decision that the next release of the RDMA
> stack that Red Hat releases will adhere to much stricter guidelines than
> in the past.  From now on, all packages I build based upon software from
> the OpenFabrics Alliance will adhere to these guidelines:
>
> 1.  All tar.gz files will be imported once and exactly once into our SCM
> repo.  At that point they will be MD5 summed and the MD5 sum will be
> checked on all subsequent builds to verify the tar.gz file has not
> changed.
>
> 2.  All fixes to released tar.gz files will be in the form of patches
> applied in the spec file during the %prep phase, or they will require a
> new tar.gz file.  Under no circumstances will an existing tar.gz file be
> updated to include fixes.
>
> 3.  All packages will have a version and release number appropriate to
> the tar.gz release of the software and the build of the package.
>
> 4.  All tar.gz files *must* have a publicly available URL from which
> they can be downloaded.
>
> 5.  All tar.gz files that have a home site other than openfabrics.org
> will be taken from their home site.  Eg. openmpi will come from the
> openmpi site.  No special openfabrics versions of already existing
> packages will be considered.
>
> 6.  All source repos that utilize the autoconf configure capability will
> have configure run at build time.  Any configure output produced prior
> to build will not be considered usable.  On the other hand, we expect
> that autogen.sh *will* be run prior to making the tarball.
>
> If the software does not meet the above minimal guidelines, then it
> won't be considered for inclusion in our product.
>
> In addition to these rules, if you want me to consider using your spec
> files to build the packages, then these additional rules apply:
>
> 7.  All spec changes will be accompanied by a changelog entry.
>
> 8.  The spec file will be clean and readable.  Spec files cluttered up
> with multitudes of options that have no impact on a standardized
> distribution will not be included.
>
> 9.  All spec files must be Linux File Hierarchy Standards compliant.
>
> 10. All spec files must pass rpmlint tests.
>
> 11. All code must be built using the %build section of the spec file.
> The %install section is for installation *ONLY*.
>
> 12. Spec files must build debug packages.
>
> 13. Spec files must leave the default build scripts enabled.
>
> 14. Spec files must list appropriate BuildRequires entries.
>
> 15. Spec files must not list Provides entries unless the build scripts
> are unable to determine that they provide a particular item, or in cases
> like the MPI packages where they can specify that they provide the
> generic mpi facility in addition to the specific mpi library provides
> that the build system will pick up automatically.
>
> There's probably more things to list, but I really don't feel like
> repeating what amounts to our standard build requirements when they are
> already all written out in the guides I talked about in my Sonoma talk.
>
> Hopefully, all of this will help you get a clearer picture of what I
> expect the EWG's work on cleaning up their packaging to cover.
>
> -- 
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
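
The dapl example in the quoted message (1.2 reshipped as 1.2, then 1.2.1, then 1.2.1.1) comes down to how a package manager decides whether a candidate package is newer than the installed one. A minimal sketch, assuming purely numeric dotted versions: real RPM also compares epoch and release and handles alphabetic segments, none of which is modeled here, and the function names are made up for illustration.

```python
def parse_version(version):
    """Turn a dotted numeric version such as '1.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_upgrade(installed, candidate):
    """True only if `candidate` sorts strictly after `installed`."""
    return parse_version(candidate) > parse_version(installed)


# OFED 1.0 and OFED 1.1 both shipped dapl "1.2": the changed code is never
# installed by a binary upgrade because the version did not move.
same_version_blocks_upgrade = not is_upgrade("1.2", "1.2")

# Bumping the version (1.2 -> 1.2.1 -> 1.2.1.1) is what makes upgrades work.
bumps_allow_upgrade = is_upgrade("1.2", "1.2.1") and is_upgrade("1.2.1", "1.2.1.1")
```

Two different builds that both call themselves 1.2.1 compare as equal, so neither replaces the other without --force, and the version string alone cannot say which one a customer has installed.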


-- 
Jeff Squyres
Cisco Systems


From swise at opengridcomputing.com  Tue Jul 17 06:20:38 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 08:20:38 -0500
Subject: [ofa-general] Re: cxgb3: ofed patches vs upstream?
In-Reply-To: <20070717092427.GA16698@mellanox.co.il>
References: <20070717092427.GA16698@mellanox.co.il>
Message-ID: <469CC226.4000703@opengridcomputing.com>

All of these are upstream for 2.6.23.


Michael S. Tsirkin wrote:
> Steve,
> since ofed 1.2 release, the following patches were applied,
> at your request, to cxgb3 on ofed_1_2 support branch:
> 
> git log --pretty=short vofed-1.2.. drivers/infiniband/hw/cxgb3/
> commit 1b7184a542c709b2c54a9cd4cab06953481991fd
> Author: Steve Wise <swise at opengridcomputing.com>
> 
>     Don't allow interrupts while obtaining the ctrl-qp mutex.
> 
> commit 7aaef231e8ba8c6f7b021f495f9769afc4cf46ff
> Author: Steve Wise <swise at opengridcomputing.com>
> 
>     iw_cxgb3: Don't abort after failures sending the mpa reply.
> 
> commit 12ed1ec920e4cc3d2c1e32afa49f1dc611d8f1f1
> Author: Steve Wise <swise at opengridcomputing.com>
> 
>     iw_cxgb3: Don't post TID_RELEASE message.
> 
> commit 1c3d43ff4f544fa202f4fe53962130a2a21e1a58
> Author: Steve Wise <swise at opengridcomputing.com>
> 
>     iw_cxgb3: ctrl-qp init/clear shouldn't set the gen bit.
> 
> Could you please comment on where are these patches wrt upstream
> submission?
> Are these patches already in 2.6.22, or are they queued for 2.6.23?
> If neither, could you post the missing patches on list please?
> 


From vlad at mellanox.co.il  Tue Jul 17 06:28:51 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 17 Jul 2007 16:28:51 +0300
Subject: [ofa-general] RE: problem with daily builds
References: <469BFA10.7070209@opengridcomputing.com>
	<20070717115701.GI16698@mellanox.co.il>
	<469CC308.9050101@opengridcomputing.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com>

Hi Steve,
Some ofa_1_2_c_kernel builds were mistakenly placed under the
ofa_1_2_kernel build tree.
I am fixing this right now...

ofa_1_2_kernel daily builds were stopped after the OFED-1.2 release.
I can resume them on a daily or weekly basis.


Regards,
Vladimir


> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Tuesday, July 17, 2007 4:24 PM
> To: Michael S. Tsirkin
> Cc: Vladimir Sokolovsky
> Subject: Re: problem with daily builds
> 
> Michael S. Tsirkin wrote:
> >> Quoting Steve Wise <swise at opengridcomputing.com>:
> >> Subject: problem with daily builds
> >>
> >> Vlad,
> >>
> >> It appears the daily ofa_1_2_kernel builds are not building the latest
> >> code from the ofed_1_2 git tree.  For example, I pulled down the
> >> ofa_1_2_kernel-20070716-0200 tree and the file
> >> drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 git
> >> repository.
> >>
> >> Here's the BUILD_ID from that tree.  Note it's the wrong git repository...
> >>
> >> # cat BUILD_ID
> >> Git:
> >> git://git.openfabrics.org/ofed_1_2/linux-2.6.git
> >> commit 556f7870719506619990a58fddb3fd9eab4b9990
> >
> > I think this is not the ofed_1_2 branch, but rather the current 1.2c, which took
> > the chelsio code from 2.6.22.  I did my best to verify that everything is up to
> > date there, but of course it's human to err.  Given that 2.6.22 went out after
> > ofed code freeze - how come version.h there is older?
> >
> 
> Why is the ofed-1.2 daily build using the 1.2c base?  That means we're
> not building the ofed-1.2 post ga code for anybody to use.
> 
> > Steve, I really think if upstream chelsio code is not up to date,
> > you should post patches to update it and we'll put it in 1.2c.
> >
> 
> A set of changes including firmware version bumps didn't make 2.6.22.
> They are in 2.6.23, however.  So the chelsio drivers are up to date in
> ofed-1.2 and 2.6.23.  2.6.22 is missing some changes...
> 
> I suggest you keep the ofed-1.2 chelsio code instead of the 2.6.22 code
> for 1.2c.  Is that possible?
> 
> Steve.


From swise at opengridcomputing.com  Tue Jul 17 06:36:57 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 08:36:57 -0500
Subject: [ofa-general] Re: problem with daily builds
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com>
References: <469BFA10.7070209@opengridcomputing.com>
	<20070717115701.GI16698@mellanox.co.il>
	<469CC308.9050101@opengridcomputing.com>
	<6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com>
Message-ID: <469CC5F9.8080800@opengridcomputing.com>

Vladimir Sokolovsky wrote:
> Hi Steve,
> Some ofa_1_2_c_kernel builds were mistakenly placed under ofa_1_2_kernel
> build tree.
> I am fixing this right now...
> 
> ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release. 
> I can renew this on daily or weekly basis.
> 
> 


What I'm looking for is a current top-of-tree ofed-1.2 or ofa_1_2_kernel 
build that works so I can point customers at that kit since it has a 
slew of chelsio fixes in it...

Steve.




From vlad at mellanox.co.il  Tue Jul 17 06:40:33 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 17 Jul 2007 16:40:33 +0300
Subject: [ofa-general] RE: problem with daily builds
References: <469BFA10.7070209@opengridcomputing.com>
	<20070717115701.GI16698@mellanox.co.il>
	<469CC308.9050101@opengridcomputing.com>
	<6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com>
	<469CC5F9.8080800@opengridcomputing.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com>

http://www.openfabrics.org/builds/ofa_1_2_kernel/ofa_1_2_kernel-20070717-0454.tgz


Regards,
Vladimir


> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Tuesday, July 17, 2007 4:37 PM
> To: Vladimir Sokolovsky
> Cc: Michael S. Tsirkin; OpenFabrics General
> Subject: Re: problem with daily builds
> 
> Vladimir Sokolovsky wrote:
> > Hi Steve,
> > Some ofa_1_2_c_kernel builds were mistakenly placed under
> ofa_1_2_kernel
> > build tree.
> > I am fixing this right now...
> >
> > ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release.
> > I can renew this on daily or weekly basis.
> >
> >
> 
> 
> What I'm looking for is a current top-of-tree ofed-1.2 or
> ofa_1_2_kernel
> build that works so I can point customers at that kit since it has a
> slew of chelsio fixes in it...
> 
> Steve.
> 
> 


From swise at opengridcomputing.com  Tue Jul 17 06:53:07 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 08:53:07 -0500
Subject: [ofa-general] Re: problem with daily builds
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com>
References: <469BFA10.7070209@opengridcomputing.com>
	<20070717115701.GI16698@mellanox.co.il>
	<469CC308.9050101@opengridcomputing.com>
	<6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com>
	<469CC5F9.8080800@opengridcomputing.com>
	<6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com>
Message-ID: <469CC9C3.1000502@opengridcomputing.com>

Vladimir Sokolovsky wrote:
> http://www.openfabrics.org/builds/ofa_1_2_kernel/ofa_1_2_kernel-20070717-0454.tgz
> 

Thanks Vlad!

I'll try this out.


> 


From dotanb at dev.mellanox.co.il  Tue Jul 17 07:58:57 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 17 Jul 2007 17:58:57 +0300
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in
	QP access flags
Message-ID: <200707171758.57442.dotanb@dev.mellanox.co.il>

Remove local write permission enable in QP access flags
(this attribute is being used only for remote permissions).

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 23af7a0..9ffb998 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 		break;
 	case RDMA_TRANSPORT_IWARP:
 		if (!id_priv->cm_id.iw) {
-			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+			qp_attr->qp_access_flags = 0;
 			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
 		} else
 			ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,


From mst at dev.mellanox.co.il  Tue Jul 17 08:25:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 18:25:46 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184642968.5165.414.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
Message-ID: <20070717152546.GA6863@mellanox.co.il>

> Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> have two different versions of dapl, but with exactly the same version
> number.  A person can't tell them apart.

Yes, this sure looks like a problem. I think that versioning needs to be addressed
at the package level, not at OFED level though. Right?

-- 
MST


From vlad at mellanox.co.il  Tue Jul 17 08:36:23 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 17 Jul 2007 18:36:23 +0300
Subject: [ofa-general] RE: RFC OFED-1.3 installation
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com>

[ snip ]
> Let me copy and paste an email conversation I had with Or that
> highlights why this is broken:
> 
> ------- Begin cut-n-paste
> On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote:
> > [sorry for breaking the thread, I am working from home now and unable
> > to use normal mailer.]
> >


> Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not
> a lot, but anything is enough).

I am not supposed to maintain correct versioning for every package in
OFED.
That should be done by the maintainer of each package.

> The only reason that the OFED distribution has *ever* reliably
> installed the rpms you wanted installed is because you compile things
> locally and then *force* the upgrade of rpms over the top of older rpms
> that have the same version number.  And even then, you yourselves can't
> tell the difference between a customer with the OFED 1.0 or OFED 1.1
> dapl installed by checking the RPM version, you just have to go off
> what the end user *tells* you he installed and hope he's right.
> 

OFED does not force an upgrade; it simply removes the previous version
and then installs the new one.
This is why package versioning does not affect OFED installation.
I agree that it is different for Linux distributions and should be fixed
for OFED-1.3, but it should be the responsibility of the package
maintainer.
So all RPM spec files should be fixed for OFED-1.3 and properly
maintained.
We should discuss the kernel-ib package structure and its spec file.

> And I have to *know* what software my customer is running in order to
> support them.  Because you guys have done things the way you have, I
> can't know that.  I might be able to know if I could also guarantee
> they didn't download and locally compile your packages, but if they
> did, then the same version number of RPM can mean two different things
> entirely depending on whether it's your RPM or mine.
> 

You can easily check whether there is an OFED installation by running
'ofed_info'.

> I posted links to a wealth of valuable information on the topic of
> making a proper spec file and creating *good* packages during my talk
> at Sonoma.  I gather you haven't read those or you never would have
> suggested the above for creating the RPMs.
> 

I just looked at your presentation from Sonoma. In it you provide an
example of a management package and your make.dist script for creating
daily builds and releases.

I have some questions about this script:
...
        VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
...
                DATE=`date +%Y%m%d`
                if [ -f $TMPDIR/$target.release ]; then
                        RELEASE=`cat $TMPDIR/$target.release`
                        RELEASE=`expr $RELEASE + 1`
                else
                        RELEASE=1
                fi
                echo $RELEASE > $TMPDIR/$target.release
                RELEASE=0.${RELEASE}.${DATE}git
                TARBALL=$target-git.tgz
        fi
...
        cp -a $target $target-$VERSION
        sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target-$VERSION/$target.spec
        cd $target-$VERSION
        ./autogen.sh
        cd ..
        echo "Creating $TMPDIR/$TARBALL"
        tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION

I thought that the standard way to get a tar.gz file is to use autotools
(3 commands), as I wrote before:
autogen.sh, configure, make dist.
Can you explain why your way is better?
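
For readers following the DATE/RELEASE portion of the make.dist excerpt
above, here is a minimal sketch of the same numbering scheme as a
standalone function. The function name next_snapshot_release, its three
arguments, and the "management" target are placeholders chosen for
illustration only; the counter file name ($target.release) and the
0.<count>.<date>git format follow the excerpt.

```shell
#!/bin/sh
# Sketch only: mirrors the counter-file logic of the quoted excerpt.
# next_snapshot_release is a hypothetical name, not part of make.dist.
next_snapshot_release() {
    dir=$1 target=$2 date=$3
    counter_file=$dir/$target.release
    if [ -f "$counter_file" ]; then
        # A previous snapshot exists: bump the stored counter by one.
        count=$(( $(cat "$counter_file") + 1 ))
    else
        # First snapshot for this target.
        count=1
    fi
    echo "$count" > "$counter_file"
    # The leading "0." marks a pre-release snapshot, as in the excerpt.
    echo "0.$count.${date}git"
}

tmp=$(mktemp -d)
next_snapshot_release "$tmp" management 20070717   # prints 0.1.20070717git
next_snapshot_release "$tmp" management 20070717   # prints 0.2.20070717git
rm -rf "$tmp"
```

Because the counter lives in a file, two snapshots built on the same day
still get distinct release strings.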

Do you have a proposal for daily builds? We need OFED daily builds for
verification.
We can't wait for RedHat updates to get the updated OFED packages.

What OFED-1.3 structure do you propose? Should it consist of source RPMs
or tgz files?
What features should the install script support?

Regards,
Vladimir


From sashak at voltaire.com  Tue Jul 17 09:04:10 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 17 Jul 2007 19:04:10 +0300
Subject: [ofa-general] RFC OFED-1.3 installation
In-Reply-To: <OF81499DC8.2B3ABFBB-ON8725731A.006B0533-8825731A.003F70B0@us.ibm.com>
References: <OF81499DC8.2B3ABFBB-ON8725731A.006B0533-8825731A.003F70B0@us.ibm.com>
Message-ID: <1184688250.10172.8.camel@localhost>

Hi,

On Mon, 2007-07-16 at 12:32 -0700, Shirley Ma wrote:
> Is ib-utils depends on opensm-libs? If so I would suggest to change
> opensm-libs as libsmutils. Otherwise ib-utils won't work without
> installing opensm package. Does this make sense?

Not the whole of opensm, just opensm-libs. Why does the name
("opensm-libs" or "libsmutils") matter?

Sasha


From dledford at redhat.com  Tue Jul 17 09:20:49 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 16:20:49 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717152546.GA6863@mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
Message-ID: <1184689249.5165.419.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > have two different versions of dapl, but with exactly the same version
> > number.  A person can't tell them apart.
> 
> Yes, this sure looks like a problem. I think that versioning needs to be addressed
> at the package level, not at OFED level though. Right?

Versioning needs to be addressed at both levels.  You need versions of
software to start with, but then you still need releases of packages to
differentiate between different builds of a specific version of
software.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From mshefty at ichips.intel.com  Tue Jul 17 09:21:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 09:21:35 -0700
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469C9453.80905@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com>
	<469C9453.80905@voltaire.com>
Message-ID: <469CEC8F.4050106@ichips.intel.com>

> Can you explain why would not the IB CM use the thread context provided 
> by the mad layer?

You can end up with deadlock conditions when destroying cm_id's that 
have outstanding MADs.  It also increases MAD processing time, which can 
increase dropping MADs.

> Second, if the CM needs a different context why not use the system 
> threads? I understood from Michael's reply that the CM code relies on 
> some thread/queue flushing at the time of CM ID destruction, is it an 
> implementation issue that can change? if not, can't one dedicated thread 
> do the job?

The timing and use of the system threads is unknown.  When the ib_mad 
module was created, it was suggested that the system threads not be 
used.  (I think it was Roland who recommended this.)  We can change to 
system threads, but it does open the possibility of complicated deadlock 
conditions if other modules use the system threads as well.

The CM could change to using a single dedicated thread, but if there are 
multiple processors available, why restrict processing to only being 
able to use one of them?

- Sean


From mst at dev.mellanox.co.il  Tue Jul 17 09:27:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 19:27:31 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184689249.5165.419.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
Message-ID: <20070717162731.GA7479@mellanox.co.il>

> Quoting Doug Ledford <dledford at redhat.com>:
> Subject: Re: RFC OFED-1.3 installation
> 
> On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > > have two different versions of dapl, but with exactly the same version
> > > number.  A person can't tell them apart.
> > 
> > Yes, this sure looks like a problem. I think that versioning needs to be addressed
> > at the package level, not at OFED level though. Right?
> 
> Versioning needs to be addressed at both levels.  You need versions of
> software to start with, but then you still need releases of packages to
> differentiate between different builds of a specific version of
> software.

Why would we want to have different builds of a specific version of software
for a specific OS?  Could you give an example pls?

-- 
MST


From dledford at redhat.com  Tue Jul 17 09:39:40 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 16:39:40 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717162731.GA7479@mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
Message-ID: <1184690380.5165.430.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote:
> > Quoting Doug Ledford <dledford at redhat.com>:
> > Subject: Re: RFC OFED-1.3 installation
> > 
> > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > > > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > > > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > > > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > > > have two different versions of dapl, but with exactly the same version
> > > > number.  A person can't tell them apart.
> > > 
> > > Yes, this sure looks like a problem. I think that versioning needs to be addressed
> > > at the package level, not at OFED level though. Right?
> > 
> > Versioning needs to be addressed at both levels.  You need versions of
> > software to start with, but then you still need releases of packages to
> > differentiate between different builds of a specific version of
> > software.
> 
> Why would we want to have different builds of a specific version of software
> for a specific OS?  Could you give an example pls?

It's how you integrate needed patches immediately while waiting on the
next release of the software.  For example, when mdadm-2.6.2.tar.gz was
released, I built an mdadm-2.6.2-1 package (the 1 being the release
number).  I then went to work on some mdadm bug reports I had, and I
wrote a number of patches that squashed about 10 bug reports.  During
that time, I had three intervening builds as I integrated those patches
into the spec file and applied them to the 2.6.2 base source code during
the build process.  Those builds were 2.6.2-{2,3,4}.  I also forwarded
those patches upstream, they've been integrated into the upstream code
base, but a 2.6.3 has not yet been released, the upstream maintainer is
waiting until everything he's putting into it settles down.  When 2.6.3
is released, then I'll integrate 2.6.3 into our source SCM system, drop
all of the patches that have been integrated into the base 2.6.3 source
code, and build mdadm-2.6.3-1.

The point of all this being that most software maintainers don't release
new versions of their software on a daily or even weekly basis, so when
you are busy fixing up bugs in the software between releases, the
patches go in the spec file and you bump the release number so that each
subsequent build has a unique number that can positively identify both
the base source code used and all patches applied to that source code.

You also bump the release number of the package any time you make
changes to the spec file and rebuild.  So, for instance, if the only
change I made to a package was to change the %doc macro in the %files
section, I would still bump the release number and rebuild so that the
new rpm name-version-release combination would uniquely identify the
change.
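
As a rough illustration of the name-version-release scheme described
above, the sketch below splits a package string at its last two hyphens
and bumps the release. The helper name bump_release is hypothetical, and
it assumes a purely numeric release field (no distribution tag); it is
not part of rpm or of any OFED tooling.

```shell
#!/bin/sh
# Sketch only: bump the release part of a "name-version-release" string.
# bump_release is a hypothetical helper; assumes a numeric release.
bump_release() {
    nvr=$1
    release=${nvr##*-}     # text after the last '-'
    rest=${nvr%-*}         # "name-version"
    version=${rest##*-}    # text after the last '-' of what remains
    name=${rest%-*}        # everything before the version
    echo "$name-$version-$((release + 1))"
}

bump_release mdadm-2.6.2-1   # prints mdadm-2.6.2-2
```

The version stays tied to the upstream tarball, while the release
distinguishes each rebuild of that same source.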

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From mst at dev.mellanox.co.il  Tue Jul 17 09:45:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 19:45:00 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184690380.5165.430.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
Message-ID: <20070717164500.GB7479@mellanox.co.il>

> Quoting Doug Ledford <dledford at redhat.com>:
> Subject: Re: RFC OFED-1.3 installation
> 
> On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote:
> > > Quoting Doug Ledford <dledford at redhat.com>:
> > > Subject: Re: RFC OFED-1.3 installation
> > > 
> > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > > > > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > > > > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > > > > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > > > > have two different versions of dapl, but with exactly the same version
> > > > > number.  A person can't tell them apart.
> > > > 
> > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed
> > > > at the package level, not at OFED level though. Right?
> > > 
> > > Versioning needs to be addressed at both levels.  You need versions of
> > > software to start with, but then you still need releases of packages to
> > > differentiate between different builds of a specific version of
> > > software.
> > 
> > Why would we want to have different builds of a specific version of software
> > for a specific OS?  Could you give an example pls?
> 
> It's how you integrate needed patches immediately while waiting on the
> next release of the software.

OK.

> ...
> You also bump the release number of the package any time you make
> changes to the spec file and rebuild.

Since we have spec files as part of package, this will be really
the same as the previous case, right?


-- 
MST


From bramesh at vt.edu  Tue Jul 17 09:55:53 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 12:55:53 -0400
Subject: [ofa-general] OpenIB development help
Message-ID: <20070717165553.GA10298@vt.edu>

I am trying to migrate my research work to InfiniBand and have been
searching for resources that would help. I couldn't find any technical
documentation on how to develop applications using IB VAPI. The only
documentation that closely resembles an API description is Chapter 11 of
the InfiniBand Architecture release, which covers the software transport
verbs. I tried reading infiniband/verbs.h to get some understanding of
how to develop code that uses ibverbs.

There are many aspects that I still don't understand. I was wondering
whether the development community could point me to some resources so
that I can better understand how to use ibverbs. I am mainly interested
in using the reliable datagram transport provided by ibverbs. I am not
subscribed to the mailing list, so I would really appreciate it if you
could cc me on any reply. I appreciate anyone taking time out of their
busy schedule to help.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh


From dledford at redhat.com  Tue Jul 17 10:06:02 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 17:06:02 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717164500.GB7479@mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
Message-ID: <1184691962.5165.450.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 19:45 +0300, Michael S. Tsirkin wrote:
> > Quoting Doug Ledford <dledford at redhat.com>:
> > Subject: Re: RFC OFED-1.3 installation
> > 
> > On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote:
> > > > Quoting Doug Ledford <dledford at redhat.com>:
> > > > Subject: Re: RFC OFED-1.3 installation
> > > > 
> > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > > > > > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > > > > > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > > > > > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > > > > > have two different versions of dapl, but with exactly the same version
> > > > > > number.  A person can't tell them apart.
> > > > > 
> > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed
> > > > > at the package level, not at OFED level though. Right?
> > > > 
> > > > Versioning needs to be addressed at both levels.  You need versions of
> > > > software to start with, but then you still need releases of packages to
> > > > differentiate between different builds of a specific version of
> > > > software.
> > > 
> > > Why would we want to have different builds of a specific version of software
> > > for a specific OS?  Could you give an example pls?
> > 
> > It's how you integrate needed patches immediately while waiting on the
> > next release of the software.
> 
> OK.
> 
> > ...
> > You also bump the release number of the package any time you make
> > changes to the spec file and rebuild.
> 
> Since we have spec files as part of package, this will be really
> the same as the previous case, right?

Depends.  Right now the spec file gets its version out of the configure
stuff.  That version only updates when you update the version of the
software itself.  It doesn't increment on each change to the source
repo, only on the major updates when you would release a new tarball
anyway.  Package versioning is, by necessity, finer grained than source
repo versioning.  You don't release a new dapl tarball just because you
updated some comments to remove a typo.  But you *do* update rpm
versions on every single change, at least if you are going to distribute
the rpm.

Look, rpms are just like versioned tarballs.  Once they go out in the
wild, that particular name-version-release combination is FROZEN.  It
NEVER changes.  Changing the code underlying that particular
name-version-release is just as bad as the whole Linus scenario I
described.  We couldn't stay in business if we let that happen, period.
That's why we have the guidelines that we do for package versioning.

If you need daily builds, there is a way to make that happen that
preserves the upgrade process and preserves unique name-version-release
combinations.  In that case, you would use the daily feature of that
script I wrote.  It spits out a tarball named package-git.tar.gz.  The
-git nomenclature clearly identifies that this is *not* a versioned
tarball and it is *not* required to stay the same.  You could put a date
or head tag on the name as well if you want to make it unique.  I didn't
do that because then the daily git tarballs take up *way* too much space
in our SCM repo.  Then, you name the package

name-version-0.release.git${DATE}

This way, each daily build has a unique name.  You increment the release
number with each daily build, and the date tag allows you to see at a
glance what date of pull the release goes with.  Once the software has
reached maturity, you simply pull the final name-version.tar.gz tarball
and update the spec to be name-version-1 and it automatically compares
as newer than the daily builds and upgrades.  Then subsequent rpm builds
from that official release version start incrementing the release number
like normal.
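
A minimal sketch of why that ordering works, assuming rpm's usual
segment-wise comparison of release strings (runs of digits compared
numerically, runs of letters compared lexically, separators skipped).
relcmp() is a hypothetical, simplified stand-in written for
illustration; it is not rpm's own rpmvercmp() and it ignores special
cases such as '~'.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified, illustrative comparison of two rpm release strings.
 * Returns -1 if a is older than b, 0 if equal, 1 if a is newer.
 */
static int relcmp(const char *a, const char *b)
{
	while (*a || *b) {
		/* Separators such as '.' only delimit segments. */
		while (*a && !isalnum((unsigned char)*a))
			a++;
		while (*b && !isalnum((unsigned char)*b))
			b++;
		if (!*a || !*b)
			break;

		const char *sa = a, *sb = b;
		int num = isdigit((unsigned char)*a) != 0;

		/* Grab one all-digit or all-letter segment from each string. */
		if (num) {
			while (isdigit((unsigned char)*a)) a++;
			while (isdigit((unsigned char)*b)) b++;
		} else {
			while (isalpha((unsigned char)*a)) a++;
			while (isalpha((unsigned char)*b)) b++;
		}
		size_t la = (size_t)(a - sa), lb = (size_t)(b - sb);

		/* Segment types differ: a numeric segment counts as newer. */
		if (lb == 0)
			return num ? 1 : -1;

		if (num) {
			/* Drop leading zeros; then more digits means larger. */
			while (la > 1 && *sa == '0') { sa++; la--; }
			while (lb > 1 && *sb == '0') { sb++; lb--; }
			if (la != lb)
				return la > lb ? 1 : -1;
		}

		size_t n = la < lb ? la : lb;
		int c = memcmp(sa, sb, n);
		if (c)
			return c > 0 ? 1 : -1;
		if (la != lb)
			return la > lb ? 1 : -1;
	}

	/* Whichever string still has segments left is the newer one. */
	if (!*a && !*b)
		return 0;
	return *a ? 1 : -1;
}
```

With this, relcmp("0.1.git20070716", "0.2.git20070717") is negative, so
each daily build upgrades the previous one, and
relcmp("0.2.git20070717", "1") is negative as well, so the final
name-version-1 package upgrades any daily build.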

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From jsquyres at cisco.com  Tue Jul 17 10:11:01 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 17 Jul 2007 13:11:01 -0400
Subject: [ofa-general] Re: [ewg] Re: RFC OFED-1.3 installation
In-Reply-To: <1184691962.5165.450.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
Message-ID: <8BD5CE10-FB60-4694-8DF4-2BBF21FA762A@cisco.com>

On Jul 17, 2007, at 1:06 PM, Doug Ledford wrote:

> Look, rpms are just like versioned tarballs.  Once they go out in the
> wild, that particular name-version-release combination is FROZEN.  It
> NEVER changes.

I think that these 3 statements sum up the whole argument.  I find it  
hard to disagree with them.  :-)

-- 
Jeff Squyres
Cisco Systems


From mst at dev.mellanox.co.il  Tue Jul 17 10:12:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 20:12:50 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184691962.5165.450.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
Message-ID: <20070717171250.GD7479@mellanox.co.il>

> Quoting Doug Ledford <dledford at redhat.com>:
> Subject: Re: RFC OFED-1.3 installation
> 
> On Tue, 2007-07-17 at 19:45 +0300, Michael S. Tsirkin wrote:
> > > Quoting Doug Ledford <dledford at redhat.com>:
> > > Subject: Re: RFC OFED-1.3 installation
> > > 
> > > On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote:
> > > > > Quoting Doug Ledford <dledford at redhat.com>:
> > > > > Subject: Re: RFC OFED-1.3 installation
> > > > > 
> > > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote:
> > > > > > > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.  In
> > > > > > > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > > > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> > > > > > > lot, but anything is enough).  So, between OFED 1.0 and OFED 1.1, you
> > > > > > > have two different versions of dapl, but with exactly the same version
> > > > > > > number.  A person can't tell them apart.
> > > > > > 
> > > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed
> > > > > > at the package level, not at OFED level though. Right?
> > > > > 
> > > > > Versioning needs to be addressed at both levels.  You need versions of
> > > > > software to start with, but then you still need releases of packages to
> > > > > differentiate between different builds of a specific version of
> > > > > software.
> > > > 
> > > > Why would we want to have different builds of a specific version of software
> > > > for a specific OS?  Could you give an example pls?
> > > 
> > > It's how you integrate needed patches immediately while waiting on the
> > > next release of the software.
> > 
> > OK.
> > 
> > > ...
> > > You also bump the release number of the package any time you make
> > > changes to the spec file and rebuild.
> > 
> > Since we have spec files as part of package, this will be really
> > the same as the previous case, right?
> 
> Depends.  Right now the spec file gets its version out of the configure
> stuff.  That version only updates when you update the version of the
> software itself.  It doesn't increment on each change to the source
> repo, only on the major updates when you would release a new tarball
> anyway.  Package versioning is, by necessity, finer grained than source
> repo versioning.  You don't release a new dapl tarball just because you
> updated some comments to remove a typo.  But you *do* update rpm
> versions on every single change, at least if you are going to distribute
> the rpm.
> 
> Look, rpms are just like versioned tarballs.  Once they go out in the
> wild, that particular name-version-release combination is FROZEN.

It really looks like this is a workaround for when you want to apply
a patch without going through the maintainer.

The way the OFED release process works, we really don't
do releases all that often, and when we do, we can coordinate with
the maintainer.

-- 
MST


From mshefty at ichips.intel.com  Tue Jul 17 10:22:03 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 10:22:03 -0700
Subject: [ofa-general] Re: [PATCH] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <20070717062159.GA2177@mellanox.co.il>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<adatzs4wasb.fsf@cisco.com>
	<20070717062159.GA2177@mellanox.co.il>
Message-ID: <469CFABB.8090502@ichips.intel.com>

> However, creating a thread per port does seem
> somewhat arbitrary, and would mean wasting (a small amount of) resources
> apparently for no gain if there are lots of HCA ports in a box.

At least in theory, it should be easy to change the CM threading model 
to 1 thread per processor or a single thread.  I don't know if systems 
are more likely to have more HCA ports or processors, but all of our 
systems here (a few hundred nodes total) have more processors.  And 
given current IB speeds, I suspect this may be the common configuration.

- Sean


From dledford at redhat.com  Tue Jul 17 10:36:40 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 17:36:40 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717171250.GD7479@mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
Message-ID: <1184693800.5165.480.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 20:12 +0300, Michael S. Tsirkin wrote:

> > Look, rpms are just like versioned tarballs.  Once they go out in the
> > wild, that particular name-version-release combination is FROZEN.
> 
> It really looks like this is a workaround for when you want to apply
> a patch without going through the maintainer.

Not really.  When you have a customer with a sev 1 issue, you don't wait
for upstream to release a new version of gcc before you get them their
fix.

There are also those times when you have an older, long released product
that isn't up to date with upstream, for instance RHEL4 mdadm is 1.12.0
and will not be updated to the 2.6.2 version that's in Fedora.  If I
find a bug in that 1.12.0 version of mdadm, then I'll fix it using a
patch in the spec file.  If the bug also exists in upstream then it will
get sent upstream to be included in the latest upstream release.  But,
upstream won't care about version 1.12.0, and they won't release a new
version 1 mdadm just for our bugfix, so we carry those targeted fixes
around as long as we have that version 1 mdadm on systems.

There are other reasons to do this as well, for instance when you need
to make a change as part of package integration that simply isn't needed
or wanted upstream.  For example, many times upstream couldn't care less
about patches that implement our particular file system layout for a
package.

There are lots of things that we as a distributor have to care about
that upstream generally does not.  The spec file and patches are how we
solve our customer's problems.  They are what make a stable
distribution, as opposed to a "bleeding edge, must always update to
latest upstream version to fix any problem" system, a reality.  It's the
difference between RHEL and Fedora.

> The way the OFED release process works, we really don't
> do releases all that often, and when we do, we can coordinate with
> the maintainer.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From rdreier at cisco.com  Tue Jul 17 10:41:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:41:49 -0700
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03> (Eli Cohen's message of "Thu,
	05 Jul 2007 16:55:22 +0300")
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <aday7hfrkpu.fsf@cisco.com>

I did a quick hack to enable copybreak for UD packets up to 256 bytes
(see below).  This is still missing copybreak for CM / RC mode.
However I just wanted to see how it affected performance.  And the
answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe
HCA) it didn't make any difference in small-message latency or
throughput, at least none that I could measure with netpipe (NPtcp).

I'm not sure whether to pursue this or not.


diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 285c143..bf60bbb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -59,6 +59,8 @@ enum {
 	IPOIB_PACKET_SIZE         = 2048,
 	IPOIB_BUF_SIZE 		  = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
 
+	IPOIB_COPYBREAK		  = 256,
+
 	IPOIB_ENCAP_LEN 	  = 4,
 
 	IPOIB_CM_MTU              = 0x10000 - 0x10, /* padding to align header to 16 */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 1094488..8d6d0d0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -203,22 +203,48 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
 
-	/*
-	 * If we can't allocate a new RX buffer, dump
-	 * this packet and reuse the old buffer.
-	 */
-	if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) {
-		++priv->stats.rx_dropped;
-		goto repost;
-	}
-
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
-	ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+	if (wc->byte_len < IPOIB_COPYBREAK + IB_GRH_BYTES) {
+		struct sk_buff *new_skb;
+
+		/*
+		 * Add 12 bytes to 4-byte IPoIB header to get IP
+		 * header at a multiple of 16.
+		 */
+		new_skb = dev_alloc_skb(wc->byte_len - IB_GRH_BYTES + 12);
+		if (unlikely(!new_skb)) {
+			++priv->stats.rx_dropped;
+			goto repost;
+		}
+
+		skb_reserve(new_skb, 12);
+		skb_put(new_skb, wc->byte_len - IB_GRH_BYTES);
 
-	skb_put(skb, wc->byte_len);
-	skb_pull(skb, IB_GRH_BYTES);
+		ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
+					   DMA_FROM_DEVICE);
+		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
+						 wc->byte_len - IB_GRH_BYTES);
+		ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
+					      DMA_FROM_DEVICE);
+
+		skb = new_skb;
+	} else {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) {
+			++priv->stats.rx_dropped;
+			goto repost;
+		}
+
+		ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+
+		skb_put(skb, wc->byte_len);
+		skb_pull(skb, IB_GRH_BYTES);
+	}
 
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
 	skb_reset_mac_header(skb);
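
For readers who don't know the term, here is a minimal userspace sketch
of the copybreak idea in the patch above.  It uses plain malloc() in
place of skbs and DMA mapping, so it only illustrates the buffer
handling; struct rx_slot, rx_deliver() and the RX_* names are
hypothetical and do not exist in the IPoIB driver.

```c
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE  2088	/* IPOIB_PACKET_SIZE + IB_GRH_BYTES in the patch */
#define RX_COPYBREAK  256	/* same threshold as IPOIB_COPYBREAK */

/* Hypothetical stand-in for one posted receive buffer. */
struct rx_slot {
	unsigned char *buf;	/* RX_BUF_SIZE bytes, owned by the receive ring */
};

/*
 * Hand a received packet of len bytes to the caller.  Returns a buffer
 * the caller must free(), or NULL if the packet had to be dropped.
 * *cap is set to the size of the allocation that now backs the packet.
 */
static unsigned char *rx_deliver(struct rx_slot *slot, size_t len, size_t *cap)
{
	unsigned char *out;

	if (len < RX_COPYBREAK) {
		/* Small packet: copy it out, keep the big buffer posted. */
		out = malloc(len ? len : 1);
		if (!out)
			return NULL;
		memcpy(out, slot->buf, len);
		*cap = len;
		return out;
	}

	/* Large packet: give the big buffer away and post a fresh one. */
	unsigned char *fresh = malloc(RX_BUF_SIZE);
	if (!fresh)
		return NULL;	/* drop this packet and reuse the old buffer */
	out = slot->buf;
	slot->buf = fresh;
	*cap = RX_BUF_SIZE;
	return out;
}
```

The small-packet path costs one extra copy but leaves only a
right-sized allocation attached to the packet; the large-packet path
is the behaviour the driver had before the patch.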


From rdreier at cisco.com  Tue Jul 17 10:45:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:45:00 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in QP access flags
In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il> (Dotan Barak's
	message of "Tue, 17 Jul 2007 17:58:57 +0300")
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
Message-ID: <adatzs2sz4z.fsf@cisco.com>

 >  	case RDMA_TRANSPORT_IWARP:
 >  		if (!id_priv->cm_id.iw) {
 > -			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
 > +			qp_attr->qp_access_flags = 0;
 >  			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;

Looks sane to me... among iWARP drivers, cxgb3 ignores IB_ACCESS_LOCAL_WRITE
in qp_access_flags and amso1100 doesn't look at qp_access_flags at all (??).


From mst at dev.mellanox.co.il  Tue Jul 17 10:45:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 20:45:26 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184693800.5165.480.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
Message-ID: <20070717174526.GE7479@mellanox.co.il>

> There are lots of things that we as a distributor have to care about
> that upstream generally does not.  The spec file and patches are how we
> solve our customer's problems.  They are what make a stable
> distribution, as opposed to a "bleeding edge, must always update to
> latest upstream version to fix any problem" system, a reality.  It's the
> difference between RHEL and Fedora.

I think I am getting it - you want to release a patched version of some OFED
library without going through openfabrics? OK.
So I imagine that's when you would increment the rpm-specific version number.
But I can't see why an OFED release would want to play with these.

-- 
MST


From rdreier at cisco.com  Tue Jul 17 10:49:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:49:52 -0700
Subject: [ofa-general] Re: [PATCH 1 of 2]  mlx4: implement query-qp
In-Reply-To: <200707170955.10933.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 17 Jul 2007 09:55:10 +0300")
References: <200706211227.47794.jackm@dev.mellanox.co.il>
	<200707151028.24013.jackm@dev.mellanox.co.il>
	<adasl7oxq1r.fsf@cisco.com>
	<200707170955.10933.jackm@dev.mellanox.co.il>
Message-ID: <adaps2qsywv.fsf@cisco.com>

 > Thanks for applying it.  I sent it to you as a patch to a patch because
 > I thought the change would be much more obvious to you this way.

OK, but I basically have to apply it by hand then.  I guess the best I
could do would be to revert the original patch but save a copy, apply
your patch to the patch, and then apply that patch.  Anyway it makes
things much more laborious.

 > Would you rather next time that I just send you an updated version of the original patch,
 > or should I send the fix as a patch to the code after the original patch has been applied?

Either way is fine, but an incremental patch is probably better
(especially because it makes the changes easiest to see).  And
especially in this case, where the original buggy patch was already
upstream, an incremental patch is definitely best.


From mst at dev.mellanox.co.il  Tue Jul 17 10:51:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 20:51:54 +0300
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To: <aday7hfrkpu.fsf@cisco.com>
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
Message-ID: <20070717175154.GF7479@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: socket buffer accounting with UDP/ipoib
> 
> I did a quick hack to enable copybreak for UD packets up to 256 bytes
> (see below).  This is still missing copybreak for CM / RC mode.
> However I just wanted to see how it affected performance.  And the
> answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe
> HCA) it didn't make any difference in small-message latency or
> throughput, at least none that I could measure with netpipe (NPtcp).

Not every benchmark will show an improvement: what we save with copybreak
is actually memory, which only has a performance impact once you start
reaching the RCVBUF size. And the savings apply only if the message size is
below the threshold, so you'd better set NDELAY to see any effect.

Try running a UDP benchmark with small message size and
NDELAY, and looking at number of UDP errors with netstat.
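
A rough back-of-the-envelope sketch of the memory argument, assuming
the socket is charged for the whole allocation behind each packet.
packets_that_fit() is a hypothetical helper, and the model ignores the
per-skb overhead that the kernel charges as well.

```c
#include <stddef.h>

#define IPOIB_PACKET_SIZE 2048	/* from ipoib.h */
#define IB_GRH_BYTES        40	/* size of the IB global route header */

/*
 * Number of queued packets that fit into a receive budget of rcvbuf
 * bytes when each packet is charged `charged` bytes.
 */
static size_t packets_that_fit(size_t rcvbuf, size_t charged)
{
	return charged ? rcvbuf / charged : 0;
}
```

With a hypothetical 128 KB budget, packets_that_fit(128 * 1024,
IPOIB_PACKET_SIZE + IB_GRH_BYTES) gives 62 packets without copybreak.
A 64-byte message copied into a 64 + 12 byte buffer, as in the patch,
gives packets_that_fit(128 * 1024, 64 + 12) = 1724.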

-- 
MST


From rdreier at cisco.com  Tue Jul 17 10:52:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:52:55 -0700
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <20070717043740.GB8527@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 17 Jul 2007 07:37:40 +0300")
References: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>
	<ada1wf7vgfb.fsf@cisco.com> <20070717043740.GB8527@mellanox.co.il>
Message-ID: <adalkdesyrs.fsf@cisco.com>

 > Here's some anecdotal evidence :)
 > http://lists.openfabrics.org/pipermail/general/2007-May/035758.html

Right, but then we went on to say that we probably want to use
multiple vectors to separate out multiple HCA ports rather than
send/receive on the same port.  And the current IPoIB implementation
of having that second CQ seems suboptimal anyway, since it seems to
leave us susceptible to the interrupt overload that NAPI was supposed
to solve.

At a higher level, I'm left wondering why nobody talked about multiple
EQs during the last months of the 2.6.22 process and now all of a
sudden it becomes urgent in the last few days of the 2.6.23 merge
window.  That's not really how I like to merge features....

 - R.


From rdreier at cisco.com  Tue Jul 17 10:53:50 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:53:50 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070716200540.GA8527@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 16 Jul 2007 23:05:40 +0300")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<20070713054711.GA21709@mellanox.co.il> <adar6nc2mt4.fsf@cisco.com>
	<20070714175425.GA17597@mellanox.co.il> <adabqecxpte.fsf@cisco.com>
	<20070716200540.GA8527@mellanox.co.il>
Message-ID: <adahco2syq9.fsf@cisco.com>

 > Well, the only issue I recall is about the # of EQs we want to allocate.
 > Was there something else?

Yes, some ideas about how applications should pick which EQ to use.
And how to handle CPU affinity.  And whether we want to try to do
something NUMA-aware.

 - R.


From rdreier at cisco.com  Tue Jul 17 10:57:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:57:42 -0700
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To: <20070717175154.GF7479@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 17 Jul 2007 20:51:54 +0300")
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
	<20070717175154.GF7479@mellanox.co.il>
Message-ID: <adad4yqsyjt.fsf@cisco.com>

 > Try running a UDP benchmark with small message size and
 > NDELAY, and looking at number of UDP errors with netstat.

If you give me an exact command line I can try it but I don't think
I'll have time to figure out what to run by myself.


From rdreier at cisco.com  Tue Jul 17 11:05:33 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 11:05:33 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717152546.GA6863@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 17 Jul 2007 18:25:46 +0300")
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
Message-ID: <ada8x9esy6q.fsf@cisco.com>

It seems to me that this is all stemming from the same old fundamental
confusion between a "release" and a "distribution."  I think everyone
would be better served by a process where individual maintainers were
responsible for releasing tarballs of their packages, with schedules
coordinated toward an overall "openfabrics release" (see
http://live.gnome.org/TwoPointNineteen for hints about a process that
might work), and then an OFED team handled spec files and kernel
module packaging for various distros.

In this world I would expect Doug could just take tarballs from the
openfabrics world and not be bothered by the OFED RPM spec files
(unless he wants to use them as a reference).

To summarize, there would be two separate "products":

 - openfabrics release:
     format: .tar.gz files
     customers: OFED, Red Hat/Novell/Debian/etc packagers

 - OFED release:
     format: .srpm and binary .rpm files
     customers: end users who need newer drivers than their distribution includes

 - R.


From rdreier at cisco.com  Tue Jul 17 11:06:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 11:06:39 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <OFCB3C6CBA.79B1F6B5-ON87257317.00675CD3-88257317.00675C91@us.ibm.com>
	(Shirley Ma's message of "Fri, 13 Jul 2007 11:50:54 -0700")
References: <OFCB3C6CBA.79B1F6B5-ON87257317.00675CD3-88257317.00675C91@us.ibm.com>
Message-ID: <ada4pk2sy4w.fsf@cisco.com>

 >         We are working on IPoIB to use multiple EQ for multiple
 > links/connections scalability. Does this mean this will wait for 2.6.24?

I think so -- I don't want to merge something that first appears in
the last few days of the merge window.  The idea is to get your stuff
queued up *before* the merge window opens.

 - R.


From rdreier at cisco.com  Tue Jul 17 11:07:59 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 11:07:59 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adahco943ip.fsf@cisco.com> (Roland Dreier's message of "Thu,
	12 Jul 2007 16:15:58 -0700")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
Message-ID: <adazm1urji8.fsf@cisco.com>

 >  - Take a look at Sean's local SA caching patches.  I merged
 >    everything else from Sean's tree, but I'm still undecided about
 >    these.  I haven't read them carefully yet, but even aside from that
 >    I don't have a good feeling about whether there's consensus about
 >    this yet.  Any opinions about merging, for or against, would be
 >    appreciated here.

Does anyone other than Sean have an opinion here?  If you want this
feature, if you've tested it, if you don't think it's ready yet,
whatever, please speak up -- I don't feel comfortable making a
decision on my own here (although I will if I have to).


From mshefty at ichips.intel.com  Tue Jul 17 11:20:49 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 11:20:49 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in	QP access flags
In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il>
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
Message-ID: <469D0881.6050409@ichips.intel.com>

Dotan Barak wrote:
> Remove local write permission enable in QP access flags
> (this attribute is being used only for remote permissions).
> 
> Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

Acked-by: Sean Hefty <sean.hefty at intel.com>

Steve, does this look okay to you?

> 
> ---
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 23af7a0..9ffb998 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
>  		break;
>  	case RDMA_TRANSPORT_IWARP:
>  		if (!id_priv->cm_id.iw) {
> -			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
> +			qp_attr->qp_access_flags = 0;
>  			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
>  		} else
>  			ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From swise at opengridcomputing.com  Tue Jul 17 11:29:20 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 13:29:20 -0500
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in	QP access flags
In-Reply-To: <469D0881.6050409@ichips.intel.com>
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
	<469D0881.6050409@ichips.intel.com>
Message-ID: <469D0A80.9060906@opengridcomputing.com>

Why are you changing this?

I think we set it for a specific reason (but I don't remember why just 
now)...


Sean Hefty wrote:
> Dotan Barak wrote:
>> Remove local write permission enable in QP access flags
>> (this attribute is being used only for remote permissions).
>>
>> Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>
> 
> Acked-by: Sean Hefty <sean.hefty at intel.com>
> 
> Steve, does this look okay to you?
> 
>>
>> ---
>>
>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>> index 23af7a0..9ffb998 100644
>> --- a/drivers/infiniband/core/cma.c
>> +++ b/drivers/infiniband/core/cma.c
>> @@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
>>          break;
>>      case RDMA_TRANSPORT_IWARP:
>>          if (!id_priv->cm_id.iw) {
>> -            qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
>> +            qp_attr->qp_access_flags = 0;
>>              *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
>>          } else
>>              ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,


From dledford at redhat.com  Tue Jul 17 11:34:14 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 18:34:14 +0000
Subject: [ofa-general] RE: RFC OFED-1.3 installation
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com>
Message-ID: <1184697254.5165.527.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 18:36 +0300, Vladimir Sokolovsky wrote:
> [ snip ]
> > Let me copy and paste an email conversation I had with Or that
> > highlights why this is broken:
> > 
> > ------- Begin cut-n-paste
> > On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote:
> > > [sorry for breaking the thread, I am working from home now and
> unable
> > to use normal mailer.]
> > >
> 
> 
> > Let me give an example.  In OFED 1.0, you shipped dapl version 1.2.
> In
> > OFED 1.1, you also shipped dapl version 1.2.  However, code inspection
> > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not
> > a lot, but anything is enough).
> 
> I am not supposed to support correct versioning for every package in
> OFED.
> It should be done by the maintainer of the package.

This may be true going forward when you've split all the packages up,
but it definitely was *not* true when all the packages were thrown into
one huge tarball and built out of one spec file.  Since the versioning
information was lost when that tarball was recreated over and over
again, versioning responsibility necessarily fell back upon the spec
file.

> > The only reason that the OFED distribution has *ever* reliably
> > installed the rpms you wanted installed is because you compile things
> > locally and then *force* the upgrade of rpms over the top of older
> rpms
> > that have the same version number.  And even then, you yourselves
> can't
> > tell the difference between a customer with the OFED 1.0 or OFED 1.1
> > dapl installed by checking the RPM version, you just have to go off
> > what the end user *tells* you he installed and hope he's right.
> > 
> 
> OFED does not force an upgrade, it simply removes the previous version
> and then installs the new one.

From the viewpoint of proper upgrades, there is no difference.  Removing
and then installing is just a work around for broken upgrades.

> This is why package versioning does not affect OFED installation.

Right, you guys did things in a way that allowed you to not care about
something that any distributor *must* care about.

> I agree that it is different for Linux Distributions

Open Fabrics Enterprise Distribution

>  and should be fixed
> for OFED-1.3 but it should 
> be under responsibility of package maintainer.

The maintainer of any given software is responsible for their tarballs.
The maintainer of any given rpm is responsible for their spec file and
rpms.  If a person takes on both roles, like Roland does, then they
handle both roles and the roles mostly merge into one.  But whenever
someone other than the project maintainer decides to be the package
maintainer, they are different roles and each is responsible for their
own versioning requirements.

> So, all RPM spec files should be fixed for OFED-1.3 and properly
> maintained.
> We should discuss the kernel-ib package structure and its spec file.
> 
> > And I have to *know* what software my customer is running in order to
> > support them.  Because you guys have done things the way you have, I
> > can't know that.  I might be able to know if I could also guarantee
> > they didn't download and locally compile your packages, but if they
> > did, then the same version number of RPM can mean two different things
> > entirely depending on whether it's your RPM or mine.
> > 
> 
> You can easily check if there is an OFED installation by running 'ofed_info'.

No, you can't.  At least not on any system running our packages.  We
don't, and won't, include anything like ofed_info in our distribution.
We have one tool, and only one, that we use to tell what software a
system is running: rpm.  We will not include things like ofed_info just
to find out what rpm should, if used properly, already tell us.  That
would be unnecessary duplication and results in all sorts of support
problems when you start needing to be able to tell customers to use
multiple different tools to try and figure out what one tool should be
able to tell them.

> 
> > I posted links to a wealth of valuable information on the topic of
> > making a proper spec file and creating *good* packages during my talk
> > at Sonoma.  I gather you haven't read those or you never would have
> > suggested the above for creating the RPMs.
> > 
> 
> I just looked at your presentation from Sonoma. There you provide an
> example of a management package and your make.dist script for creating
> daily builds and releases.
> 
> I have some questions about this script:
> ...
> VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
> ...
>         DATE=`date +%Y%m%d`
>         if [ -f $TMPDIR/$target.release ]; then
>                 RELEASE=`cat $TMPDIR/$target.release`
>                 RELEASE=`expr $RELEASE + 1`
>         else
>                 RELEASE=1
>         fi
>         echo $RELEASE > $TMPDIR/$target.release
>         RELEASE=0.${RELEASE}.${DATE}git
>         TARBALL=$target-git.tgz
> fi
> ...
> cp -a $target $target-$VERSION
> sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target-$VERSION/$target.spec
> cd $target-$VERSION
> ./autogen.sh
> cd ..
> echo "Creating $TMPDIR/$TARBALL"
> tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION
> 
> I thought that the standard way to get tar.gz file is using autotools (3
> commands) like I wrote before:
> autogen.sh, configure, make dist.
> Can you explain why your way is better?

autogen.sh yes (and it should have been in my script, my current one has
it, but that one didn't).  Since configure tries to figure out a bunch
of stuff about the build environment, it must be run in the software
development environment of the platform you are targeting the final
build for.  If you run it on your local RHEL4 machine, but our RHEL4
build environment for our next update has a different glibc that changes
some minor thing that configure actually checks, then it would be wrong.
So, even if you run configure, I can't trust the output from it.
Obviously, if you aren't running configure, then make dist is
irrelevant.  So, you can run configure if you want, but I will ignore
the output in anything I build.  And if the make dist operation removes
any files necessary for me to properly reconfigure the software using
configure, then it will be a totally broken tarball from my perspective.

> Do you have a proposal for daily builds? We need OFED daily builds for
> verification. 
> We can't wait for RedHat updates to get the updated OFED packages.

I have a newer version of that make.dist script that I wrote to
specifically work for the repos other than the management tree.  Using
that script, you could just do this:

for repo in *; do
    ./make.dist $repo daily
    rm $RPMDIR/${repo}*
    rpmbuild --rebuild dist/$repo-git.tar.gz
    rpm -Uvh $RPMDIR/${repo}*
done

That's really all you need for anything you are building for internal
use.  And if one of you wanted to be responsible for providing the rpms,
then a single person could actually maintain versioned rpms that way.
It would only break down when you try to run the make.dist script from
different systems since it creates a file that lets it know what the
next number in sequence is each time it builds that git.tar.gz file.
However, even that could be solved by putting the release file in some
sort of SCM if you wanted multiple people to be able to build properly
versioned rpms.
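
As a rough sketch of the release-file bookkeeping described above (this
is not the actual make.dist; the function names next_release and
daily_release are made up for illustration, and the 0.N.DATEgit format
is taken from the script excerpt quoted earlier), the counter logic
could look something like this:

```shell
#!/bin/sh
# Hypothetical helper: bump a per-target release counter kept in a file
# and print the new value. The first call creates the file with 1.
next_release() {
    counter_file=$1
    if [ -f "$counter_file" ]; then
        release=$(cat "$counter_file")
        release=$((release + 1))
    else
        release=1
    fi
    # Persist the counter so the next build gets the next number.
    echo "$release" > "$counter_file"
    echo "$release"
}

# Hypothetical helper: build a daily-snapshot release string of the
# form 0.<counter>.<date>git. The date defaults to today (YYYYMMDD).
daily_release() {
    n=$(next_release "$1")
    d=${2:-$(date +%Y%m%d)}
    echo "0.${n}.${d}git"
}
```

Because the counter lives in a plain file, keeping that file in an SCM
is what would let several people build from one shared sequence.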

Really, the strictest guidelines apply to things you make publicly
available.  If you want to have a private, EWG only area on the ofa
server where you guys can share daily, unversioned builds, go right
ahead.  It's when they go out in the wild and you expect other people to
pick them up that you have to care.

> What OFED-1.3 structure do you propose? Should it consist of source RPMs
> or tgz files?
> What features should the install script support?

From my standpoint, tgz files are really about all I care about.  For
instance, no matter what install script you write, I won't be using it
because we have our own install/update methods.  And it's hard for you
to make a spec file that's both relevant for Red Hat and SuSE and at the
same time clean enough to meet our requirements.

There is one suggestion I would make though that greatly helps with the
whole package versioning issue.  We had this trick we used in our kernel
RPMs back when we used to ship a kernel-source rpm (which was different
than the src.rpm, it was a pre-prepared, already prep'ed source tree
ready to be built from).  When we built our own kernel RPMs, we would go
into the top level Makefile in the kernel source tree and edit the
extraversion to be what matched the rpm.  When we made that source tree
that would become the kernel-source package, we edited extraversion to
-prep so that the final result if a customer used it to build a kernel
would be something like 2.6.9-prep in the kernel version.  You guys
could do something similar in all the src.rpms you ship.  Since you know
they will be compiled locally, you could easily put something
like .local at the end of your release string, so that say dapl would be
version: 1.2.1, release: 1.local or 1.ofa or something like that.  It
doesn't solve package version comparison issues (aka, telling which
package is newer by the number), but it does help to solve
identification issues.
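
To illustrate the suffix idea, here is a minimal sketch.  tag_release
is a hypothetical helper, not part of any OFED or Red Hat tooling, and
the default tag "local" is just the example used above:

```shell
#!/bin/sh
# Hypothetical helper: append an origin tag to an RPM release string so
# a locally compiled package can be told apart from a distributor build.
# Usage: tag_release <release> [tag]   (tag defaults to "local")
tag_release() {
    release=$1
    tag=${2:-local}
    case "$release" in
        *."$tag") echo "$release" ;;          # already tagged, leave as-is
        *)        echo "${release}.${tag}" ;; # e.g. 1 -> 1.local
    esac
}
```

The result would go into the Release field of the spec file, so that
rpm -q shows something like dapl-1.2.1-1.local on a locally built
system.  It only identifies where the package came from; it does not
make releases comparable.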

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From sean.hefty at intel.com  Tue Jul 17 11:34:40 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 11:34:40 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <ada8x9esy6q.fsf@cisco.com>
Message-ID: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com>

>I think everyone
>would be better served by a process where individual maintainers were
>responsible for releasing tarballs of their packages, with schedules
>coordinated toward an overall "openfabrics release"

For what it's worth, I agree with this approach.

- Sean


From rdreier at cisco.com  Tue Jul 17 11:35:30 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 11:35:30 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in	QP access flags
In-Reply-To: <469D0A80.9060906@opengridcomputing.com> (Steve Wise's message of
	"Tue, 17 Jul 2007 13:29:20 -0500")
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
	<469D0881.6050409@ichips.intel.com>
	<469D0A80.9060906@opengridcomputing.com>
Message-ID: <adaodiari8d.fsf@cisco.com>

 > Why are you changing this?

Because "local write" doesn't make sense as a QP permission -- the QP
access flags are about what the remote end of a connection is allowed
to do with RDMA.

 > I think we set it for a specific reason (but I don't remember why just
 > now)...

I can't see anything in the cxgb3 or amso1100 drivers that would pay
attention to this flag -- in fact amso1100 ignores qp_access_flags entirely.

 - R.


From rdreier at cisco.com  Tue Jul 17 11:37:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 11:37:39 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads
	per qp
In-Reply-To: <200707171311.43680.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 17 Jul 2007 13:11:43 +0300")
References: <200707171311.43680.jackm@dev.mellanox.co.il>
Message-ID: <adak5syri4s.fsf@cisco.com>

 > Change max outstanding rdma reads per QP from 4 to 16.
 > This enables an improvement in latency for rdma-read applications.

This only affects performance if an app queues more than 4 RDMA READ
requests, right?  (Because the 5th request doesn't have to wait for
the 1st request to complete before it's sent)

Do we want to increase this for mthca too, or is mlx4 so much faster
than mthca that we need more requests in flight to keep the pipeline
full?

 - R.


From swise at opengridcomputing.com  Tue Jul 17 11:39:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 13:39:53 -0500
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in	QP access flags
In-Reply-To: <adaodiari8d.fsf@cisco.com>
References: <200707171758.57442.dotanb@dev.mellanox.co.il>	<469D0881.6050409@ichips.intel.com>	<469D0A80.9060906@opengridcomputing.com>
	<adaodiari8d.fsf@cisco.com>
Message-ID: <469D0CF9.1020302@opengridcomputing.com>

Roland Dreier wrote:
>  > Why are you changing this?
> 
> Because "local write" doesn't make sense as a QP permission -- the QP
> access flags are about what the remote end of a connection is allowed
> to do with RDMA.
> 
>  > I think we set it for a specific reason (but I don't remember why just
>  > now)...
> 
> I can't see anything in the cxgb3 or amso1100 drivers that would pay
> attention to this flag -- in fact amso1100 ignores qp_access_flags entirely.
> 
>  - R.

ok then...

Acked-by: Steve Wise <swise at opengridcomputing.com>


From dledford at redhat.com  Tue Jul 17 11:41:44 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 14:41:44 -0400
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com>
References: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com>
Message-ID: <1184697704.5165.534.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 11:34 -0700, Sean Hefty wrote:
> >I think everyone
> >would be better served by a process where individual maintainers were
> >responsible for releasing tarballs of their packages, with schedules
> >coordinated toward an overall "openfabrics release"
> 
> For what it's worth, I agree with this approach.

Ditto.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From dledford at redhat.com  Tue Jul 17 11:43:18 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 18:43:18 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717174526.GE7479@mellanox.co.il>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
Message-ID: <1184697799.5165.536.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 20:45 +0300, Michael S. Tsirkin wrote:
> > There are lots of things that we as a distributor have to care about
> > that upstream generally does not.  The spec file and patches are how we
> > solve our customer's problems.  They are what make a stable
> > distribution, as opposed to a "bleeding edge, must always update to
> > latest upstream version to fix any problem" system, a reality.  It's the
> > difference between RHEL and Fedora.
> 
> I think I am getting it - you want to release a patched version of some OFED
> library without going through openfabrics? OK.
> So I imagine that's when you would increment the rpm-specific version number.
> But I can't see why would an OFED release want to play with these.

You don't want to, you *have* to.  It's because you are distributing
source software packages that build RPMs.  And you aren't waiting until
OFED is final, you release pre-releases too.  So you need to be able to
tell the difference between a customer running libibverbs-1.0.4 from
OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.  In order to do
so, they need a different release number because the version number is
the same.  The only way this changes is if every component of OFED 1.3
releases their final tar.gz file in concert with OFED 1.3.  Otherwise,
at least *some* items in there will need a bumped release number.

Unless of course you are just relying on ofed_info, which as I pointed
out in my last email, is a workaround for not doing this.  We *won't*
use that workaround because having two means to tell the same thing
increases our support personnel training costs and makes things more
confusing for the customer.  We have one tool already, that's good
enough.

Additionally, once you step into the "create rpms" space, there are only
two ways things can go.  You can adhere to RPM packaging standards, and
your custom built RPMs will peacefully coexist on a system where there
are similar RPMs coming from the OS distributor, aka Red Hat.  Or, you
can do what you've been doing, where RPMs you build don't maintain
consistent numbering, and the customer can end up getting screwed when
your RPMs and our RPMs collide.

It would be careless and reckless to risk customer systems going belly
up because your RPM and mine collide in a way that renders the machine
dysfunctional.

So don't think of it as playing games with bumping release numbers,
think of it as finally making OFED RPMs standards compliant so you no
longer need the workaround of ofed_info.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From tziporet at dev.mellanox.co.il  Tue Jul 17 12:25:54 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 17 Jul 2007 22:25:54 +0300
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma
	reads	per qp
In-Reply-To: <adak5syri4s.fsf@cisco.com>
References: <200707171311.43680.jackm@dev.mellanox.co.il>
	<adak5syri4s.fsf@cisco.com>
Message-ID: <469D17C2.3040403@mellanox.co.il>

Roland Dreier wrote:
>  > Change max outstanding rdma reads per QP from 4 to 16.
>  > This enables an improvement in latency for rdma-read applications.
>
> This only affects performance if an app queues more than 4 RDMA READ
> requests, right?  (Because the 5th request doesn't have to wait for
> the 1st request to complete before it's sent)
>
> Do we want to increase this for mthca too, or is mlx4 so much faster
> than mthca that we need more requests in flight to keep the pipeline
> full?
>
>   
I suggest we do this in mthca too.

tziporet


From tziporet at dev.mellanox.co.il  Tue Jul 17 12:54:36 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 17 Jul 2007 22:54:36 +0300
Subject: [ofa-general] OFED July 16 meeting summary
Message-ID: <469D1E7C.7040701@mellanox.co.il>

*OFED July 16 meeting summary*

*1. Merge the OFED 1.2.1 release with the OFED 1.2.c release in August.*

There was a long discussion on pros & cons regarding merging the two 
releases.

Pros:
- Everybody will be focused on the same release
- All user space libs (except for the new libmlx4) are the same
- Reduce QA efforts

Cons:
- The kernel base was changed to 2.6.22, which can cause instability.
- Harder to distinguish the differences between 1.2 and 1.2.c
(since it's not only a few patches)
- The 1.2.c release was aimed at ConnectX support only. If we lump the two
releases together it may slow the convergence of this release.

In addition, we need to check with IBM and Chelsio, who actually
asked for the 1.2.1 release, whether this suits them.
Steve agreed to test 1.2.c to see if it's OK with his fixes.
We need a response from IBM too. (BTW - no patches from IBM have been sent so far.)

Decision: No decision was taken.
I suggest we stay with two different branches for now.
After more people test 1.2.c and see whether it's stable enough, we can
decide not to do 1.2.1.

*2. Agree on OFED 1.3 schedule:*

The suggested schedule:
        * Feature freeze - Sep 4
        * Alpha release - Sep 10
        * Beta release - Sep 25
        * RC1 - Oct 16
        * RC2 - Oct 30
        * RC3 - Nov 8 (assuming many of us are at SC07 on the week of 
Nov 11)
        * RC4 - Nov 20
        * GA release - Nov 30 (or first week of Dec)

Discussion:
- Due to the 1.2.c release the schedule seems very tight.
- Since 1.2.c advances only the kernel, many user-level features that
are already done are not exposed to customers in an OFED release.

Decision: Revisit the schedule in September according to the readiness
of the "must have" features.

*3. Review OFED 1.3 features list:*

There was agreement on the must-have features, except QoS, which should
be defined after the IBTA spec is published.
We have not reviewed the list of features thoroughly. Each company 
should review the features and send comments to the list.

Must have general features:
====================

    * Kernel based on 2.6.23 (all new features that will be part of this
      kernel will be included in OFED 1.3)
    * Install:
          o Break up the package RPMs (work with Novell and Redhat) to
            minimize integration effort into OS distribution
    * Package:
          o Sources arrangement for the end user (for the labs)
    * New HCAs & RNICs:
          o ConnectX support
          o NetEffect support
    * QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)

Other features (must have marked with *)
==============================

    * libibverbs: New verbs:
          o Scalable Reliable Connected Transport (with Mellanox ConnectX)*
          o Reliable Multicast?

ULPs:

    * IPoIB:
          o Performance improvements (those that will be stable on time)
          o NAPI - done
    * SDP:
          o * Keepalive
          o * AIO
    * uDAPL:
          o DAT 2.0 support with IB extensions for immediate data, atomics;
          o Add extensions for new verbs (SRCT,RM)
    * VNIC:
          o GA quality. Not a technology preview version anymore.
          o Added support for QLogic EVIC (10 Gbps
            Infiniband-to-Ethernet gateway) - in GA
    * RDS: RDMA API (using FMRs); GA quality with Oracle 11
    * NFSoRDMA integration - pending a maintainer
    * Management:
          o * Multiple partitions via libibumad
          o OpenSM
                + More routing performance improvements - done
                + Even more speedups - done
                + Better packaging/installation - done
                + "Native" daemon mode - done
                + * Performance management
                + * Quality of Service manager: Based on IBTA annex
                + Enhancements for fat tree routing (non pure tree
                  support) - done
                + More console commands and telnet access to console - done
          o More diagnostics
                + ibidsverify.pl: validate LIDs and GUIDs in subnet - done
                + Updated ibnetdiscover format with link width and
                  speed, and GUIDs - done
                + ibnetdiscover grouping support for new Voltaire
                  chassis - done
                + diag updates for IB router support - done
                + iblinkinfo.pl: Support peer port link width and speed
                  validation - done
                + ibdatacounters: Add script and man page for subnet
                  wide data counters saquery enhancements - done
    * iWARP:
          o * Chelsio: Get to GA level
          o NetEffect: Get the drivers into OFED



From coutinho at dcc.ufmg.br  Tue Jul 17 12:55:03 2007
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Tue, 17 Jul 2007 16:55:03 -0300
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717165553.GA10298@vt.edu>
References: <20070717165553.GA10298@vt.edu>
Message-ID: <a8d96dec0707171255w206e5731l68c5e9c0827656e@mail.gmail.com>

Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
It's better documented and more portable. It works with InfiniBand and iWARP
(http://en.wikipedia.org/wiki/IWARP).


2007/7/17, Bharath Ramesh <bramesh at vt.edu>:
>
> I am trying to migrate my research work to InfiniBand. I was searching
> for different resources which would help me in migrating to use
> InfiniBand. I couldn't find any technical documentation on how to develop
> applications using IB VAPI. The only documentation that closely
> resembles an API description is the InfiniBand Architecture release's
> Chapter 11 which talks about the software transport Verbs. I tried using
> the infiniband/verbs.h and to get some kind of understanding on how to
> develop code to use ibverbs.
>
> There are many aspects that one still doesn't understand. I was just
> wondering if the development community could help me in providing me
> with some resources or pointers so that I can better understand on how
> to use ibverbs. I am more interested in using the reliable datagram
> transport provided by ibverbs. I am not subscribed to the mailing list,
> I would really appreciate it if you could cc me in the reply. I really
> appreciate anyone taking time out of their busy schedule in providing me
> some help.
>
> Thanks,
>
> Bharath
>
> ---
> Bharath Ramesh       <bramesh at vt.edu>
> http://people.cs.vt.edu/~bramesh
>

From eli at mellanox.co.il  Tue Jul 17 12:59:08 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 17 Jul 2007 22:59:08 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E739A0@mtlexch01.mtl.com>

I just got back from vacation and started working on a version that does
the same for both UD and CM modes. I will send a distinct patch for CM
later this week.

-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Tuesday, July 17, 2007 8:42 PM
To: Eli Cohen
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] socket buffer accounting with UDP/ipoib

I did a quick hack to enable copybreak for UD packets up to 256 bytes
(see below).  This is still missing copybreak for CM / RC mode.
However I just wanted to see how it affected performance.  And the
answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe
HCA) it didn't make any difference in small-message latency or
throughput, at least none that I could measure with netpipe (NPtcp).


From rdreier at cisco.com  Tue Jul 17 13:11:30 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:11:30 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma
	reads	per qp
In-Reply-To: <469D17C2.3040403@mellanox.co.il> (Tziporet Koren's message of
	"Tue, 17 Jul 2007 22:25:54 +0300")
References: <200707171311.43680.jackm@dev.mellanox.co.il>
	<adak5syri4s.fsf@cisco.com> <469D17C2.3040403@mellanox.co.il>
Message-ID: <adafy3mrdsd.fsf@cisco.com>

 > I suggest we do this in mthca too.

Have you tested this to know whether it matters?  Increasing the limit
uses more memory per QP...

Does the rdma read latency test in OFED queue up enough work requests
to measure this?

 - R.


From rdreier at cisco.com  Tue Jul 17 13:15:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:15:44 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <a8d96dec0707171255w206e5731l68c5e9c0827656e@mail.gmail.com>
	(Bruno Coutinho's message of "Tue, 17 Jul 2007 16:55:03 -0300")
References: <20070717165553.GA10298@vt.edu>
	<a8d96dec0707171255w206e5731l68c5e9c0827656e@mail.gmail.com>
Message-ID: <adabqeardlb.fsf@cisco.com>

 > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
 > It's more documented and it's more portable. It works with Infiniband and iWARP
 > (http://en.wikipedia.org/wiki/IWARP).

Oh no!!

First of all I don't know of any kDAPL implementation for any OS that
is still being developed.  Everyone completely gave up on kDAPL a long
time ago.

And if all you care about is being able to work on top of IB and
iWARP, then libibverbs + librdmacm works perfectly fine without having
to add another layer and all the complexity of DAPL.  And you don't
have to worry about code like (from dapl/common/dapl_cookie.c):

    new_head = (dapl_os_atomic_read (&buffer->head) + 1) % buffer->pool_size;

    if ( new_head == dapl_os_atomic_read (&buffer->tail) )
    {
        dat_status = DAT_INSUFFICIENT_RESOURCES;
	goto bail;
    }
    else
    {
        dapl_os_atomic_set (&buffer->head, new_head);

	*cookie_ptr = &buffer->pool[dapl_os_atomic_read (&buffer->head)];
	dat_status = DAT_SUCCESS;
    }
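
The message does not spell out what is wrong with the quoted code. One
likely objection is that the read of head and the later set of head are
each atomic on their own, but the read-modify-write sequence is not, so
two callers can compute the same new_head and end up with the same cookie.
A minimal sketch of the usual fix, a compare-and-swap loop, is shown here
using C11 atomics rather than the dapl_os_atomic_* wrappers. The struct
and function names are hypothetical and this is not DAPL code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical ring of cookies; field names mirror the quoted snippet. */
struct cookie_ring {
    atomic_uint head;
    atomic_uint tail;
    unsigned pool_size;
    int *pool;            /* stand-in for the real cookie type */
};

/*
 * Claim the next slot, or return NULL if the ring is full.
 * The compare-exchange only succeeds if head still holds the value that
 * new_head was computed from, so two callers cannot claim the same slot.
 */
static int *cookie_ring_claim(struct cookie_ring *r)
{
    unsigned old_head, new_head;

    assert(r->pool_size > 1);
    old_head = atomic_load(&r->head);
    do {
        new_head = (old_head + 1) % r->pool_size;
        if (new_head == atomic_load(&r->tail))
            return NULL;
        /* On failure old_head is refreshed and the loop retries. */
    } while (!atomic_compare_exchange_weak(&r->head, &old_head, new_head));

    /* Use the local new_head; re-reading head could observe a later claim. */
    return &r->pool[new_head];
}
```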


From rdreier at cisco.com  Tue Jul 17 13:18:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:18:21 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717165553.GA10298@vt.edu> (Bharath Ramesh's message of "Tue,
	17 Jul 2007 12:55:53 -0400")
References: <20070717165553.GA10298@vt.edu>
Message-ID: <ada7ioyrdgy.fsf@cisco.com>

 > There are many aspects that one still doesnt understand. I was just
 > wondering if the development community could help me in providing me
 > with some resources or pointers so that I can better understand on how
 > to use ibverbs. I am more interested in using the reliable datagram
 > transport provided by ibverbs.

Unfortunately, no existing hardware supports RD (reliable datagram),
and even the API in libibverbs is not complete.

Anyway, the libibverbs source contains some example code in examples/,
and there are several other packages that have other examples, eg
librdmacm, the performance tests in OFED, etc.

If you have specific questions then please ask them on the mailing
list.  It's very hard to answer a general query like "please teach me
how to use IB" but specific issues are easy to address.


From mst at dev.mellanox.co.il  Tue Jul 17 13:27:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 23:27:30 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184697799.5165.536.camel@firewall.xsintricity.com>
References: <20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
Message-ID: <20070717202730.GA15990@mellanox.co.il>

> So you need to be able to
> tell the difference between a customer running libibverbs-1.0.4 from
> OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.

I don't really think we want customers to run beta code, or intend to support
such configurations.


-- 
MST


From sweitzen at cisco.com  Tue Jul 17 13:29:12 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 17 Jul 2007 13:29:12 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717202730.GA15990@mellanox.co.il>
References: <20070717152546.GA6863@mellanox.co.il><1184689249.5165.419.camel@firewall.xsintricity.com><20070717162731.GA7479@mellanox.co.il><1184690380.5165.430.camel@firewall.xsintricity.com><20070717164500.GB7479@mellanox.co.il><1184691962.5165.450.camel@firewall.xsintricity.com><20070717171250.GD7479@mellanox.co.il><1184693800.5165.480.camel@firewall.xsintricity.com><20070717174526.GE7479@mellanox.co.il><1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD1A@xmb-sjc-216.amer.cisco.com>

> > So you need to be able to
> > tell the difference between a customer running libibverbs-1.0.4 from
> > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.
> 
> I don't really think we want customers to run beta code, or 
> intend to support
> such configurations.

But we still need to tell the difference, so we can tell the customer
they are running beta code and should upgrade.

Scott
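
As an illustration of the point, here is a minimal sketch of a check a
support script might run, assuming the OFED release label (for example
"OFED-1.3-beta1" or "OFED-1.3") is recorded somewhere at install time.
The thread does not say where such a label would live, and the function
name and the "-rc" marker are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the release label looks like a pre-release build,
 * 0 otherwise. The marker strings are assumptions about naming.
 */
static int ofed_is_prerelease(const char *label)
{
    static const char *const markers[] = { "-beta", "-rc" };
    size_t i;

    assert(label != NULL);
    for (i = 0; i < sizeof(markers) / sizeof(markers[0]); i++)
        if (strstr(label, markers[i]) != NULL)
            return 1;
    return 0;
}
```

Note that this only works if the label differs between the builds; the
library version alone (libibverbs-1.0.4 in both cases) cannot distinguish
them.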


From bramesh at vt.edu  Tue Jul 17 13:34:28 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:34:28 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To: <ada7ioyrdgy.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
Message-ID: <20070717203428.GA12927@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
>  > There are many aspects that one still doesnt understand. I was just
>  > wondering if the development community could help me in providing me
>  > with some resources or pointers so that I can better understand on how
>  > to use ibverbs. I am more interested in using the reliable datagram
>  > transport provided by ibverbs.
> 
> Unfortunately, no existing hardware supports RD (reliable datagram),
> and even the API in libibverbs is not complete.
> 
> Anyway, the libibverbs source contains some example code in examples/,
> and there are several other packages that have other examples, eg
> librdmacm, the performance tests in OFED, etc.
> 
> If you have specific questions then please ask them on the mailing
> list.  It's very hard to answer a general query like "please teach me
> how to use IB" but specific issues are easy to address.
> 

Thanks for replying to my mail. I have some basic understanding of IB. I
have gone through some of the example code in the examples directory and
the OFED performance tests. I noticed that every one of those examples uses
TCP to exchange the lid, psn and qpn. My question is basically whether
there is any other way to exchange this information using
only IB. Since no hardware supports RD, I have to bite the bullet and
use RC.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh
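
For reference, the information in question is small: a 16-bit lid and a
24-bit qpn and psn per side. A minimal sketch of packing and unpacking
it as a fixed-size text message follows, modeled loosely on the
colon-separated hex string the pingpong examples send over their TCP
socket. The exact format and all names here are assumptions for
illustration; the transport that carries the bytes is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Per-side connection parameters that have to be exchanged out of band. */
struct conn_info {
    uint16_t lid;
    uint32_t qpn;   /* 24 significant bits */
    uint32_t psn;   /* 24 significant bits */
};

/* Buffer size for "llll:qqqqqq:pppppp" plus the terminating NUL. */
#define CONN_MSG_LEN sizeof "0000:000000:000000"

/* Format ci into buf. Returns 0 on success, -1 if buf is too small. */
static int conn_info_pack(const struct conn_info *ci, char *buf, size_t len)
{
    int n;

    assert(ci != NULL && buf != NULL);
    n = snprintf(buf, len, "%04x:%06x:%06x",
                 (unsigned)ci->lid,
                 (unsigned)(ci->qpn & 0xffffff),
                 (unsigned)(ci->psn & 0xffffff));
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Parse buf into ci. Returns 0 on success, -1 on malformed input. */
static int conn_info_unpack(const char *buf, struct conn_info *ci)
{
    unsigned lid, qpn, psn;

    assert(ci != NULL && buf != NULL);
    if (sscanf(buf, "%x:%x:%x", &lid, &qpn, &psn) != 3)
        return -1;
    if (lid > 0xffff || qpn > 0xffffff || psn > 0xffffff)
        return -1;
    ci->lid = (uint16_t)lid;
    ci->qpn = qpn;
    ci->psn = psn;
    return 0;
}
```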


From swise at opengridcomputing.com  Tue Jul 17 13:40:24 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 15:40:24 -0500
Subject: [ofa-general] OpenIB development help
In-Reply-To: <adabqeardlb.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu>	<a8d96dec0707171255w206e5731l68c5e9c0827656e@mail.gmail.com>
	<adabqeardlb.fsf@cisco.com>
Message-ID: <469D2938.20104@opengridcomputing.com>

Roland Dreier wrote:
>  > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
>  > It's more documented and it's more portable. It works with Infiniband and iWARP
>  > (http://en.wikipedia.org/wiki/IWARP).
> 
> Oh no!!
> 
> First of all I don't know of any kDAPL implementation for any OS that
> is still being developed.  Everyone completely gave up on kDAPL a long
> time ago.
> 
> And if all you care about is being able to work on top of IB and
> iWARP, then libibverbs + librdmacm works perfectly fine without having
> to add another layer and all the complexity of DAPL.  

And librdmacm integrates with the routing subsystem allowing the RDMA CM 
to choose the correct rdma device.  DAPL forces you to open each device 
and "hope" your remote destination is reachable...

Steve.


From leininger2 at llnl.gov  Tue Jul 17 13:43:07 2007
From: leininger2 at llnl.gov (Matt Leininger)
Date: Tue, 17 Jul 2007 13:43:07 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adazm1urji8.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<adazm1urji8.fsf@cisco.com>
Message-ID: <1184704987.7702.106.camel@hyperion>

On Tue, 2007-07-17 at 11:07 -0700, Roland Dreier wrote:
> >  - Take a look at Sean's local SA caching patches.  I merged
>  >    everything else from Sean's tree, but I'm still undecided about
>  >    these.  I haven't read them carefully yet, but even aside from that
>  >    I don't have a good feeling about whether there's consensus about
>  >    this yet.  Any opinions about merging, for or against, would be
>  >    appreciated here.
> 
> Does anyone other than Sean have an opinion here?  If you want this
> feature, if you've tested it, if you don't think it's ready yet,
> whatever, please speak up -- I don't feel comfortable making a
> decision on my own here (although I will if I have to).
  
  Roland,
 
     I would like to see these features moved upstream.  DOE funded this
work as one of the items we see as necessary for our large-scale IB
deployments (both present and future).  So from at least one big customer's
perspective, this is useful.  

    I'll let others comment on specific code/implementation issues.

  Thanks,

	- Matt

-- 
Matt Leininger, Ph.D.
Lawrence Livermore National Laboratory
leininger2 at llnl.gov
V 925-422-4110


From rdreier at cisco.com  Tue Jul 17 13:43:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:43:42 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717202730.GA15990@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 17 Jul 2007 23:27:30 +0300")
References: <20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il>
Message-ID: <ada3azmrcap.fsf@cisco.com>

 > I don't really think we want customers to run beta code

What's the point of a beta then??

 - R.


From rdreier at cisco.com  Tue Jul 17 13:44:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:44:53 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717203428.GA12927@vt.edu> (Bharath Ramesh's message of "Tue,
	17 Jul 2007 16:34:28 -0400")
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
	<20070717203428.GA12927@vt.edu>
Message-ID: <aday7hepxoa.fsf@cisco.com>

 > Thanks for replying to mail. I have a some basic understanding of IB. I
 > have gone through some of the example code in the example directory and
 > OFED performance test. I noticed that every one of those examples used
 > TCP to exchange information regarding lid, psn and qpn. My question is
 > basically that is there any other way to exchange this information using
 > only IB. Since no hardware supports RD, I have to bite the bullet and
 > use RC.

Look at librdmacm (or libibcm).  They provide higher-level
abstractions for connection establishment.


From rdreier at cisco.com  Tue Jul 17 13:45:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:45:42 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <1184704987.7702.106.camel@hyperion> (Matt Leininger's message of
	"Tue, 17 Jul 2007 13:43:07 -0700")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<adazm1urji8.fsf@cisco.com> <1184704987.7702.106.camel@hyperion>
Message-ID: <adatzs2pxmx.fsf@cisco.com>

 >      I would like to see these features moved upstream.  DOE funded this
 > work as part of the items we see needing on our large scale IB
 > deployment (both present and future).  So from at least one big customer
 > perspective we see this as useful.  

Does your reference to "present deployment" mean you are running this
code now?

 - R.


From bramesh at vt.edu  Tue Jul 17 13:47:45 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:47:45 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To: <adabqeardlb.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu>
	<a8d96dec0707171255w206e5731l68c5e9c0827656e@mail.gmail.com>
	<adabqeardlb.fsf@cisco.com>
Message-ID: <20070717204745.GB12927@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
>  > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
>  > It's more documented and it's more portable. It works with Infiniband and iWARP
>  > (http://en.wikipedia.org/wiki/IWARP).
> 
> Oh no!!
> 
> First of all I don't know of any kDAPL implementation for any OS that
> is still being developed.  Everyone completely gave up on kDAPL a long
> time ago.
> 
> And if all you care about is being able to work on top of IB and
> iWARP, then libibverbs + librdmacm works perfectly fine without having
> to add another layer and all the complexity of DAPL.  And you don't
> have to worry about code like (from dapl/common/dapl_cookie.c):
> 
>     new_head = (dapl_os_atomic_read (&buffer->head) + 1) % buffer->pool_size;
> 
>     if ( new_head == dapl_os_atomic_read (&buffer->tail) )
>     {
>         dat_status = DAT_INSUFFICIENT_RESOURCES;
> 	goto bail;
>     }
>     else
>     {
>         dapl_os_atomic_set (&buffer->head, new_head);
> 
> 	*cookie_ptr = &buffer->pool[dapl_os_atomic_read (&buffer->head)];
> 	dat_status = DAT_SUCCESS;
>     }
> 

I care only about working over IB. I don't want to add any more layers of
software because I want to minimize the number of software layers that I
need to traverse.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh


From bramesh at vt.edu  Tue Jul 17 13:52:50 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:52:50 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To: <aday7hepxoa.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
	<20070717203428.GA12927@vt.edu> <aday7hepxoa.fsf@cisco.com>
Message-ID: <20070717205250.GA13127@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
>  > Thanks for replying to mail. I have a some basic understanding of IB. I
>  > have gone through some of the example code in the example directory and
>  > OFED performance test. I noticed that every one of those examples used
>  > TCP to exchange information regarding lid, psn and qpn. My question is
>  > basically that is there any other way to exchange this information using
>  > only IB. Since no hardware supports RD, I have to bite the bullet and
>  > use RC.
> 
> Look at librdmacm (or libibcm).  They provide higher-level
> abstractions for connection establishment.
> 

Thanks for pointing me to them. Another question, somewhat off-topic: I
noticed that you are the maintainer of libibverbs in Debian. Is there
any timeline for when you might get librdmacm or libibcm into Debian
experimental/unstable?

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh


From rdreier at cisco.com  Tue Jul 17 13:58:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:58:37 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717205250.GA13127@vt.edu> (Bharath Ramesh's message of "Tue,
	17 Jul 2007 16:52:50 -0400")
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
	<20070717203428.GA12927@vt.edu> <aday7hepxoa.fsf@cisco.com>
	<20070717205250.GA13127@vt.edu>
Message-ID: <adalkdepx1e.fsf@cisco.com>

 > Thanks for pointing to them. Another question off-topic I would say. I
 > noticed that you are the maintainer for libibverbs in debian. Is there
 > any time line when you might get librdmacm or libibcm into debian
 > experimental/unstable?

I don't have any plans to do any more Debian packages (like librdmacm)
right now.  I am the upstream for libibverbs in addition to being the
Debian maintainer, which is why I package it for Debian (and Fedora).
And I am not the upstream for librdmacm.

Actually now that I think of it, I do plan to prepare libmlx4 packages
at some point, but again I am the upstream there.

 - R.


From bramesh at vt.edu  Tue Jul 17 14:02:01 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 17:02:01 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To: <adalkdepx1e.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
	<20070717203428.GA12927@vt.edu> <aday7hepxoa.fsf@cisco.com>
	<20070717205250.GA13127@vt.edu> <adalkdepx1e.fsf@cisco.com>
Message-ID: <20070717210201.GA13439@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
>  > Thanks for pointing to them. Another question off-topic I would say. I
>  > noticed that you are the maintainer for libibverbs in debian. Is there
>  > any time line when you might get librdmacm or libibcm into debian
>  > experimental/unstable?
> 
> I don't have any plans to do any more Debian packages (like librdmacm)
> right now.  I am the upstream for libibverbs in addition to being the
> Debian maintainer, which is why I package it for Debian (and Fedora).
> And I am not the upstream for librdmacm.
> 
> Actually now that I think of it, I do plan to prepare libmlx4 packages
> at some point, but again I am the upstream there.
> 
>  - R.
> 

Thanks. If there are no plans to build librdmacm for Debian for
now, I guess I will build it myself.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh


From mst at dev.mellanox.co.il  Tue Jul 17 14:09:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:09:35 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <ada3azmrcap.fsf@cisco.com>
References: <20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
Message-ID: <20070717210935.GA17168@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> 
>  > I don't really think we want customers to run beta code
> 
> What's the point of a beta then??

Donnu.
In previous OFED releases, we had "release candidates" rather than "beta".
Openfabrics members were running RCs and reporting issues on the list and in
bugzilla. Do you really ask your customers to do this for you?

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 17 14:14:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:14:44 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD1A@xmb-sjc-216.amer.cisco.com>
References: <20070717202730.GA15990@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD1A@xmb-sjc-216.amer.cisco.com>
Message-ID: <20070717211444.GB17168@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>:
> Subject: RE: [ofa-general] Re: RFC OFED-1.3 installation
> 
> > > So you need to be able to
> > > tell the difference between a customer running libibverbs-1.0.4 from
> > > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.
> > 
> > I don't really think we want customers to run beta code, or 
> > intend to support
> > such configurations.
> 
> But we still need to tell the difference, so we can tell the customer
> they are running beta code and should upgrade.

Sure, this makes sense. Non-release code such as nightly builds must
be marked as such as clearly as possible.
Installing such a version will always have an element of risk in it, though,
and I don't think we want to encourage such use in production environments.

-- 
MST


From sweitzen at cisco.com  Tue Jul 17 14:16:49 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 17 Jul 2007 14:16:49 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717210935.GA17168@mellanox.co.il>
References: <20070717162731.GA7479@mellanox.co.il><1184690380.5165.430.camel@firewall.xsintricity.com><20070717164500.GB7479@mellanox.co.il><1184691962.5165.450.camel@firewall.xsintricity.com><20070717171250.GD7479@mellanox.co.il><1184693800.5165.480.camel@firewall.xsintricity.com><20070717174526.GE7479@mellanox.co.il><1184697799.5165.536.camel@firewall.xsintricity.com><20070717202730.GA15990@mellanox.co.il>
	<ada3azmrcap.fsf@cisco.com> <20070717210935.GA17168@mellanox.co.il>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>


> >  > I don't really think we want customers to run beta code
> > 
> > What's the point of a beta then??
> 
> Donnu.
> In previous OFED releases, we had "release candidates" rather 
> than "beta".
> Openfabrics members were running RCs and reporting issues on 
> the list and in
> bugzilla. Do you really ask your customers to do this for you?

You say toMAYto, I say toMAHto.

We had many customers running various OFED 1.2 pre-GA builds for
testing; sometimes we had to use a daily build because of certain bug
fixes.

Scott


From arthur.jones at qlogic.com  Tue Jul 17 14:19:18 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 17 Jul 2007 14:19:18 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipath: Make a few functions static
In-Reply-To: <adawsx0utlz.fsf@cisco.com>
References: <ada1wf8w8gp.fsf@cisco.com> <adawsx0utlz.fsf@cisco.com>
Message-ID: <20070717211918.GD30170@bauxite.pathscale.com>

hi roland, this patch looks good, thanks!

arthur

On Mon, Jul 16, 2007 at 10:49:12AM -0700, Roland Dreier wrote:
> Make some functions that are only used in a single .c file static.  In
> addition to being a cleanup, this shrinks the generated code.  On x86_64:
> 
> add/remove: 1/3 grow/shrink: 2/1 up/down: 4777/-4956 (-179)
> function                                     old     new   delta
> handle_errors                                  -    3994   +3994
> __verbs_timer                                 42     710    +668
> ipath_do_ruc_send                           2131    2246    +115
> ipath_no_bufs_available                      136       -    -136
> ipath_disarm_senderrbufs                     639       -    -639
> ipath_ib_timer                               658       -    -658
> ipath_intr                                  5878    2355   -3523
> 
> Signed-off-by: Roland Dreier <rolandd at cisco.com>
> ---
> Does this look OK to merge for 2.6.23?
> 
> diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
> index 9361f5a..09c5fd8 100644
> --- a/drivers/infiniband/hw/ipath/ipath_driver.c
> +++ b/drivers/infiniband/hw/ipath/ipath_driver.c
> @@ -1889,7 +1889,7 @@ void ipath_write_kreg_port(const struct ipath_devdata *dd, ipath_kreg regno,
>  /* Below is "non-zero" to force override, but both actual LEDs are off */
>  #define LED_OVER_BOTH_OFF (8)
>  
> -void ipath_run_led_override(unsigned long opaque)
> +static void ipath_run_led_override(unsigned long opaque)
>  {
>  	struct ipath_devdata *dd = (struct ipath_devdata *)opaque;
>  	int timeoff;
> diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c
> index 6b91479..b4503e9 100644
> --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c
> +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c
> @@ -426,8 +426,8 @@ bail:
>   * @buffer: data to write
>   * @len: number of bytes to write
>   */
> -int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset,
> -				const void *buffer, int len)
> +static int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset,
> +				       const void *buffer, int len)
>  {
>  	u8 single_byte;
>  	int sub_len;
> diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
> index 47aa434..1fd91c5 100644
> --- a/drivers/infiniband/hw/ipath/ipath_intr.c
> +++ b/drivers/infiniband/hw/ipath/ipath_intr.c
> @@ -70,7 +70,7 @@ static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum)
>   * If rewrite is true, and bits are set in the sendbufferror registers,
>   * we'll write to the buffer, for error recovery on parity errors.
>   */
> -void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite)
> +static void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite)
>  {
>  	u32 piobcnt;
>  	unsigned long sbuf[4];
> diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
> index 3105005..b6ccd04 100644
> --- a/drivers/infiniband/hw/ipath/ipath_kernel.h
> +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
> @@ -776,7 +776,6 @@ void ipath_get_eeprom_info(struct ipath_devdata *);
>  int ipath_update_eeprom_log(struct ipath_devdata *dd);
>  void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr);
>  u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg);
> -void ipath_disarm_senderrbufs(struct ipath_devdata *, int);
>  
>  /*
>   * Set LED override, only the two LSBs have "public" meaning, but
> diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
> index 8525674..c69c252 100644
> --- a/drivers/infiniband/hw/ipath/ipath_ruc.c
> +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
> @@ -507,7 +507,7 @@ static int want_buffer(struct ipath_devdata *dd)
>   *
>   * Called when we run out of PIO buffers.
>   */
> -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev)
> +static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev)
>  {
>  	unsigned long flags;
>  
> diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
> index 65f7181..16aa61f 100644
> --- a/drivers/infiniband/hw/ipath/ipath_verbs.c
> +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
> @@ -488,7 +488,7 @@ bail:;
>   * This is called from ipath_do_rcv_timer() at interrupt level to check for
>   * QPs which need retransmits and to collect performance numbers.
>   */
> -void ipath_ib_timer(struct ipath_ibdev *dev)
> +static void ipath_ib_timer(struct ipath_ibdev *dev)
>  {
>  	struct ipath_qp *resend = NULL;
>  	struct list_head *last;
> diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
> index f3d1f2c..9bbe819 100644
> --- a/drivers/infiniband/hw/ipath/ipath_verbs.h
> +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
> @@ -782,8 +782,6 @@ void ipath_update_mmap_info(struct ipath_ibdev *dev,
>  
>  int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>  
> -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev);
> -
>  void ipath_insert_rnr_queue(struct ipath_qp *qp);
>  
>  int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only);
> @@ -807,8 +805,6 @@ void ipath_ib_rcv(struct ipath_ibdev *, void *, void *, u32);
>  
>  int ipath_ib_piobufavail(struct ipath_ibdev *);
>  
> -void ipath_ib_timer(struct ipath_ibdev *);
> -
>  unsigned ipath_get_npkeys(struct ipath_devdata *);
>  
>  u32 ipath_get_cr_errpkey(struct ipath_devdata *);


From arthur.jones at qlogic.com  Tue Jul 17 14:20:05 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 17 Jul 2007 14:20:05 -0700
Subject: [ofa-general] Re: is ipath_get_user_pages_nocopy() dead code?
In-Reply-To: <adasl7outkw.fsf@cisco.com>
References: <ada1wf8w8gp.fsf@cisco.com> <adasl7outkw.fsf@cisco.com>
Message-ID: <20070717212004.GE30170@bauxite.pathscale.com>

hi roland, ...

On Mon, Jul 16, 2007 at 10:49:51AM -0700, Roland Dreier wrote:
> I don't see any callers of ipath_get_user_pages_nocopy().  Should we
> just delete it?

yes, shall i queue it up and post it?

thanks...

arthur


From arthur.jones at qlogic.com  Tue Jul 17 14:20:59 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 17 Jul 2007 14:20:59 -0700
Subject: [ofa-general] is ipath_layer.c dead code?
In-Reply-To: <ada1wf8w8gp.fsf@cisco.com>
References: <ada1wf8w8gp.fsf@cisco.com>
Message-ID: <20070717212059.GF30170@bauxite.pathscale.com>

hi roland, i'm still testing this one,
i'll get back to you soon...

arthur

On Mon, Jul 16, 2007 at 10:43:02AM -0700, Roland Dreier wrote:
> My kernel seems to build and link fine with the patch below.  Is
> ipath_layer.c being used for anything, or can we just kill it?
> 
>  - R.
> 
> diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile
> index ec2e603..fe67388 100644
> --- a/drivers/infiniband/hw/ipath/Makefile
> +++ b/drivers/infiniband/hw/ipath/Makefile
> @@ -14,7 +14,6 @@ ib_ipath-y := \
>  	ipath_init_chip.o \
>  	ipath_intr.o \
>  	ipath_keys.o \
> -	ipath_layer.o \
>  	ipath_mad.o \
>  	ipath_mmap.o \
>  	ipath_mr.o \
> diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c
> deleted file mode 100644
> index 82616b7..0000000
> --- a/drivers/infiniband/hw/ipath/ipath_layer.c
> +++ /dev/null
> @@ -1,365 +0,0 @@
> -/*
> - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
> - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
> - *
> - * This software is available to you under a choice of one of two
> - * licenses.  You may choose to be licensed under the terms of the GNU
> - * General Public License (GPL) Version 2, available from the file
> - * COPYING in the main directory of this source tree, or the
> - * OpenIB.org BSD license below:
> - *
> - *     Redistribution and use in source and binary forms, with or
> - *     without modification, are permitted provided that the following
> - *     conditions are met:
> - *
> - *      - Redistributions of source code must retain the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer.
> - *
> - *      - Redistributions in binary form must reproduce the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer in the documentation and/or other materials
> - *        provided with the distribution.
> - *
> - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> - * SOFTWARE.
> - */
> -
> -/*
> - * These are the routines used by layered drivers, currently just the
> - * layered ethernet driver and verbs layer.
> - */
> -
> -#include <linux/io.h>
> -#include <asm/byteorder.h>
> -
> -#include "ipath_kernel.h"
> -#include "ipath_layer.h"
> -#include "ipath_verbs.h"
> -#include "ipath_common.h"
> -
> -/* Acquire before ipath_devs_lock. */
> -static DEFINE_MUTEX(ipath_layer_mutex);
> -
> -u16 ipath_layer_rcv_opcode;
> -
> -static int (*layer_intr)(void *, u32);
> -static int (*layer_rcv)(void *, void *, struct sk_buff *);
> -static int (*layer_rcv_lid)(void *, void *);
> -
> -static void *(*layer_add_one)(int, struct ipath_devdata *);
> -static void (*layer_remove_one)(void *);
> -
> -int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_intr)
> -		ret = layer_intr(dd->ipath_layer.l_arg, arg);
> -
> -	return ret;
> -}
> -
> -int ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
> -{
> -	int ret;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	ret = __ipath_layer_intr(dd, arg);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return ret;
> -}
> -
> -int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr,
> -		      struct sk_buff *skb)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_rcv)
> -		ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb);
> -
> -	return ret;
> -}
> -
> -int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_rcv_lid)
> -		ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr);
> -
> -	return ret;
> -}
> -
> -void ipath_layer_lid_changed(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (dd->ipath_layer.l_arg && layer_intr)
> -		layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -void ipath_layer_add(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (layer_add_one)
> -		dd->ipath_layer.l_arg =
> -			layer_add_one(dd->ipath_unit, dd);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -void ipath_layer_remove(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (dd->ipath_layer.l_arg && layer_remove_one) {
> -		layer_remove_one(dd->ipath_layer.l_arg);
> -		dd->ipath_layer.l_arg = NULL;
> -	}
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *),
> -			 void (*l_remove)(void *),
> -			 int (*l_intr)(void *, u32),
> -			 int (*l_rcv)(void *, void *, struct sk_buff *),
> -			 u16 l_rcv_opcode,
> -			 int (*l_rcv_lid)(void *, void *))
> -{
> -	struct ipath_devdata *dd, *tmp;
> -	unsigned long flags;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	layer_add_one = l_add;
> -	layer_remove_one = l_remove;
> -	layer_intr = l_intr;
> -	layer_rcv = l_rcv;
> -	layer_rcv_lid = l_rcv_lid;
> -	ipath_layer_rcv_opcode = l_rcv_opcode;
> -
> -	spin_lock_irqsave(&ipath_devs_lock, flags);
> -
> -	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
> -		if (!(dd->ipath_flags & IPATH_INITTED))
> -			continue;
> -
> -		if (dd->ipath_layer.l_arg)
> -			continue;
> -
> -		spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -		dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd);
> -		spin_lock_irqsave(&ipath_devs_lock, flags);
> -	}
> -
> -	spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_register);
> -
> -void ipath_layer_unregister(void)
> -{
> -	struct ipath_devdata *dd, *tmp;
> -	unsigned long flags;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -	spin_lock_irqsave(&ipath_devs_lock, flags);
> -
> -	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
> -		if (dd->ipath_layer.l_arg && layer_remove_one) {
> -			spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -			layer_remove_one(dd->ipath_layer.l_arg);
> -			spin_lock_irqsave(&ipath_devs_lock, flags);
> -			dd->ipath_layer.l_arg = NULL;
> -		}
> -	}
> -
> -	spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -
> -	layer_add_one = NULL;
> -	layer_remove_one = NULL;
> -	layer_intr = NULL;
> -	layer_rcv = NULL;
> -	layer_rcv_lid = NULL;
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_unregister);
> -
> -int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax)
> -{
> -	int ret;
> -	u32 intval = 0;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (!dd->ipath_layer.l_arg) {
> -		ret = -EINVAL;
> -		goto bail;
> -	}
> -
> -	ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS);
> -
> -	if (ret < 0)
> -		goto bail;
> -
> -	*pktmax = dd->ipath_ibmaxlen;
> -
> -	if (*dd->ipath_statusp & IPATH_STATUS_IB_READY)
> -		intval |= IPATH_LAYER_INT_IF_UP;
> -	if (dd->ipath_lid)
> -		intval |= IPATH_LAYER_INT_LID;
> -	if (dd->ipath_mlid)
> -		intval |= IPATH_LAYER_INT_BCAST;
> -	/*
> -	 * do this on open, in case low level is already up and
> -	 * just layered driver was reloaded, etc.
> -	 */
> -	if (intval)
> -		layer_intr(dd->ipath_layer.l_arg, intval);
> -
> -	ret = 0;
> -bail:
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return ret;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_open);
> -
> -u16 ipath_layer_get_lid(struct ipath_devdata *dd)
> -{
> -	return dd->ipath_lid;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_lid);
> -
> -/**
> - * ipath_layer_get_mac - get the MAC address
> - * @dd: the infinipath device
> - * @mac: the MAC is put here
> - *
> - * This is the EUID-64 OUI octets (top 3), then
> - * skip the next 2 (which should both be zero or 0xff).
> - * The returned MAC is in network order
> - * mac points to at least 6 bytes of buffer
> - * We assume that by the time the LID is set, that the GUID is as valid
> - * as it's ever going to be, rather than adding yet another status bit.
> - */
> -
> -int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac)
> -{
> -	u8 *guid;
> -
> -	guid = (u8 *) &dd->ipath_guid;
> -
> -	mac[0] = guid[0];
> -	mac[1] = guid[1];
> -	mac[2] = guid[2];
> -	mac[3] = guid[5];
> -	mac[4] = guid[6];
> -	mac[5] = guid[7];
> -	if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff))
> -		ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: "
> -			  "%x %x\n", guid[3], guid[4]);
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_mac);
> -
> -u16 ipath_layer_get_bcast(struct ipath_devdata *dd)
> -{
> -	return dd->ipath_mlid;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_bcast);
> -
> -int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr)
> -{
> -	int ret = 0;
> -	u32 __iomem *piobuf;
> -	u32 plen, *uhdr;
> -	size_t count;
> -	__be16 vlsllnh;
> -
> -	if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) {
> -		ipath_dbg("send while not open\n");
> -		ret = -EINVAL;
> -	} else
> -		if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) ||
> -		    dd->ipath_lid == 0) {
> -			/*
> -			 * lid check is for when sma hasn't yet configured
> -			 */
> -			ret = -ENETDOWN;
> -			ipath_cdbg(VERBOSE, "send while not ready, "
> -				   "mylid=%u, flags=0x%x\n",
> -				   dd->ipath_lid, dd->ipath_flags);
> -		}
> -
> -	vlsllnh = *((__be16 *) hdr);
> -	if (vlsllnh != htons(IPATH_LRH_BTH)) {
> -		ipath_dbg("Warning: lrh[0] wrong (%x, not %x); "
> -			  "not sending\n", be16_to_cpu(vlsllnh),
> -			  IPATH_LRH_BTH);
> -		ret = -EINVAL;
> -	}
> -	if (ret)
> -		goto done;
> -
> -	/* Get a PIO buffer to use. */
> -	piobuf = ipath_getpiobuf(dd, NULL);
> -	if (piobuf == NULL) {
> -		ret = -EBUSY;
> -		goto done;
> -	}
> -
> -	plen = (sizeof(*hdr) >> 2); /* actual length */
> -	ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf);
> -
> -	writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */
> -	ipath_flush_wc();
> -	piobuf += 2;
> -	uhdr = (u32 *)hdr;
> -	count = plen-1; /* amount we can copy before trigger word */
> -	__iowrite32_copy(piobuf, uhdr, count);
> -	ipath_flush_wc();
> -	__raw_writel(uhdr[count], piobuf + count);
> -	ipath_flush_wc(); /* ensure it's sent, now */
> -
> -	ipath_stats.sps_ether_spkts++;	/* ether packet sent */
> -
> -done:
> -	return ret;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_send_hdr);
> -
> -int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd)
> -{
> -	set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl);
> -
> -	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
> -			 dd->ipath_sendctrl);
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int);
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
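
For reference, the GUID-to-MAC mapping done by the quoted ipath_layer_get_mac() can be sketched on its own as below. This is only an illustration under assumptions: guid_to_mac() is a made-up name, it works on plain byte arrays instead of struct ipath_devdata, and it returns a flag where the driver prints a debug warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone version of the mapping: copy the top three
 * GUID octets (the OUI) and the bottom three, skipping bytes 3 and 4.
 * Returns 1 if the skipped bytes are both 0x00 or both 0xff, else 0
 * (the case where the driver code logs a warning).
 */
int guid_to_mac(const uint8_t guid[8], uint8_t mac[6])
{
	assert(guid != NULL && mac != NULL);

	mac[0] = guid[0];
	mac[1] = guid[1];
	mac[2] = guid[2];
	mac[3] = guid[5];
	mac[4] = guid[6];
	mac[5] = guid[7];

	if (guid[3] == 0x00 && guid[4] == 0x00)
		return 1;
	if (guid[3] == 0xff && guid[4] == 0xff)
		return 1;
	return 0;
}
```

The MAC is still written when the return value is 0, which mirrors the driver: it only complains about the middle bytes and returns success anyway.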


From mst at dev.mellanox.co.il  Tue Jul 17 14:32:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:32:15 +0300
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <adalkdesyrs.fsf@cisco.com>
References: <OF2377E85C.B05BDE94-ONC125731A.006E64A7-C125731A.00714F9E@de.ibm.com>
	<ada1wf7vgfb.fsf@cisco.com> <20070717043740.GB8527@mellanox.co.il>
	<adalkdesyrs.fsf@cisco.com>
Message-ID: <20070717213215.GC17168@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH 01/10] IB/ehca: Support for multiple event queues
> 
>  > Here's some anecdotal evidence :)
>  > http://lists.openfabrics.org/pipermail/general/2007-May/035758.html
> 
> Right, but then we went on to say that we probably want to use
> multiple vectors to separate out multiple HCA ports rather than
> send/receive on the same port.  And the current IPoIB implementation
> of having that second CQ seems suboptimal anyway, since it seems to
> leave us susceptible to the interrupt overload that NAPI was supposed
> to solve.

Sure, the ipoib patch is just a proof of concept anyway.
And I'm actually working on merging send/recv CQs now,
to address the livelocks.

> At a higher level, I'm left wondering why nobody talked about multiple
> EQs during the last months of the 2.6.22 process and now all of a
> sudden it becomes urgent in the last few days of the 2.6.23 merge
> window.

I don't see any emergency in merging the IPoIB hack either.

I just hoped that once we merge the core changes people will start
experimenting with multiple vectors. This does not seem to have happened.
Could this be because there's no low level driver support upstream yet?

So I wonder whether merging the mthca patch [that was patch 2 of the series]
in 2.6.23 will finally get the ball rolling, get people to experiment with
multiple vectors in userspace, and hopefully teach us something.

> That's not really how I like to merge features....

If you look just at the mthca patch in isolation,
do you still see a problem?

-- 
MST


From rdreier at cisco.com  Tue Jul 17 14:38:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 14:38:13 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipath: Make a few functions static
In-Reply-To: <20070717211918.GD30170@bauxite.pathscale.com> (Arthur Jones's
	message of "Tue, 17 Jul 2007 14:19:18 -0700")
References: <ada1wf8w8gp.fsf@cisco.com> <adawsx0utlz.fsf@cisco.com>
	<20070717211918.GD30170@bauxite.pathscale.com>
Message-ID: <adahco2pv7e.fsf@cisco.com>

OK, I queued it for my next merge.


From rdreier at cisco.com  Tue Jul 17 14:38:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 14:38:31 -0700
Subject: [ofa-general] Re: is ipath_get_user_pages_nocopy() dead code?
In-Reply-To: <20070717212004.GE30170@bauxite.pathscale.com> (Arthur Jones's
	message of "Tue, 17 Jul 2007 14:20:05 -0700")
References: <ada1wf8w8gp.fsf@cisco.com> <adasl7outkw.fsf@cisco.com>
	<20070717212004.GE30170@bauxite.pathscale.com>
Message-ID: <adad4yqpv6w.fsf@cisco.com>

 > yes, shall i queue it up and post it?

No need, I can do it locally just as easily.


From mst at dev.mellanox.co.il  Tue Jul 17 14:44:17 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:44:17 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <adazm1urji8.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<adazm1urji8.fsf@cisco.com>
Message-ID: <20070717214417.GE17168@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: Further 2.6.23 merge plans...
> 
>  >  - Take a look at Sean's local SA caching patches.  I merged
>  >    everything else from Sean's tree, but I'm still undecided about
>  >    these.  I haven't read them carefully yet, but even aside from that
>  >    I don't have a good feeling about whether there's consensus about
>  >    this yet.  Any opinions about merging, for or against, would be
>  >    appreciated here.
> 
> Does anyone other than Sean have an opinion here?  If you want this
> feature, if you've tested it, if you don't think it's ready yet,
> whatever, please speak up -- I don't feel comfortable making a
> decision on my own here (although I will if I have to).

We have the patches applied in ofed 1.2.c with default module parameter set to
caching disabled (ofed 1.2 had a different version of the patches, but caching
is disabled by default there, too). At least in this configuration
(caching disabled), all issues I've seen seem to be fixed now, and tests seem to
be running smoothly.

So I think it's safe to merge it up if the module parameter
is set to cache disabled by default.
No idea what happens if it's enabled though :)

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 17 14:58:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:58:11 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>
References: <ada3azmrcap.fsf@cisco.com> <20070717210935.GA17168@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>
Message-ID: <20070717215811.GA19243@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>:
> Subject: RE: [ofa-general] Re: RFC OFED-1.3 installation
> 
> 
> > >  > I don't really think we want customers to run beta code
> > > 
> > > What's the point of a beta then??
> > 
> > Dunno.
> > In previous OFED releases, we had "release candidates" rather 
> > than "beta".
> > Openfabrics members were running RCs and reporting issues on 
> > the list and in
> > bugzilla. Do you really ask your customers to do this for you?
> 
> You say toMAYto, I say toMAHto.
> 
> We had many customers running various OFED 1.2 pre-GA builds for
> testing, sometimes we had to use a daily build because of certain bug
> fixes.

OK then, I guess we could try to make it easy to switch between RCs.
But daily ... we don't want to increment a revision on each change, do we?

Maybe a nonstandard way like ofedinfo is enough for these testing setups?

-- 
MST


From rdreier at cisco.com  Tue Jul 17 15:07:27 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 15:07:27 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717215811.GA19243@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 18 Jul 2007 00:58:11 +0300")
References: <ada3azmrcap.fsf@cisco.com> <20070717210935.GA17168@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>
	<20070717215811.GA19243@mellanox.co.il>
Message-ID: <ada4pk2ptuo.fsf@cisco.com>

 > But daily ... we don't want to increment a revision on each change, do we?

I think it's easy enough to make the revision of the RPMS be something
like -0.1.2007-07-17.1 or something like that.
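
A minimal sketch of how a build script might generate such a release string, assuming the dashes inside the date are dropped (RPM already uses '-' to separate name, version and release) and that the trailing number is a per-day build counter. The name format_daily_release() is made up for illustration and is not taken from any OFED build script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: write "0.1.YYYYMMDD.N" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
int format_daily_release(char *buf, size_t len, const struct tm *when,
			 int build)
{
	char date[9];	/* "YYYYMMDD" plus the terminating NUL */
	int n;

	assert(buf != NULL && when != NULL);

	if (strftime(date, sizeof date, "%Y%m%d", when) != strlen("YYYYMMDD"))
		return -1;

	n = snprintf(buf, len, "0.1.%s.%d", date, build);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

Because the date is written most-significant field first, a later daily build compares as newer than an earlier one, so moving between daily builds looks like an ordinary upgrade to the package manager.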


From mst at dev.mellanox.co.il  Tue Jul 17 15:12:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 01:12:06 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <ada4pk2ptuo.fsf@cisco.com>
References: <ada3azmrcap.fsf@cisco.com> <20070717210935.GA17168@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>
	<20070717215811.GA19243@mellanox.co.il> <ada4pk2ptuo.fsf@cisco.com>
Message-ID: <20070717221206.GC19243@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> 
>  > But daily ... we don't want to increment a revision on each change, do we?
> 
> I think it's easy enough to make the revision of the RPMS be something
> like -0.1.2007-07-17.1 or something like that.

OK, so you say just ignore the content and stick a date in there?
Fine, that'll work, and we can cover the RCs this way too I think.

-- 
MST


From hal.rosenstock at gmail.com  Tue Jul 17 15:33:11 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Jul 2007 15:33:11 -0700
Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
Message-ID: <f0e08f230707171533p71a9a020ief253303facb51d8@mail.gmail.com>

Hi Tziporet,

On 7/16/07, Tziporet Koren <tziporet at mellanox.co.il> wrote:
>
>  Hi All,
>
> We have our OFED synch meeting today at 9am PST.
>
> Agenda:
> 1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August.
> 2. Agree on OFED 1.3 schedule:
>         * Feature freeze - Sep 4
>         * Alpha release - Sep 10
>         * Beta release - Sep 25
>         * RC1 - Oct 16
>         * RC2 - Oct 30
>         * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
>         * RC4 - Nov 20
>         * GA release - Nov 30 (or first week of Dec)
> 3. Review OFED 1.3 features list:
> In the last meeting we decided that the schedule is one of the most important
> parameters in OFED 1.3.
> Thus I divided the features into two categories:
>
>    - "must have" features - features that must be ready for the release
>    (marked with *)
>    - "optional" features - features that can be included in the release
>    in case they are ready according to the schedule
>
> Must have general features:
> ====================
>
>    - Kernel based on 2.6.23 (all new features that will be part of this
>    kernel will be included in OFED 1.3)
>    - Install:
>       - Break the packages RPMs (work with Novell and Redhat) to
>       minimize integration effort into OS distribution
>    - Package:
>       - Sources arrangement for the end user (for the labs)
>    - New HCAs & RNICs:
>       - ConnectX support
>       - Any other new HW?
>    - QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)
>
> Other features (must have marked with *)
> ==============================
>
>    - libibverbs: New verbs:
>       - Scalable Reliable Connected Transport (with Mellanox
>       ConnectX)*
>       - Reliable Multicast?
>
> ULPs:
>
>    - IPoIB:
>       - Performance improvements (those that will be stable on time)
>       - NAPI - done
>    - SDP:
>       - * Keepalive
>       - * AIO
>    - uDAPL:
>       - DAT 2.0 support with IB extensions for immediate data,
>       atomics;
>       - Add extensions for new verbs (SRCT,RM)
>    - VNIC:
>       - GA quality. Not a technology preview version anymore.
>       - Added support for QLogic EVIC (10 Gbps
>       Infiniband-to-Ethernet gateway) - in GA
>    - RDS: RDMA API (using FMRs); GA quality with Oracle 11
>    - NFSoRDMA integration - pending we have a maintainer
>    - Management:
>       - * Multiple partitions via libibumad
>       - OpenSM
>          - More routing performance improvements - done
>          - Even more speedups - done
>          - Better packaging/installation - done
>          - "Native" daemon mode - done
>          - * Performance management
>          - * Quality of Service manager: Based on IBTA annex
>          - Enhancements for fat tree routing (non pure tree
>          support) - done
>          - More console commands and telnet access to console -
>          done
>       - More diagnostics
>          - ibidsverify.pl: validate LIDs and GUIDs in subnet -
>          done
>          - Updated ibnetdiscover format with link width and
>          speed, and GUIDs - done
>          - ibnetdiscover grouping support for new Voltaire
>          chassis - done
>          - diag updates for IB router support - done
>          - iblinkinfo.pl: Support peer port link width and speed
>          validation - done
>          - ibdatacounters: Add script and man page for subnet
>          wide data counters saquery enhancements - done
>
>
What happened to ibsim?  I thought that was on the list I originally sent.

-- Hal


>    - iWARP:
>       - * Chelsio: Get to GA level
>       - NetEffect: Get the drivers into OFED
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: *tziporet at mellanox.co.il* <tziporet at mellanox.co.il>
> Tel +972-4-9097200, ext 380
>
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>

From pradeeps at linux.vnet.ibm.com  Tue Jul 17 15:39:27 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 17 Jul 2007 15:39:27 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch
Message-ID: <469D451F.6030706@linux.vnet.ibm.com>

Here is the seventh version of the IPOIB_CM_NOSRQ patch.

Changes from V6:
1. Minor changes incorporating Sean Hefty's comments (changed
spin lock to an atomic and additional cleanups)

This patch has been tested with linux-2.6.22 derived from Roland's
for-2.6.23 git tree on ppc64 machines

Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---
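
For readers following the diff below: in the NOSRQ path the 64-bit receive id carries two things at once, the slot in the per-QP rx_ring (upper 32 bits) and the position of the QP in rx_index_table (low bits, masked with NOSRQ_INDEX_MASK). The following stand-alone sketch shows that packing. The helper names nosrq_pack_id(), nosrq_id_to_index() and nosrq_id_to_slot() are made up for illustration; the patch itself open-codes the shifts and masks.

```c
#include <assert.h>
#include <stdint.h>

#define NOSRQ_INDEX_TABLE_SIZE 128
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)
#define IPOIB_CM_OP_RECV       (UINT64_C(1) << 30)

/* Ring slot goes in the upper 32 bits, QP table index in the low bits. */
static inline uint64_t nosrq_pack_id(uint32_t slot, uint32_t index)
{
	assert(index <= NOSRQ_INDEX_MASK);
	return ((uint64_t)slot << 32) | index;
}

/* Recover the rx_index_table position from an id. */
static inline uint32_t nosrq_id_to_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}

/* Recover the rx_ring slot from an id. */
static inline uint32_t nosrq_id_to_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}
```

Bit 30 (IPOIB_CM_OP_RECV) lies above the 7-bit index mask and below the slot field, so OR-ing it into the id before posting the work request disturbs neither value.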

--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-10 18:30:10.000000000 -0400
@@ -95,11 +95,16 @@ enum {
  	IPOIB_MCAST_FLAG_ATTACHED = 3,
  };

+#define CM_PACKET_SIZE (1ul << 16)
  #define	IPOIB_OP_RECV   (1ul << 31)
  #ifdef CONFIG_INFINIBAND_IPOIB_CM
-#define	IPOIB_CM_OP_SRQ (1ul << 30)
+#define	IPOIB_CM_OP_RECV (1ul << 30)
+
+#define NOSRQ_INDEX_TABLE_SIZE 128
+#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_TABLE_SIZE -1)
+
  #else
-#define	IPOIB_CM_OP_SRQ (0)
+#define	IPOIB_CM_OP_RECV (0)
  #endif

  /* structs */
@@ -166,11 +171,14 @@ enum ipoib_cm_state {
  };

  struct ipoib_cm_rx {
-	struct ib_cm_id     *id;
-	struct ib_qp        *qp;
-	struct list_head     list;
-	struct net_device   *dev;
-	unsigned long        jiffies;
+	struct ib_cm_id     	*id;
+	struct ib_qp        	*qp;
+	struct ipoib_cm_rx_buf  *rx_ring; /* Used by NOSRQ only */
+	struct list_head     	 list;
+	struct net_device   	*dev;
+	unsigned long        	 jiffies;
+	u32                      index; /* wr_ids are distinguished by index
+					 * to identify the QP -NOSRQ only */
  	enum ipoib_cm_state  state;
  };

@@ -215,6 +223,8 @@ struct ipoib_cm_dev_priv {
  	struct ib_wc            ibwc[IPOIB_NUM_WC];
  	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
  	struct ib_recv_wr       rx_wr;
+	struct ipoib_cm_rx	**rx_index_table; /* See ipoib_cm_dev_init()
+						   *for usage of this element */
  };

  /*
@@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long
  	dev_kfree_skb_any(skb);
  }

-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
  {
  }
-
  #endif

  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-10 17:02:33.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 17:45:16.000000000 -0400
@@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level,

  #include "ipoib.h"

+int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
+int max_recv_buf = 1024; /* Default is 1024 MB */
+
+module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
+MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
+
+module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
+MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
+
+atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */
+
  #define IPOIB_CM_IETF_ID 0x1000000000000000ULL

  #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ)
@@ -81,20 +92,20 @@ static void ipoib_cm_dma_unmap_rx(struct
  		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
  }

-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int post_receive_srq(struct net_device *dev, u64 id)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
  	struct ib_recv_wr *bad_wr;
  	int i, ret;

-	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;

  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
  		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];

  	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
  	if (unlikely(ret)) {
-		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
+		ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret);
  		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
  				      priv->cm.srq_ring[id].mapping);
  		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
@@ -104,12 +115,47 @@ static int ipoib_cm_post_receive(struct
  	return ret;
  }

-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int post_receive_nosrq(struct net_device *dev, u64 id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+	u32 index;
+	u32 wr_id;
+	struct ipoib_cm_rx *rx_ptr;
+
+	index = id  & NOSRQ_INDEX_MASK ;
+	wr_id = id >> 32;
+
+	rx_ptr = priv->cm.rx_index_table[index];
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i];
+
+	ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post recv failed for buf %d (%d)\n",
+		           wr_id, ret);
+		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+		                      rx_ptr->rx_ring[wr_id].mapping);
+		dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb);
+		rx_ptr->rx_ring[wr_id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id,
+					     int frags,
  					     u64 mapping[IPOIB_CM_RX_SG])
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
  	struct sk_buff *skb;
  	int i;
+	struct ipoib_cm_rx *rx_ptr;
+	u32 index, wr_id;

  	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
  	if (unlikely(!skb))
@@ -141,7 +187,14 @@ static struct sk_buff *ipoib_cm_alloc_rx
  			goto partial_error;
  	}

-	priv->cm.srq_ring[id].skb = skb;
+	if (priv->cm.srq)
+		priv->cm.srq_ring[id].skb = skb;
+	else {
+		index = id  & NOSRQ_INDEX_MASK ;
+		wr_id = id >> 32;
+		rx_ptr = priv->cm.rx_index_table[index];
+		rx_ptr->rx_ring[wr_id].skb = skb;
+	}
  	return skb;

  partial_error:
@@ -198,16 +251,21 @@ static struct ib_qp *ipoib_cm_create_rx_
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
  	struct ib_qp_init_attr attr = {
-		.event_handler = ipoib_cm_rx_event_handler,
  		.send_cq = priv->cq, /* For drain WR */
  		.recv_cq = priv->cq,
  		.srq = priv->cm.srq,
  		.cap.max_send_wr = 1, /* For drain WR */
+		.cap.max_recv_wr = ipoib_recvq_size + 1,
  		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
  		.sq_sig_type = IB_SIGNAL_ALL_WR,
  		.qp_type = IB_QPT_RC,
  		.qp_context = p,
  	};
+	if (!priv->cm.srq) {
+		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;	
+		attr.event_handler = NULL;
+	} else
+		attr.event_handler = ipoib_cm_rx_event_handler;
  	return ib_create_qp(priv->pd, &attr);
  }

@@ -282,12 +340,129 @@ static int ipoib_cm_send_rep(struct net_
  	rep.flow_control = 0;
  	rep.rnr_retry_count = req->rnr_retry_count;
  	rep.target_ack_delay = 20; /* FIXME */
-	rep.srq = 1;
  	rep.qp_num = qp->qp_num;
  	rep.starting_psn = psn;
+	rep.srq	= !!priv->cm.srq;
  	return ib_send_cm_rep(cm_id, &rep);
  }

+static void init_context_and_add_list(struct ib_cm_id *cm_id,
+				    struct ipoib_cm_rx *p,
+				    struct ipoib_dev_priv *priv)
+{
+	cm_id->context = p;
+	p->jiffies = jiffies;
+	spin_lock_irq(&priv->lock);
+	if (list_empty(&priv->cm.passive_ids))
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
+	if (priv->cm.srq) {
+		/* Add this entry to passive ids list head, but do not re-add
+		 * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush
+		 * list.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+	}
+	spin_unlock_irq(&priv->lock);
+}
+
+static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id,
+				        struct ipoib_cm_rx *p, unsigned psn)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+	u32 qp_num, index;
+	u64 i, recv_mem_used;
+
+	qp_num = p->qp->qp_num;
+
+	/* In the SRQ case there is a common rx buffer called the srq_ring.
+	 * However, for the NOSRQ we create an rx_ring for every
+	 * struct ipoib_cm_rx.
+	 */
+	p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL);
+	if (!p->rx_ring) {
+		printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n",
+		       qp_num);
+		return -ENOMEM;
+	}
+
+	spin_lock_irq(&priv->lock);
+	list_add(&p->list, &priv->cm.passive_ids);
+	spin_unlock_irq(&priv->lock);
+
+	init_context_and_add_list(cm_id, p, priv);
+	spin_lock_irq(&priv->lock);
+		
+	for (index = 0; index < max_rc_qp; index++)
+		if (priv->cm.rx_index_table[index] == NULL)
+			break;
+
+	recv_mem_used = (u64)ipoib_recvq_size *
+		        (u64)atomic_inc_return(&current_rc_qp)
+		         * CM_PACKET_SIZE; /* packets are 64K */
+	if ((index == max_rc_qp) ||
+	( recv_mem_used >= max_recv_buf * (1ul << 20))) {
+		spin_unlock_irq(&priv->lock);
+		ipoib_warn(priv, "NOSRQ has reached the configurable limit "
+		           "of either %d RC QPs or, max recv buf size of "
+			   "0x%x MB\n", max_rc_qp, max_recv_buf);
+
+		/* We send a REJ to the remote side indicating that we
+		 * have no more free RC QPs and leave it to the remote side
+		 * to take appropriate action. This should leave the
+		 * current set of QPs unaffected and any subsequent REQs
+		 * will be able to use RC QPs if they are available.
+		 */
+		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
+		ret = -EINVAL;
+		goto err_alloc_and_post;
+	}
+
+	priv->cm.rx_index_table[index] = p;
+	spin_unlock_irq(&priv->lock);
+
+	/* We will subsequently use this stored pointer while freeing
+	 * resources in stale task
+	 */
+	p->index = index;
+
+	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
+		ipoib_cm_dev_cleanup(dev);
+		goto err_alloc_and_post;
+	}
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
+					   IPOIB_CM_RX_SG - 1,
+					   p->rx_ring[i].mapping)) {
+			ipoib_warn(priv, "failed to allocate receive "
+			           "buffer %ld\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -ENOMEM;
+			goto err_alloc_and_post;
+		}
+
+		if (post_receive_nosrq(dev, i << 32 | index)) {
+			ipoib_warn(priv, "post_receive_nosrq "
+			           "failed for  buf %ld\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -EIO;
+			goto err_alloc_and_post;
+		}
+	}
+
+	return 0;
+
+err_alloc_and_post:
+	kfree(p->rx_ring);
+	return ret;
+}
+
  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
  {
  	struct net_device *dev = cm_id->context;
@@ -298,13 +473,13 @@ static int ipoib_cm_req_handler(struct i

  	ipoib_dbg(priv, "REQ arrived\n");
  	p = kzalloc(sizeof *p, GFP_KERNEL);
-	if (!p)
+	if (!p) {
+		printk(KERN_WARNING "Failed to allocate RX control block when "
+		       "REQ arrived\n");
  		return -ENOMEM;
+	}
  	p->dev = dev;
  	p->id = cm_id;
-	cm_id->context = p;
-	p->state = IPOIB_CM_RX_LIVE;
-	p->jiffies = jiffies;
  	INIT_LIST_HEAD(&p->list);

  	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -314,19 +489,20 @@ static int ipoib_cm_req_handler(struct i
  	}

  	psn = random32() & 0xffffff;
-	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
-	if (ret)
-		goto err_modify;
+	if (!priv->cm.srq) {
+		if ((ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn)))
+			goto err_post_nosrq;
+	} else {
+		p->rx_ring = NULL;
+		ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+		if (ret)
+			goto err_modify;
+	}

-	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
-			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
-	/* Add this entry to passive ids list head, but do not re-add it
-	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
-	p->jiffies = jiffies;
-	if (p->state == IPOIB_CM_RX_LIVE)
-		list_move(&p->list, &priv->cm.passive_ids);
-	spin_unlock_irq(&priv->lock);
+	if (priv->cm.srq) {
+		p->state = IPOIB_CM_RX_LIVE;
+		init_context_and_add_list(cm_id, p, priv);
+	}

  	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn);
  	if (ret) {
@@ -336,6 +512,9 @@ static int ipoib_cm_req_handler(struct i
  	}
  	return 0;

+err_post_nosrq:
+	list_del_init(&p->list);
+	atomic_dec(&current_rc_qp);
  err_modify:
  	ib_destroy_qp(p->qp);
  err_qp:
@@ -399,29 +578,60 @@ static void skb_put_frags(struct sk_buff
  	}
  }

-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed. */
+		if (!list_empty(&p->list))	
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
+	u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV;
  	struct sk_buff *skb, *newskb;
  	struct ipoib_cm_rx *p;
  	unsigned long flags;
  	u64 mapping[IPOIB_CM_RX_SG];
-	int frags;
+	int frags, ret;

  	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
  		       wr_id, wc->status);

  	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) {
+		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) {
  			spin_lock_irqsave(&priv->lock, flags);
  			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
  			ipoib_cm_start_rx_drain(priv);
  			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
  			spin_unlock_irqrestore(&priv->lock, flags);
  		} else
-			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-				   wr_id, ipoib_recvq_size);
+			ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> 0x%x)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
  		return;
  	}

@@ -429,23 +639,15 @@ void ipoib_cm_handle_rx_wc(struct net_de

  	if (unlikely(wc->status != IB_WC_SUCCESS)) {
  		ipoib_dbg(priv, "cm recv error "
-			   "(status=%d, wrid=%d vend_err %x)\n",
-			   wc->status, wr_id, wc->vendor_err);
+			   "(status=%d, wrid=0x%llx vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
  		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_srq;
  	}

  	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
  		p = wc->qp->qp_context;
-		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
-			spin_lock_irqsave(&priv->lock, flags);
-			p->jiffies = jiffies;
-			/* Move this entry to list head, but do not re-add it
-			 * if it has been moved out of list. */
-			if (p->state == IPOIB_CM_RX_LIVE)
-				list_move(&p->list, &priv->cm.passive_ids);
-			spin_unlock_irqrestore(&priv->lock, flags);
-		}
+		timer_check_srq(priv, p);
  	}

  	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
@@ -457,13 +659,112 @@ void ipoib_cm_handle_rx_wc(struct net_de
  		 * If we can't allocate a new RX buffer, dump
  		 * this packet and reuse the old buffer.
  		 */
-		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+		ipoib_dbg(priv, "failed to allocate receive buffer 0x%llx\n", (unsigned long long)wr_id);
+		++priv->stats.rx_dropped;
+		goto repost_srq;
+	}
+
+	ipoib_cm_dma_unmap_rx(priv, frags,
+	                      priv->cm.srq_ring[wr_id].mapping);
+	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
+
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb_reset_mac_header(skb);	
+	skb_pull(skb, IPOIB_ENCAP_LEN);
+
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
+
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_receive_skb(skb);
+
+repost_srq:
+	ret = post_receive_srq(dev, wr_id);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_srq failed for buf 0x%llx\n",
+		           (unsigned long long)wr_id);
+
+}
+
+static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb, *newskb;
+	u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32;
+	u32 index;
+	struct ipoib_cm_rx *rx_ptr;
+	int frags, ret;
+
+
+	ipoib_dbg_data(priv, "cm recv completion: id 0x%llx, status: %d\n",
+		       (unsigned long long)wr_id, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> %d)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ;
+
+	/* This is the only place where rx_ptr could be a NULL - could
+	 * have just received a packet from a connection that has become
+	 * stale and so is going away. We will simply drop the packet and
+	 * let the hardware (it's IB_QPT_RC) handle the dropped packet.
+	 * In the timer_check() function below, p->jiffies is updated and
+	 * hence the connection will not be stale after that.
+	 */
+	rx_ptr = priv->cm.rx_index_table[index];
+	if (unlikely(!rx_ptr)) {
+		ipoib_warn(priv, "Received packet from a connection "
+		           "that is going away. Hardware will handle it.\n");
+		return;
+	}
+
+	skb = rx_ptr->rx_ring[wr_id].skb;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ipoib_dbg(priv, "cm recv error "
+			   "(status=%d, wrid=0x%llx vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
+		++priv->stats.rx_dropped;
+		goto repost_nosrq;
+	}
+
+	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+		/* There are no guarantees that wc->qp is not NULL for HCAs
+	 	* that do not support SRQ. */
+		timer_check_nosrq(priv, rx_ptr);
+	}
+
+	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
+					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
+
+	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags,
+				       mapping);
+	if (unlikely(!newskb)) {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		ipoib_dbg(priv, "failed to allocate receive buffer 0x%llx\n", (unsigned long long)wr_id);
  		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_nosrq;
  	}

-	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
-	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
+	ipoib_cm_dma_unmap_rx(priv, frags,
+	                      rx_ptr->rx_ring[wr_id].mapping);
+	memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);

  	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
  		       wc->byte_len, wc->slid);
@@ -483,10 +784,22 @@ void ipoib_cm_handle_rx_wc(struct net_de
  	skb->pkt_type = PACKET_HOST;
  	netif_receive_skb(skb);

-repost:
-	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
-		ipoib_warn(priv, "ipoib_cm_post_receive failed "
-			   "for buf %d\n", wr_id);
+repost_nosrq:
+	ret = post_receive_nosrq(dev, wr_id << 32 | index);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_nosrq failed for buf 0x%llx\n",
+		           (unsigned long long)wr_id);
+}
+
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->cm.srq)
+		handle_rx_wc_srq(dev, wc);
+	else
+		handle_rx_wc_nosrq(dev, wc);
  }

  static inline int post_send(struct ipoib_dev_priv *priv,
@@ -678,6 +991,42 @@ err_cm:
  	return ret;
  }

+static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	int i;
+
+	for (i = 0; i < ipoib_recvq_size; ++i)
+		if (p->rx_ring[i].skb) {
+			ipoib_cm_dma_unmap_rx(priv,
+				         IPOIB_CM_RX_SG - 1,
+					 p->rx_ring[i].mapping);
+			dev_kfree_skb_any(p->rx_ring[i].skb);
+			p->rx_ring[i].skb = NULL;
+		}
+	kfree(p->rx_ring);
+}
+
+void dev_stop_nosrq(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_cm_rx *p;
+
+	spin_lock_irq(&priv->lock);
+	while (!list_empty(&priv->cm.passive_ids)) {
+		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		free_resources_nosrq(priv, p);
+		list_del(&p->list);
+		spin_unlock_irq(&priv->lock);
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		atomic_dec(&current_rc_qp);
+		kfree(p);
+		spin_lock_irq(&priv->lock);
+	}
+	spin_unlock_irq(&priv->lock);
+
+	cancel_delayed_work(&priv->cm.stale_task);
+}
+
  void ipoib_cm_dev_stop(struct net_device *dev)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -692,6 +1041,11 @@ void ipoib_cm_dev_stop(struct net_device
  	ib_destroy_cm_id(priv->cm.id);
  	priv->cm.id = NULL;

+	if (!priv->cm.srq) {
+		dev_stop_nosrq(priv);
+		return;
+	}
+
  	spin_lock_irq(&priv->lock);
  	while (!list_empty(&priv->cm.passive_ids)) {
  		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
@@ -737,6 +1091,7 @@ void ipoib_cm_dev_stop(struct net_device
  		kfree(p);
  	}

+
  	cancel_delayed_work(&priv->cm.stale_task);
  }

@@ -815,7 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
  	attr.recv_cq = priv->cq;
  	attr.srq = priv->cm.srq;
  	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_recv_wr = 1;
  	attr.cap.max_send_sge = 1;
+	attr.cap.max_recv_sge = 1;
  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
  	attr.qp_type = IB_QPT_RC;
  	attr.send_cq = cq;
@@ -855,7 +1212,7 @@ static int ipoib_cm_send_req(struct net_
  	req.retry_count 	      = 0; /* RFC draft warns against retries */
  	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
  	req.max_cm_retries 	      = 15;
-	req.srq 	              = 1;
+	req.srq			      = !!priv->cm.srq;
  	return ib_send_cm_req(id, &req);
  }

@@ -1200,6 +1557,9 @@ static void ipoib_cm_rx_reap(struct work
  	list_for_each_entry_safe(p, n, &list, list) {
  		ib_destroy_cm_id(p->id);
  		ib_destroy_qp(p->qp);
+		if (!priv->cm.srq) {	
+			atomic_dec(&current_rc_qp);
+		}
  		kfree(p);
  	}
  }
@@ -1218,12 +1578,19 @@ static void ipoib_cm_stale_task(struct w
  		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
  		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
  			break;
-		list_move(&p->list, &priv->cm.rx_error_list);
-		p->state = IPOIB_CM_RX_ERROR;
-		spin_unlock_irq(&priv->lock);
-		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
-		if (ret)
-			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		if (!priv->cm.srq) {
+			free_resources_nosrq(priv, p);
+			list_del_init(&p->list);
+			priv->cm.rx_index_table[p->index] = NULL;
+			spin_unlock_irq(&priv->lock);
+		} else {
+			list_move(&p->list, &priv->cm.rx_error_list);
+			p->state = IPOIB_CM_RX_ERROR;
+			spin_unlock_irq(&priv->lock);
+			ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+			if (ret)
+				ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		}
  		spin_lock_irq(&priv->lock);
  	}

@@ -1277,16 +1644,40 @@ int ipoib_cm_add_mode_attr(struct net_de
  	return device_create_file(&dev->dev, &dev_attr_mode);
  }

+static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv)
+{
+	/* Designated initializer zeroes the fields not set explicitly */
+	struct ib_srq_init_attr srq_init_attr = {
+		.attr = { .max_wr = ipoib_recvq_size, .max_sge = IPOIB_CM_RX_SG }
+	};
+	int ret;
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size *
+		                    sizeof *priv->cm.srq_ring,
+			            GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring "
+		       "(%d entries)\n",
+	       	       priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
  int ipoib_cm_dev_init(struct net_device *dev)
  {
  	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_srq_init_attr srq_init_attr = {
-		.attr = {
-			.max_wr  = ipoib_recvq_size,
-			.max_sge = IPOIB_CM_RX_SG
-		}
-	};
  	int ret, i;
+	struct ib_device_attr attr;

  	INIT_LIST_HEAD(&priv->cm.passive_ids);
  	INIT_LIST_HEAD(&priv->cm.reap_list);
@@ -1303,20 +1694,32 @@ int ipoib_cm_dev_init(struct net_device

  	skb_queue_head_init(&priv->cm.skb_queue);

-	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
-	if (IS_ERR(priv->cm.srq)) {
-		ret = PTR_ERR(priv->cm.srq);
-		priv->cm.srq = NULL;
+	if ((ret = ib_query_device(priv->ca, &attr)))
  		return ret;
-	}

-	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
-				    GFP_KERNEL);
-	if (!priv->cm.srq_ring) {
-		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
-		       priv->ca->name, ipoib_recvq_size);
-		ipoib_cm_dev_cleanup(dev);
-		return -ENOMEM;
+	if (attr.max_srq) {
+		/* This device supports SRQ */
+		if ((ret = create_srq(dev, priv)))
+			return ret;
+		priv->cm.rx_index_table = NULL;
+	} else {
+		priv->cm.srq = NULL;
+		priv->cm.srq_ring = NULL;
+
+		/* Every new REQ that arrives creates a struct ipoib_cm_rx.
+		 * These structures form a linked list starting with the
+		 * passive_ids. For quick and easy access we maintain a table
+		 * of pointers to struct ipoib_cm_rx called the rx_index_table
+		 */
+		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
+					 sizeof *priv->cm.rx_index_table,
+					 GFP_KERNEL);
+		if (!priv->cm.rx_index_table) {
+			printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n");
+			return -ENOMEM;
+		}
+
+		atomic_set(&current_rc_qp, 0);
  	}

  	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -1329,17 +1732,24 @@ int ipoib_cm_dev_init(struct net_device
  	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
  	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;

-	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
+	/* One can post receive buffers even before the RX QP is created
+	 * only in the SRQ case. Therefore for NOSRQ we skip the rest of init
+	 * and do that in ipoib_cm_req_handler()
+	 */
+
+	if (priv->cm.srq) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
  					   priv->cm.srq_ring[i].mapping)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -ENOMEM;
-		}
-		if (ipoib_cm_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -EIO;
+				ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -ENOMEM;
+			}
+			if (post_receive_srq(dev, i)) {
+				ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -EIO;
+			}
  		}
  	}

--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-10 18:30:10.000000000 -0400
@@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i
  		for (i = 0; i < n; ++i) {
  			struct ib_wc *wc = priv->ibwc + i;

-			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+			if (wc->wr_id & IPOIB_CM_OP_RECV) {
  				++done;
  				--max;
  				ipoib_cm_handle_rx_wc(dev, wc);
@@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d
  	do {
  		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
  		for (i = 0; i < n; ++i) {
-			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV)
  				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
  			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
  				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-10 18:30:10.000000000 -0400
@@ -175,6 +175,15 @@ int ipoib_transport_dev_init(struct net_
  	if (!ret)
  		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;

+	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
+	 * overflow. Every new REQ creates a new RX QP and each QP has an
+	 * RX ring associated with it. Therefore we could have
+	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
+	 * in a CQ.
+	 */
+	if (!priv->cm.srq)
+		size += (NOSRQ_INDEX_TABLE_SIZE - 1) * ipoib_recvq_size;
+
  	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
  	if (IS_ERR(priv->cq)) {
  		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
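
For readers following the NOSRQ receive path in the patch above: each
work request ID carries two values, the ring slot in the upper 32 bits
(`i << 32 | index` when posting, `wc->wr_id >> 32` on completion) and the
connection's rx_index_table index in the low bits. A minimal sketch of
that packing follows. The SKETCH_* constants and the helper names are
assumptions for illustration only; the real IPOIB_CM_OP_RECV and
NOSRQ_INDEX_MASK values live in a header not shown here, and the sketch
assumes the flag bit is ORed in when the receive is posted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_OP_RECV    (1ULL << 30)	/* "this is a CM receive" flag */
#define SKETCH_INDEX_MASK 0xfffULL	/* hypothetical 4096-entry index table */

/* Pack the ring slot (upper 32 bits) and connection index (low bits). */
uint64_t sketch_wrid_pack(uint64_t ring_slot, uint32_t conn_index)
{
	assert(ring_slot <= UINT32_MAX);
	assert(conn_index <= SKETCH_INDEX_MASK);
	return (ring_slot << 32) | conn_index | SKETCH_OP_RECV;
}

/* Recover the ring slot, as wc->wr_id >> 32 does in the patch. */
uint64_t sketch_wrid_ring_slot(uint64_t wr_id)
{
	return wr_id >> 32;
}

/* Recover the connection index: drop the flag bit, then mask. */
uint32_t sketch_wrid_conn_index(uint64_t wr_id)
{
	return (uint32_t)((wr_id & ~SKETCH_OP_RECV) & SKETCH_INDEX_MASK);
}
```

Note that the left operand of `<< 32` has to be a 64-bit type; shifting a
32-bit int by 32 is undefined behaviour in C.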


From rdreier at cisco.com  Tue Jul 17 15:41:17 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 15:41:17 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717221206.GC19243@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 18 Jul 2007 01:12:06 +0300")
References: <ada3azmrcap.fsf@cisco.com> <20070717210935.GA17168@mellanox.co.il>
	<A15335FBE9BD2449AF2C9EF3D1EB8EA303DCFD6A@xmb-sjc-216.amer.cisco.com>
	<20070717215811.GA19243@mellanox.co.il> <ada4pk2ptuo.fsf@cisco.com>
	<20070717221206.GC19243@mellanox.co.il>
Message-ID: <adazm1uodpu.fsf@cisco.com>

 > > I think it's easy enough to make the revision of the RPMS be something
 > > like -0.1.2007-07-17.1 or something like that.
 > 
 > OK, so you say just ignore the content and stick a date in there?
 > Fine, that'll work, and we can cover the RCs this way too I think.

I just meant to add a revision that encodes the daily build if you
want to do daily builds.  So you could have libibverbs RPMs with
version 1.1.2-0.1.2007-07-17.1 or whatever, and then do 1.1.2-0.2.beta1
and 1.1.2-1 final.

 - R.
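
The ordering described above (daily build < beta < final) falls out of a
segment-by-segment comparison of the release string. Below is a rough
sketch in C, loosely modelled on the way RPM compares versions.
release_cmp is a hypothetical helper, not RPM's actual rpmvercmp(), and
it ignores details such as '~'. Note also that RPM does not accept '-'
inside a release field, so a real spec file would need a date form such
as 20070717.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns -1, 0 or 1 if release a is older than, equal to or newer than b. */
int release_cmp(const char *a, const char *b)
{
	assert(a && b);
	for (;;) {
		/* Skip separators such as '.' and '-'. */
		while (*a && !isalnum((unsigned char)*a))
			a++;
		while (*b && !isalnum((unsigned char)*b))
			b++;
		if (!*a || !*b)	/* whoever still has segments left is newer */
			return (*a != 0) - (*b != 0);

		int a_num = isdigit((unsigned char)*a) != 0;
		int b_num = isdigit((unsigned char)*b) != 0;

		if (a_num != b_num)	/* a numeric segment beats an alphabetic one */
			return a_num ? 1 : -1;

		if (a_num) {
			char *ea, *eb;
			unsigned long x = strtoul(a, &ea, 10);
			unsigned long y = strtoul(b, &eb, 10);

			if (x != y)
				return x < y ? -1 : 1;
			a = ea;
			b = eb;
		} else {
			size_t la = 0, lb = 0;
			int c;

			while (isalpha((unsigned char)a[la]))
				la++;
			while (isalpha((unsigned char)b[lb]))
				lb++;
			c = strncmp(a, b, la < lb ? la : lb);
			if (c)
				return c < 0 ? -1 : 1;
			if (la != lb)
				return la < lb ? -1 : 1;
			a += la;
			b += lb;
		}
	}
}
```

With this comparison, "0.1.2007-07-17.1" sorts before "0.2.beta1", which
in turn sorts before "1", because the decision is made at the first
differing numeric segment.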


From pradeeps at linux.vnet.ibm.com  Tue Jul 17 15:43:37 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 17 Jul 2007 15:43:37 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) extension [PATCH V2] patch
Message-ID: <469D4619.2040904@linux.vnet.ibm.com>

This patch handles the corner case of running out of RC QPs. When a
connection request is rejected with IB_CM_REJ_NO_QP, the sender switches
to UD mode for that neighbour. This patch can be used by both the NOSRQ
and SRQ code.

This is a resubmission of the previous patch against the 2.6.22 kernel.
No changes otherwise.

This patch has been tested with linux-2.6.22 derived from Roland's
for-2.6.23 git tree on ppc64 machines.


Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---

--- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 17:56:17.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 17:59:16.000000000 -0400
@@ -1372,8 +1372,18 @@ static int ipoib_cm_tx_handler(struct ib
  			ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
  				       NULL, 0, NULL, 0);
  		break;
-	case IB_CM_REQ_ERROR:
  	case IB_CM_REJ_RECEIVED:
+		ipoib_warn(priv, "REJ received\n");
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+		spin_unlock(&priv->lock);
+		
+		if ((neigh) && (event->param.rej_rcvd.reason ==
+		   IB_CM_REJ_NO_QP)) {
+			clear_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags);
+			break;
+		}
+	case IB_CM_REQ_ERROR:
  	case IB_CM_TIMEWAIT_EXIT:
  		ipoib_dbg(priv, "CM error %d.\n", event->event);
  		spin_lock_irq(&priv->tx_lock);
--- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-17 17:59:16.000000000 -0400
@@ -679,11 +679,10 @@ static int ipoib_start_xmit(struct sk_bu

  		neigh = *to_ipoib_neigh(skb->dst->neighbour);

-		if (ipoib_cm_get(neigh)) {
-			if (ipoib_cm_up(neigh)) {
+		if (ipoib_cm_get(neigh) &&  ipoib_cm_up(neigh) &&
+			test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) {
  				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
  				goto out;
-			}
  		} else if (neigh->ah) {
  			if (unlikely(memcmp(&neigh->dgid.raw,
  					    skb->dst->neighbour->ha + 4,


From rdreier at cisco.com  Tue Jul 17 15:44:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 15:44:16 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch
In-Reply-To: <469D451F.6030706@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Tue, 17 Jul 2007 15:39:27 -0700")
References: <469D451F.6030706@linux.vnet.ibm.com>
Message-ID: <adaveciodkv.fsf@cisco.com>

I'll take a closer look later, but please try to find a way to post
patches so they don't get line wrapped (otherwise I won't be able to
apply it, even if we converge on something acceptable).

Also please do some basic quality control.  Running this patch through
scripts/checkpatch.pl shows many many small style problems -- you can
ignore the 80 character limit warnings for the most part, but there is
plenty of whitespace damage and other stuff to fix.

 - R.


From hal.rosenstock at gmail.com  Tue Jul 17 15:50:02 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Jul 2007 15:50:02 -0700
Subject: [ofa-general] Re: [ewg] OFED July 16 meeting summary
In-Reply-To: <469D1E7C.7040701@mellanox.co.il>
References: <469D1E7C.7040701@mellanox.co.il>
Message-ID: <f0e08f230707171550j1230d86ct990051a70483c9d0@mail.gmail.com>

Hi Tziporet,

On 7/17/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
>
>  *OFED July 16 meeting summary*
>
> *1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August. *
>
> There was a long discussion on pros & cons regarding merging the two
> releases.
>
> Pros:
> - Everybody will be focused on the same release
> - All user space libs (except for the new libmlx4) are the same
> - Reduce QA efforts
>
> Cons:
> - The kernel is now based on 2.6.22, and this can cause instability.
> - Harder to distinguish the differences between 1.2 and 1.2.c
> (since it's not only a few patches).
> - The 1.2.c release was aimed at ConnectX support only. If we lump the two
> releases together it may slow the convergence of this release.
>
> In addition there is a need to check with IBM and Chelsio, who actually
> asked for the 1.2.1 release, if this suits them.
> Steve agreed to test 1.2.c to see if it's OK with his fixes.
> Need a response from IBM too. (BTW - no patches from IBM were sent so far.)
>
> Decision: No decision was taken.
> I suggest we stay with two different branches for now.
> After more people test 1.2.c and see whether it's stable enough, we can
> decide not to do 1.2.1.
>
> *2. Agree on OFED 1.3 schedule:*
> The suggested schedule:
>         * Feature freeze - Sep 4
>         * Alpha release - Sep 10
>         * Beta release - Sep 25
>         * RC1 - Oct 16
>         * RC2 - Oct 30
>         * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov
> 11)
>         * RC4 - Nov 20
>         * GA release - Nov 30 (or first week of Dec)
>
> Discussion:
> - Due to the 1.2.c release the schedule seems very tight.
> - Since 1.2.c advances only the kernel, many user-level features that are
> already done are not exposed to customers in an OFED release.
>
> Decision: Revisit the schedule in September according to the readiness of
> the "must have" features.
>
> *3. Review OFED 1.3 features list:*
>
> There was agreement on the must-have features, except QoS, which should
> be defined after the IBTA spec is published.
> We have not reviewed the list of features thoroughly. Each company should
> review the features and send comments to the list.
>
> Must have general features:
> ====================
>
>    - Kernel based on 2.6.23 (all new features that will be part of this
>    kernel will be included in OFED 1.3)
>    - Install:
>       - Break the packages RPMs (work with Novell and Redhat) to
>       minimize integration effort into OS distribution
>     - Package:
>       - Sources arrangement for the end user (for the labs)
>     - New HCAs & RNICs:
>       - ConnectX support
>       - NetEffect support
>         - QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)
>
> Other features (must have marked with *)
> ==============================
>
>    - libibverbs: New verbs:
>       - Scalable Reliable Connected Transport (with Mellanox
>       ConnectX)*
>       - Reliable Multicast?
>
> ULPs:
>
>    - IPoIB:
>       - Performance improvements (those that will be stable on time)
>       - NAPI - done
>     - SDP:
>       - * Keepalive
>       - * AIO
>     - uDAPL:
>       - DAT 2.0 support with IB extensions for immediate data,
>       atomics;
>       - Add extensions for new verbs (SRCT,RM)
>     - VNIC:
>       - GA quality. Not a technology preview version anymore.
>       - Added support for QLogic EVIC (10 Gbps
>       Infiniband-to-Ethernet gateway) - in GA
>     - RDS: RDMA API (using FMRs); GA quality with Oracle 11
>    - NFSoRDMA integration - pending we have a maintainer
>     - Management:
>       - * Multiple partitions via libibumad
>       - OpenSM
>          - More routing performance improvements - done
>          - Even more speedups - done
>          - Better packaging/installation - done
>          - "Native" daemon mode - done
>          - * Performance management
>          - * Quality of Service manager: Based on IBTA annex
>          - Enhancements for fat tree routing (non pure tree
>          support) - done
>          - More console commands and telnet access to console -
>          done
>        - More diagnostics
>          - ibidsverify.pl: validate LIDs and GUIDs in subnet -
>          done
>          - Updated ibnetdiscover format with link width and
>          speed, and GUIDs - done
>          - ibnetdiscover grouping support for new Voltaire
>          chassis - done
>          - diag updates for IB router support - done
>          - iblinkinfo.pl: Support peer port link width and speed
>          validation - done
>          - ibdatacounters: Add script and man page for subnet
>          wide data counters saquery enhancements - done
>
>
What happened to ibsim ? It was on the list I sent. Is there any reason it
can't be included ?

Thanks.

-- Hal

> iWARP:
>
>       - * Chelsio: Get to GA level
>       - NetEffect: Get the drivers into OFED
>
>

From dledford at redhat.com  Tue Jul 17 16:09:27 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 23:09:27 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717202730.GA15990@mellanox.co.il>
References: <20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il>
Message-ID: <1184713767.5165.547.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 23:27 +0300, Michael S. Tsirkin wrote:
> > So you need to be able to
> > tell the difference between a customer running libibverbs-1.0.4 from
> > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.
> 
> I don't really think we want customers to run beta code, or intend to support
> such configurations.

It's not so much whether you want them to or not.  If you make it
available, some of them *will* run it.  Not necessarily in production,
but still run it nonetheless.  You need to be able to tell the
difference.  And they need to be able to tell the difference.  What if
they installed it to test and provide feedback to you, but then because
the version numbers weren't distinct, and they forgot just which
machines they put it in, *they* no longer knew which was the beta/rc
code or the final release?

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From dledford at redhat.com  Tue Jul 17 16:11:47 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 23:11:47 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717210935.GA17168@mellanox.co.il>
References: <20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
Message-ID: <1184713907.5165.549.camel@firewall.xsintricity.com>

On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote:
> > Quoting Roland Dreier <rdreier at cisco.com>:
> > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > 
> >  > I don't really think we want customers to run beta code
> > 
> > What's the point of a beta then??
> 
> Donnu.
> In previous OFED releases, we had "release candidates" rather than "beta".
> Openfabrics members were running RCs and reporting issues on the list and in
> bugzilla. Do you really ask your customers to do this for you?

Sure, as much as possible.  I generally don't recommend using it in
production, but just as close as they can get to production is fine with
me.  The more issues they find while I'm still actually working on it
and making new revisions, the fewer issues they'll find after I stupidly
think I'm done.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From pradeeps at linux.vnet.ibm.com  Tue Jul 17 17:01:04 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Tue, 17 Jul 2007 17:01:04 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch
In-Reply-To: <adaveciodkv.fsf@cisco.com>
References: <469D451F.6030706@linux.vnet.ibm.com> <adaveciodkv.fsf@cisco.com>
Message-ID: <469D5840.5040802@linux.vnet.ibm.com>

Roland Dreier wrote:
> I'll take a closer look later, but please try to find a way to post
> patches so they don't get line wrapped (otherwise I won't be able to
> apply it, even if we converge on something acceptable).
> 

As far as I know there should be no line wrap issues any more. Do you
still see it?

> Also please do some basic quality control.  Running this patch through
> scripts/checkpatch.pl shows many many small style problems -- you can
> ignore the 80 character limit warnings for the most part, but there is
> plenty of whitespace damage and other stuff to fix.
>

Will do.

Pradeep


From rdreier at cisco.com  Tue Jul 17 18:08:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 18:08:07 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch
In-Reply-To: <469D5840.5040802@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Tue, 17 Jul 2007 17:01:04 -0700")
References: <469D451F.6030706@linux.vnet.ibm.com> <adaveciodkv.fsf@cisco.com>
	<469D5840.5040802@linux.vnet.ibm.com>
Message-ID: <adar6n6o6x4.fsf@cisco.com>

 > As far as I know there should be no line wrap issues any more. Do you
 > still see it?

Yes, e.g.:

-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev,
struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)

I think the problem is the "flowed" in:

 > Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I see

 > User-Agent: Thunderbird 2.0.0.4 (Windows/20070604)

and I think some people have managed to use Thunderbird to send
non-mangled patches, so you should be able to find some documentation.
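
One rough way to catch this kind of mangling before posting is to scan the
patch for hunk lines that do not start with a diff marker. The following is
only a heuristic sketch: check_wrapped is a hypothetical helper name, not part
of any kernel tooling, and it assumes a unified diff in which blank lines are
tolerated (some mailers strip the single space from empty context lines).

```shell
#!/bin/bash
# check_wrapped FILE  (hypothetical helper, heuristic only)
# Inside a unified-diff hunk every line should begin with ' ', '+', '-'
# or '\' (as in "\ No newline at end of file").  Any other non-empty
# line is reported as "<line number>: <text>", because it usually means
# the mailer wrapped a long line.  Returns 1 if anything was reported.
check_wrapped() {
	awk '
		/^@@ /   { in_hunk = 1; next }   # hunk header: start checking
		/^diff / { in_hunk = 0; next }   # next file header: stop checking
		in_hunk && length($0) > 0 && index(" +-\\", substr($0, 1, 1)) == 0 {
			print NR ": " $0
			bad = 1
		}
		END { exit bad + 0 }
	' "$1"
}
```

Running check_wrapped on a saved copy of the mail would flag the stray
"struct ib_wc *wc)" line quoted above. Text that follows the last hunk (for
example a name under the signature) can also be flagged, so treat the output
as a hint rather than a verdict.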


From mst at dev.mellanox.co.il  Tue Jul 17 19:18:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 05:18:54 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184713907.5165.549.camel@firewall.xsintricity.com>
References: <20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
Message-ID: <20070718021854.GD19243@mellanox.co.il>

> Quoting Doug Ledford <dledford at redhat.com>:
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> 
> On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote:
> > > Quoting Roland Dreier <rdreier at cisco.com>:
> > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > > 
> > >  > I don't really think we want customers to run beta code
> > > 
> > > What's the point of a beta then??
> > 
> > Donnu.
> > In previous OFED releases, we had "release candidates" rather than "beta".
> > Openfabrics members were running RCs and reporting issues on the list and in
> > bugzilla. Do you really ask your customers to do this for you?
> 
> Sure, as much as possible.  I generally don't recommend using it in
> production, but just as close as they can get to production is fine with
> me.  The more issues they find while I'm still actually working on it
> and making new revisions, the less issues they'll find after I stupidly
> think I'm done.

So, Roland's idea of sticking a date in the RPM revision will work, won't it?


-- 
MST


From dledford at redhat.com  Tue Jul 17 19:49:24 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 18 Jul 2007 02:49:24 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070718021854.GD19243@mellanox.co.il>
References: <20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
	<20070718021854.GD19243@mellanox.co.il>
Message-ID: <1184726964.5165.552.camel@firewall.xsintricity.com>

On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote:
> > Quoting Doug Ledford <dledford at redhat.com>:
> > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > 
> > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote:
> > > > Quoting Roland Dreier <rdreier at cisco.com>:
> > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > > > 
> > > >  > I don't really think we want customers to run beta code
> > > > 
> > > > What's the point of a beta then??
> > > 
> > > Donnu.
> > > In previous OFED releases, we had "release candidates" rather than "beta".
> > > Openfabrics members were running RCs and reporting issues on the list and in
> > > bugzilla. Do you really ask your customers to do this for you?
> > 
> > Sure, as much as possible.  I generally don't recommend using it in
> > production, but just as close as they can get to production is fine with
> > me.  The more issues they find while I'm still actually working on it
> > and making new revisions, the less issues they'll find after I stupidly
> > think I'm done.
> 
> So, Roland's idea of sticking a date in the RPM revision will work, won't it?

As long as you don't do two package builds on the same day.  That's why
my script encodes both an increasing number and the date into the
revision.

For reference, I'll attach the updated script I made for spitting out a
buildable tarball.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
-------------- next part --------------
A non-text attachment was scrubbed...
Name: make.dist
Type: application/x-shellscript
Size: 5272 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070718/9b4a120a/attachment.bin>

From dledford at redhat.com  Tue Jul 17 19:58:28 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 18 Jul 2007 02:58:28 +0000
Subject: [ofa-general] RFC OFED-1.3 installation
In-Reply-To: <1184688250.10172.8.camel@localhost>
References: <OF81499DC8.2B3ABFBB-ON8725731A.006B0533-8825731A.003F70B0@us.ibm.com>
	<1184688250.10172.8.camel@localhost>
Message-ID: <1184727508.5165.559.camel@firewall.xsintricity.com>

On Tue, 2007-07-17 at 19:04 +0300, Sasha Khapyorsky wrote:
> Hi,
> 
> On Mon, 2007-07-16 at 12:32 -0700, Shirley Ma wrote:
> > Does ib-utils depend on opensm-libs? If so, I would suggest renaming
> > opensm-libs to libsmutils. Otherwise ib-utils won't work without
> > installing the opensm package. Does this make sense?
> 
> Not the whole of opensm, just opensm-libs. Why does the name ("opensm-libs"
> or "libsmutils") matter?

It doesn't.  In the case of opensm, opensm requires opensm-libs, so it's
perfectly acceptable to install opensm-libs without opensm as there is
no requirement on opensm from opensm-libs.

Generally, it's standard practice that when you have something that's
primarily an app, but happens to provide libs that *can* be utilized by
other apps, then the naming is <appname>, <appname>-libs,
<appname>-devel.  Only when you have a package that is primarily a
library and any apps in the package are demo/test/example apps that
don't serve a useful purpose outside of the scope of the library do you
name the packages lib<name>, lib<name>-devel, and put the apps in
lib<name>-utils.

In addition, it is generally frowned upon to ship any static libraries
that customers might link against, but if you find that's truly
necessary, then it is preferred that the static libs be in a separate
-static package.  This way customers must intentionally install the
package to be able to link statically, so it won't happen by accident.
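
As a small illustration of the naming rule described above, here is a sketch
that prints the package names for each case. subpackage_names is a
hypothetical helper written for this note only; it is not part of rpm or of
any OFED tooling.

```shell
#!/bin/bash
# subpackage_names KIND NAME  (hypothetical helper, illustration only)
#   KIND=app: the package is primarily an application that also ships
#             libraries other apps can use.
#   KIND=lib: the package is primarily a library, and its programs are
#             only demo/test/example apps.
subpackage_names() {
	case "$1" in
	app) echo "$2 $2-libs $2-devel" ;;
	lib) echo "lib$2 lib$2-devel lib$2-utils" ;;
	*)   echo "usage: subpackage_names app|lib NAME" >&2; return 1 ;;
	esac
}
```

For example, "subpackage_names app opensm" prints
"opensm opensm-libs opensm-devel", which matches the opensm / opensm-libs
split discussed in this thread.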

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From or.gerlitz at gmail.com  Tue Jul 17 20:20:15 2007
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 06:20:15 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <4696D1F3.2040507@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
Message-ID: <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>

On 7/13/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> >  - Take a look at Sean's local SA caching patches.  I merged
> >    everything else from Sean's tree, but I'm still undecided about
> >    these.  I haven't read them carefully yet, but even aside from that
> >    I don't have a good feeling about whether there's consensus about
> >    this yet.  Any opinions about merging, for or against, would be
> >    appreciated here.
>
> But to be fair, it will be difficult to enable both QoS and local PR
> caching.  To me, this would be the strongest reason against using it.
> However, QoS places additional burden on the SA, which will make scaling
> even more challenging.


My understanding is that the local SA does a path query where all the fields
except for the SGID are wildcarded. This means we expect the result to be a
table of all the paths from this port to every other port on the fabric, for
every pkey which this port is a member of, etc. Correct?

How do you plug the QoS concept of an SID into the path query here? Are you
expecting the SA to work out all the services for which this port is
a "member"? Does the proposed definition for QoS management at the SA
define "services per GID"? Isn't it "what SL to use per service"?

Or.

From rdreier at cisco.com  Tue Jul 17 20:23:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:23:38 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> (Or
	Gerlitz's message of "Wed, 18 Jul 2007 06:20:15 +0300")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
Message-ID: <adaabtuo0n9.fsf@cisco.com>

 > > But to be fair, it will be difficult to enable both QoS and local PR
 > > caching.  To me, this would be the strongest reason against using it.
 > > However, QoS places additional burden on the SA, which will make scaling
 > > even more challenging.
 > 
 > My understanding is that the local SA does a path query where all the fields
 > except for the SGID are wildcarded. This means we expect the result to be a
 > table of all the paths from this port to every other port on the fabric, for
 > every pkey which this port is a member of, etc. Correct?
 > 
 > How do you plug the QoS concept of an SID into the path query here? Are you
 > expecting the SA to work out all the services for which this port is
 > a "member"? Does the proposed definition for QoS management at the SA
 > define "services per GID"? Isn't it "what SL to use per service"?

Or, thanks for rescuing this post.

I think this is an important question.  If we merge the local SA
stuff, then are we creating a problem for dealing with QoS?  Are we
going to have to revert the local SA stuff once the QoS stuff is
available?  Or is there at least a sketch of a plan on how to handle
this?

 - R.


From dledford at redhat.com  Tue Jul 17 20:30:15 2007
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 17 Jul 2007 23:30:15 -0400
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070718021854.GD19243@mellanox.co.il>
References: <20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
	<20070718021854.GD19243@mellanox.co.il>
Message-ID: <1184729415.5165.570.camel@firewall.xsintricity.com>

On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote:
> > Quoting Doug Ledford <dledford at redhat.com>:
> > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > 
> > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote:
> > > > Quoting Roland Dreier <rdreier at cisco.com>:
> > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > > > 
> > > >  > I don't really think we want customers to run beta code
> > > > 
> > > > What's the point of a beta then??
> > > 
> > > Donnu.
> > > In previous OFED releases, we had "release candidates" rather than "beta".
> > > Openfabrics members were running RCs and reporting issues on the list and in
> > > bugzilla. Do you really ask your customers to do this for you?
> > 
> > Sure, as much as possible.  I generally don't recommend using it in
> > production, but just as close as they can get to production is fine with
> > me.  The more issues they find while I'm still actually working on it
> > and making new revisions, the less issues they'll find after I stupidly
> > think I'm done.
> 
> So, Roland's idea of sticking a date in the RPM revision will work, won't it?

As long as you don't do two package builds on the same day.  That's why
my script encodes both an increasing number and the date into the
revision.
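
To make the counter-plus-date idea concrete, here is a minimal sketch of how
such a revision string can be built. It mirrors the 0.<n>.<date>git form used
by the daily branch of the script in this message, but
next_snapshot_release and its arguments are hypothetical names introduced for
illustration, not something the script defines.

```shell
#!/bin/bash
# next_snapshot_release COUNTER_FILE [DATE]  (hypothetical helper)
# Reads the last build counter from COUNTER_FILE (treated as 0 if the
# file is missing), increments it, stores it back, and prints a
# revision of the form 0.<counter>.<YYYYMMDD>git.  DATE defaults to
# today; it is a parameter only so the function is easy to try out.
next_snapshot_release() {
	local counter_file=$1
	local today=${2:-$(date +%Y%m%d)}
	local n=0

	if [ -f "$counter_file" ]; then
		n=$(cat "$counter_file")
	fi
	n=$((n + 1))
	echo "$n" > "$counter_file"
	echo "0.${n}.${today}git"
}
```

Because the counter comes before the date and increases on every build, two
snapshots made on the same day still get different revisions (0.1.<date>git,
then 0.2.<date>git).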

For reference, I'll attach the updated script I made for spitting out a
buildable tarball.

Hehehe... resending because the ofa list server ate my message due to the
script attachment :-D  I'll inline it instead.

I guess I'll also mention that this script lives in my ~/repos/upstream
directory, alongside all the git repos that I have cloned from ofa (as
well as other places).  So it sits one level above the various git clones
and spits everything out into dist/.

The easiest way to use this script for any given package you want to
create a daily snapshot of is to run

	./make.dist repodir daily
	scp dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads

Assuming you create a reasonable reponame.spec.in file in the repos that
are missing one, that simple action spits out a tarball that can be passed
directly to rpmbuild -ta reponame-git.tgz, and rpm will spit out the
packages.  The repodir-daily.HEAD file records the HEAD of the git repo,
so you know exactly what state the tarball represents, and you can always
get back to it in a more recent repo by checking out that commit.

#!/bin/bash

usage() {
echo "$0 repo daily | release [ signed | <key-id> ]"
echo
echo "	You must specify the repo to make a distribution tarball in.  This"
echo "script will not work with complex repos like the management repo that"
echo "builds more than one package.  It expects a repo to be a single package"
echo "repo where the directory name and the package name are the same, and"
echo "where a properly formatted reponame.spec.in file exists."
echo
echo "	You must specify either release or daily in order for this script"
echo "to make tarballs.  If this is a daily release, the tarballs will"
echo "be named <component>-git.tgz and will overwrite existing tarballs."
echo "If this is a release build, then the tarball will be named"
echo "<component>-<version>.tgz and must be a new file.  In addition,"
echo "the script will add a new set of symbolic tags to the git repo"
echo "that correspond to the <component>-<version> of each tarball."
echo
echo "	If the script detects that the tag on any component already exists,"
echo "it will abort the release and prompt you to update the version on"
echo "the already tagged component.  This enforces the proper behavior of"
echo "treating any released tarball as set in stone so that in the future"
echo "you will always be able to get to any given release tarball by"
echo "checking out the git tag and know with certainty that it is the same"
echo "code as released before even if you no longer have the same tarball"
echo "around."
echo
echo "	As part of this process, the script will parse the <target>.spec.in"
echo "file and output a <target>.spec file.  Since this script isn't smart"
echo "enough to deal with other random changes that should have their own"
echo "checkin, the script will refuse to run if the current repo state is not"
echo "clean."
echo
echo "	NOTE: the script has no clue if you are tagging on the right branch,"
echo "it will however show you the git branch output so you can confirm it"
echo "is on the right branch before proceeding with the release."
echo
echo "	In addition to just tagging the git repo, whenever creating a release"
echo "there is an optional argument of either signed or a hex gpg key-id."
echo "If you do not pass an argument to release, then the tag will be a"
echo "simple git annotated tag.  If you pass signed as the argument, the"
echo "git tag operation will use your default signing key to sign the tag."
echo "Or you can pass an actual gpg key id in hex format and git will sign"
echo "the tag with that key."
echo 
}

if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi

if [ ! -d "$1" ]; then usage; exit 1; fi

TMPDIR=dist
if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi

if [ "$2" = "daily" -o "$2" = "release" ]; then
	if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
		touch $TMPDIR/$1-$2.HEAD
	fi
	NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
else
	usage
	exit 1
fi

cd "$1"
echo "Updating git repo..."
git pull
RESULT=$?
HEAD=`git log --pretty=oneline -1`

if [ "$RESULT" -ne 0 ]; then
	echo "Failed to update the git repo cleanly, manual intervention required"
	exit 1
fi

if [ "$HEAD" = "$NEWHEAD" ]; then
	echo "No new commits since last tarball creation, nothing to do."
	cd ..
	exit 0
fi

if [ "$2" = "release" ]; then
	# Is the repo clean?
	git status | grep modified > /dev/null 2>&1
	if [ $? = 0 ]; then
		echo "There are modified files in the repo.  Please check any"
		echo "changes in before proceeding."
		exit 4
	fi
	# Since we will be tagging things, make sure we are on the right
	# branch
	git branch
	echo -n "Is the active branch the right one to tag this release on [y/N]? "
	read answer
	if [ "$answer" = y -o "$answer" = Y ]; then
		echo "Proceeding..."
	else
		echo "Please check out the right branch and run make.dist again"
		exit 0
	fi
	# Check versions to make sure that we can proceed
	VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
	TARBALL=$1-$VERSION.tgz
	if [ -f ../$TMPDIR/$TARBALL ]; then
		echo "Target $TARBALL already exists, please update the version of"
		echo "$1"
		exit 2
	fi
	if [ ! -z "`git tag -l $1-$VERSION`" ]; then
		echo "A git tag already exists for $1-$VERSION.  Please change the version"
		echo "of $1 so a tag replacement won't occur."
		exit 3
	fi
# On a real release, this resets the daily release starting point, on the
# assumption that any new daily builds will have a version number that is
# incrementally higher than the last officially released tarball.
	RELEASE=1
	echo $RELEASE > ../$TMPDIR/$1.release
else
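	# Daily snapshots need the version too; without it the copy made below
	# is named "$1-" and @VERSION@ in the spec is replaced by nothing.
	VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
	DATE=`date +%Y%m%d`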
	if [ -f ../$TMPDIR/$1.release ]; then
		RELEASE=`cat ../$TMPDIR/$1.release`
		RELEASE=`expr $RELEASE + 1`
	else
		RELEASE=1
	fi
	echo $RELEASE > ../$TMPDIR/$1.release
	RELEASE=0.${RELEASE}.${DATE}git
	TARBALL=$1-git.tgz
fi

cd ..
cp -a $1 $1-$VERSION
[ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec
if [ -f $1-$VERSION/autogen.sh ]; then
	cd $1-$VERSION
	./autogen.sh
	cd ..
fi
echo "Creating $TMPDIR/$TARBALL"
tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION
rm -rf $1-$VERSION
echo "$HEAD" > $TMPDIR/$1-$2.HEAD

if [ $2 = release ]; then
	echo "Tagging release."
	cd $1
	if [ ! -z "$3" ]; then
		if [ $3 = "signed" ]; then
			git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
		else
			git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
		fi
	else
		git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
	fi
	cd ..
fi


-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From rdreier at cisco.com  Tue Jul 17 20:31:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:31:15 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in QP access flags
In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il> (Dotan Barak's
	message of "Tue, 17 Jul 2007 17:58:57 +0300")
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
Message-ID: <ada644io0ak.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Tue Jul 17 20:41:02 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:41:02 -0700
Subject: [ofa-general] Re: [PATCH v2] mlx4: add device reset to error
	handling mechanism
In-Reply-To: <200707121750.45629.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 12 Jul 2007 17:50:45 +0300")
References: <200707121750.45629.jackm@dev.mellanox.co.il>
Message-ID: <ada1wf6nzu9.fsf@cisco.com>

thanks, applied as below with quite a few changes:
 - I was wrong to suggest round_jiffies_relative() -- we really just
   want round_jiffies()
 - I don't think the "stop" variable is needed at all --
   del_timer_sync() should be safe without it (and yes this cleanup
   applies to mthca as well)
 - Don't start polling if the ioremap fails, it will obviously cause
   an instant oops

 - R.


commit ee49bd9397cd2b8fe7a1962505d81c1d0a1366fc
Author: Jack Morgenstein <jackm at dev.mellanox.co.il>
Date:   Thu Jul 12 17:50:45 2007 +0300

    mlx4_core: Reset device when internal error is detected
    
    Reset the device when an internal error is detected.
    
    Also, detect errors by polling the error buffer rather than using
    interrupts.  This is more robust and doesn't depend on MSI-X.  Remove
    the old interrupt handler entirely, since we don't want to support two
    mechanisms for detecting internal errors.
    
    Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c
index 1bb088a..6b32ec9 100644
--- a/drivers/net/mlx4/catas.c
+++ b/drivers/net/mlx4/catas.c
@@ -30,41 +30,133 @@
  * SOFTWARE.
  */
 
+#include <linux/workqueue.h>
+
 #include "mlx4.h"
 
-void mlx4_handle_catas_err(struct mlx4_dev *dev)
+enum {
+	MLX4_CATAS_POLL_INTERVAL	= 5 * HZ,
+};
+
+static DEFINE_SPINLOCK(catas_lock);
+
+static LIST_HEAD(catas_list);
+static struct workqueue_struct *catas_wq;
+static struct work_struct catas_work;
+
+static int internal_err_reset = 1;
+module_param(internal_err_reset, int, 0644);
+MODULE_PARM_DESC(internal_err_reset,
+		 "Reset device on internal errors if non-zero (default 1)");
+
+static void dump_err_buf(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
 	int i;
 
-	mlx4_err(dev, "Catastrophic error detected:\n");
+	mlx4_err(dev, "Internal error detected:\n");
 	for (i = 0; i < priv->fw.catas_size; ++i)
 		mlx4_err(dev, "  buf[%02x]: %08x\n",
 			 i, swab32(readl(priv->catas_err.map + i)));
+}
 
-	mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0);
+static void poll_catas(unsigned long dev_ptr)
+{
+	struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr;
+	struct mlx4_priv *priv = mlx4_priv(dev);
+
+	if (readl(priv->catas_err.map)) {
+		dump_err_buf(dev);
+
+		mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0);
+
+		if (internal_err_reset) {
+			spin_lock(&catas_lock);
+			list_add(&priv->catas_err.list, &catas_list);
+			spin_unlock(&catas_lock);
+
+			queue_work(catas_wq, &catas_work);
+		}
+	} else
+		mod_timer(&priv->catas_err.timer,
+			  round_jiffies(jiffies + MLX4_CATAS_POLL_INTERVAL));
 }
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev)
+static void catas_reset(struct work_struct *work)
+{
+	struct mlx4_priv *priv, *tmppriv;
+	struct mlx4_dev *dev;
+
+	LIST_HEAD(tlist);
+	int ret;
+
+	spin_lock_irq(&catas_lock);
+	list_splice_init(&catas_list, &tlist);
+	spin_unlock_irq(&catas_lock);
+
+	list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) {
+		ret = mlx4_restart_one(priv->dev.pdev);
+		dev = &priv->dev;
+		if (ret)
+			mlx4_err(dev, "Reset failed (%d)\n", ret);
+		else
+			mlx4_dbg(dev, "Reset succeeded\n");
+	}
+}
+
+void mlx4_start_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	unsigned long addr;
 
+	INIT_LIST_HEAD(&priv->catas_err.list);
+	init_timer(&priv->catas_err.timer);
+	priv->catas_err.map = NULL;
+
 	addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) +
 		priv->fw.catas_offset;
 
 	priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4);
-	if (!priv->catas_err.map)
-		mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n",
+	if (!priv->catas_err.map) {
+		mlx4_warn(dev, "Failed to map internal error buffer at 0x%lx\n",
 			  addr);
+		return;
+	}
 
+	priv->catas_err.timer.data     = (unsigned long) dev;
+	priv->catas_err.timer.function = poll_catas;
+	priv->catas_err.timer.expires  =
+		round_jiffies(jiffies + MLX4_CATAS_POLL_INTERVAL);
+	add_timer(&priv->catas_err.timer);
 }
 
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev)
+void mlx4_stop_catas_poll(struct mlx4_dev *dev)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
+	del_timer_sync(&priv->catas_err.timer);
+
 	if (priv->catas_err.map)
 		iounmap(priv->catas_err.map);
+
+	spin_lock_irq(&catas_lock);
+	list_del(&priv->catas_err.list);
+	spin_unlock_irq(&catas_lock);
+}
+
+int __init mlx4_catas_init(void)
+{
+	INIT_WORK(&catas_work, catas_reset);
+
+	catas_wq = create_singlethread_workqueue("mlx4_err");
+	if (!catas_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void mlx4_catas_cleanup(void)
+{
+	destroy_workqueue(catas_wq);
 }
diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c
index 27a82ce..2095c84 100644
--- a/drivers/net/mlx4/eq.c
+++ b/drivers/net/mlx4/eq.c
@@ -89,14 +89,12 @@ struct mlx4_eq_context {
 			       (1ull << MLX4_EVENT_TYPE_PATH_MIG_FAILED)    | \
 			       (1ull << MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \
 			       (1ull << MLX4_EVENT_TYPE_WQ_ACCESS_ERROR)    | \
-			       (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR)  | \
 			       (1ull << MLX4_EVENT_TYPE_PORT_CHANGE)	    | \
 			       (1ull << MLX4_EVENT_TYPE_ECC_DETECT)	    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_CATAS_ERROR)    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE)    | \
 			       (1ull << MLX4_EVENT_TYPE_SRQ_LIMIT)	    | \
 			       (1ull << MLX4_EVENT_TYPE_CMD))
-#define MLX4_CATAS_EVENT_MASK  (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR)
 
 struct mlx4_eqe {
 	u8			reserved1;
@@ -264,7 +262,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr)
 
 	writel(priv->eq_table.clr_mask, priv->eq_table.clr_int);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]);
 
 	return IRQ_RETVAL(work);
@@ -281,14 +279,6 @@ static irqreturn_t mlx4_msi_x_interrupt(int irq, void *eq_ptr)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr)
-{
-	mlx4_handle_catas_err(dev_ptr);
-
-	/* MSI-X vectors always belong to us */
-	return IRQ_HANDLED;
-}
-
 static int mlx4_MAP_EQ(struct mlx4_dev *dev, u64 event_mask, int unmap,
 			int eq_num)
 {
@@ -490,11 +480,9 @@ static void mlx4_free_irqs(struct mlx4_dev *dev)
 
 	if (eq_table->have_irq)
 		free_irq(dev->pdev->irq, dev);
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		if (eq_table->eq[i].have_irq)
 			free_irq(eq_table->eq[i].irq, eq_table->eq + i);
-	if (eq_table->eq[MLX4_EQ_CATAS].have_irq)
-		free_irq(eq_table->eq[MLX4_EQ_CATAS].irq, dev);
 }
 
 static int __devinit mlx4_map_clr_int(struct mlx4_dev *dev)
@@ -598,32 +586,19 @@ int __devinit mlx4_init_eq_table(struct mlx4_dev *dev)
 	if (dev->flags & MLX4_FLAG_MSI_X) {
 		static const char *eq_name[] = {
 			[MLX4_EQ_COMP]  = DRV_NAME " (comp)",
-			[MLX4_EQ_ASYNC] = DRV_NAME " (async)",
-			[MLX4_EQ_CATAS] = DRV_NAME " (catas)"
+			[MLX4_EQ_ASYNC] = DRV_NAME " (async)"
 		};
 
-		err = mlx4_create_eq(dev, 1, MLX4_EQ_CATAS,
-				     &priv->eq_table.eq[MLX4_EQ_CATAS]);
-		if (err)
-			goto err_out_async;
-
-		for (i = 0; i < MLX4_EQ_CATAS; ++i) {
+		for (i = 0; i < MLX4_NUM_EQ; ++i) {
 			err = request_irq(priv->eq_table.eq[i].irq,
 					  mlx4_msi_x_interrupt,
 					  0, eq_name[i], priv->eq_table.eq + i);
 			if (err)
-				goto err_out_catas;
+				goto err_out_async;
 
 			priv->eq_table.eq[i].have_irq = 1;
 		}
 
-		err = request_irq(priv->eq_table.eq[MLX4_EQ_CATAS].irq,
-				  mlx4_catas_interrupt, 0,
-				  eq_name[MLX4_EQ_CATAS], dev);
-		if (err)
-			goto err_out_catas;
-
-		priv->eq_table.eq[MLX4_EQ_CATAS].have_irq = 1;
 	} else {
 		err = request_irq(dev->pdev->irq, mlx4_interrupt,
 				  IRQF_SHARED, DRV_NAME, dev);
@@ -639,22 +614,11 @@ int __devinit mlx4_init_eq_table(struct mlx4_dev *dev)
 		mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n",
 			   priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		eq_set_ci(&priv->eq_table.eq[i], 1);
 
-	if (dev->flags & MLX4_FLAG_MSI_X) {
-		err = mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 0,
-				  priv->eq_table.eq[MLX4_EQ_CATAS].eqn);
-		if (err)
-			mlx4_warn(dev, "MAP_EQ for catas EQ %d failed (%d)\n",
-				  priv->eq_table.eq[MLX4_EQ_CATAS].eqn, err);
-	}
-
 	return 0;
 
-err_out_catas:
-	mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]);
-
 err_out_async:
 	mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]);
 
@@ -675,19 +639,13 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	int i;
 
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 1,
-			    priv->eq_table.eq[MLX4_EQ_CATAS].eqn);
-
 	mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 1,
 		    priv->eq_table.eq[MLX4_EQ_ASYNC].eqn);
 
 	mlx4_free_irqs(dev);
 
-	for (i = 0; i < MLX4_EQ_CATAS; ++i)
+	for (i = 0; i < MLX4_NUM_EQ; ++i)
 		mlx4_free_eq(dev, &priv->eq_table.eq[i]);
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]);
 
 	mlx4_unmap_clr_int(dev);
 
diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c
index 9ae951b..be5d9e9 100644
--- a/drivers/net/mlx4/intf.c
+++ b/drivers/net/mlx4/intf.c
@@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev *dev)
 		mlx4_add_device(intf, priv);
 
 	mutex_unlock(&intf_mutex);
+	mlx4_start_catas_poll(dev);
 
 	return 0;
 }
@@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_interface *intf;
 
+	mlx4_stop_catas_poll(dev);
 	mutex_lock(&intf_mutex);
 
 	list_for_each_entry(intf, &intf_list, list)
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index a4f2e04..e8f45e6 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -583,13 +583,11 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev)
 		goto err_pd_table_free;
 	}
 
-	mlx4_map_catas_buf(dev);
-
 	err = mlx4_init_eq_table(dev);
 	if (err) {
 		mlx4_err(dev, "Failed to initialize "
 			 "event queue table, aborting.\n");
-		goto err_catas_buf;
+		goto err_mr_table_free;
 	}
 
 	err = mlx4_cmd_use_events(dev);
@@ -659,8 +657,7 @@ err_cmd_poll:
 err_eq_table_free:
 	mlx4_cleanup_eq_table(dev);
 
-err_catas_buf:
-	mlx4_unmap_catas_buf(dev);
+err_mr_table_free:
 	mlx4_cleanup_mr_table(dev);
 
 err_pd_table_free:
@@ -836,9 +833,6 @@ err_cleanup:
 	mlx4_cleanup_cq_table(dev);
 	mlx4_cmd_use_polling(dev);
 	mlx4_cleanup_eq_table(dev);
-
-	mlx4_unmap_catas_buf(dev);
-
 	mlx4_cleanup_mr_table(dev);
 	mlx4_cleanup_pd_table(dev);
 	mlx4_cleanup_uar_table(dev);
@@ -885,9 +879,6 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev)
 		mlx4_cleanup_cq_table(dev);
 		mlx4_cmd_use_polling(dev);
 		mlx4_cleanup_eq_table(dev);
-
-		mlx4_unmap_catas_buf(dev);
-
 		mlx4_cleanup_mr_table(dev);
 		mlx4_cleanup_pd_table(dev);
 
@@ -908,6 +899,12 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev)
 	}
 }
 
+int mlx4_restart_one(struct pci_dev *pdev)
+{
+	mlx4_remove_one(pdev);
+	return mlx4_init_one(pdev, NULL);
+}
+
 static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
@@ -930,6 +927,10 @@ static int __init mlx4_init(void)
 {
 	int ret;
 
+	ret = mlx4_catas_init();
+	if (ret)
+		return ret;
+
 	ret = pci_register_driver(&mlx4_driver);
 	return ret < 0 ? ret : 0;
 }
@@ -937,6 +938,7 @@ static int __init mlx4_init(void)
 static void __exit mlx4_cleanup(void)
 {
 	pci_unregister_driver(&mlx4_driver);
+	mlx4_catas_cleanup();
 }
 
 module_init(mlx4_init);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index d9c91a7..be304a7 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -39,6 +39,7 @@
 
 #include <linux/mutex.h>
 #include <linux/radix-tree.h>
+#include <linux/timer.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/doorbell.h>
@@ -67,7 +68,6 @@ enum {
 enum {
 	MLX4_EQ_ASYNC,
 	MLX4_EQ_COMP,
-	MLX4_EQ_CATAS,
 	MLX4_NUM_EQ
 };
 
@@ -248,7 +248,8 @@ struct mlx4_mcg_table {
 
 struct mlx4_catas_err {
 	u32 __iomem	       *map;
-	int			size;
+	struct timer_list	timer;
+	struct list_head	list;
 };
 
 struct mlx4_priv {
@@ -311,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
 void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
 
-void mlx4_map_catas_buf(struct mlx4_dev *dev);
-void mlx4_unmap_catas_buf(struct mlx4_dev *dev);
-
+void mlx4_start_catas_poll(struct mlx4_dev *dev);
+void mlx4_stop_catas_poll(struct mlx4_dev *dev);
+int mlx4_catas_init(void);
+void mlx4_catas_cleanup(void);
+int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type,


From rdreier at cisco.com  Tue Jul 17 20:51:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:51:38 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads
	per qp
In-Reply-To: <200707171311.43680.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 17 Jul 2007 13:11:43 +0300")
References: <200707171311.43680.jackm@dev.mellanox.co.il>
Message-ID: <adaodiamks5.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Tue Jul 17 20:59:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:59:37 -0700
Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp
In-Reply-To: <200707151058.55805.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sun, 15 Jul 2007 10:58:55 +0300")
References: <200706211229.08703.jackm@dev.mellanox.co.il>
	<adawsx54507.fsf@cisco.com>
	<200707151058.55805.jackm@dev.mellanox.co.il>
Message-ID: <adak5symkeu.fsf@cisco.com>

OK, I think I'll merge this, since it seems cleaner to me.

commit 7f5eb9bb8c7fb3bd411674b856872d7ab4a7b1a3
Author: Roland Dreier <rolandd at cisco.com>
Date:   Tue Jul 17 20:59:02 2007 -0700

    IB/mlx4: Return receive queue sizes for userspace QPs from query QP
    
    Return the receive queue sizes for both userspace QPs and kernel QPs
    (not just kernel QPs) from mlx4_ib_query_qp().  Also zero the send
    queue sizes for userspace QPs to avoid a possible information leak,
    and set the max_inline_data for kernel QPs to 0 since inline sends are
    not supported for kernel QPs.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 0793059..8d09aa3 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1581,17 +1581,25 @@ int mlx4_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr
 
 done:
 	qp_attr->cur_qp_state	     = qp_attr->qp_state;
+	qp_attr->cap.max_recv_wr     = qp->rq.wqe_cnt;
+	qp_attr->cap.max_recv_sge    = qp->rq.max_gs;
+
 	if (!ibqp->uobject) {
-		qp_attr->cap.max_send_wr     = qp->sq.wqe_cnt;
-		qp_attr->cap.max_recv_wr     = qp->rq.wqe_cnt;
-		qp_attr->cap.max_send_sge    = qp->sq.max_gs;
-		qp_attr->cap.max_recv_sge    = qp->rq.max_gs;
-		qp_attr->cap.max_inline_data = (1 << qp->sq.wqe_shift) -
-			send_wqe_overhead(qp->ibqp.qp_type) -
-			sizeof (struct mlx4_wqe_inline_seg);
-		qp_init_attr->cap	     = qp_attr->cap;
+		qp_attr->cap.max_send_wr  = qp->sq.wqe_cnt;
+		qp_attr->cap.max_send_sge = qp->sq.max_gs;
+	} else {
+		qp_attr->cap.max_send_wr  = 0;
+		qp_attr->cap.max_send_sge = 0;
 	}
 
+	/*
+	 * We don't support inline sends for kernel QPs (yet), and we
+	 * don't know what userspace's value should be.
+	 */
+	qp_attr->cap.max_inline_data = 0;
+
+	qp_init_attr->cap	     = qp_attr->cap;
+
 	return 0;
 }
 

From rdreier at cisco.com  Tue Jul 17 21:07:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 21:07:51 -0700
Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp
In-Reply-To: <200706211229.08703.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 21 Jun 2007 12:29:08 +0300")
References: <200706211229.08703.jackm@dev.mellanox.co.il>
Message-ID: <adafy3mmk14.fsf@cisco.com>

thanks, applied.


From mst at dev.mellanox.co.il  Tue Jul 17 21:31:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 07:31:59 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184729415.5165.570.camel@firewall.xsintricity.com>
References: <20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
	<20070718021854.GD19243@mellanox.co.il>
	<1184729415.5165.570.camel@firewall.xsintricity.com>
Message-ID: <20070718043159.GA28541@mellanox.co.il>

> Quoting Doug Ledford <dledford at redhat.com>:
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> 
> On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote:
> > > Quoting Doug Ledford <dledford at redhat.com>:
> > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > > 
> > > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote:
> > > > > Quoting Roland Dreier <rdreier at cisco.com>:
> > > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
> > > > > 
> > > > >  > I don't really think we want customers to run beta code
> > > > > 
> > > > > What's the point of a beta then??
> > > > 
> > > > Donnu.
> > > > In previous OFED releases, we had "release candidates" rather than "beta".
> > > > Openfabrics members were running RCs and reporting issues on the list and in
> > > > bugzilla. Do you really ask your customers to do this for you?
> > > 
> > > Sure, as much as possible.  I generally don't recommend using it in
> > > production, but just as close as they can get to production is fine with
> > > me.  The more issues they find while I'm still actually working on it
> > > and making new revisions, the less issues they'll find after I stupidly
> > > think I'm done.
> > 
> > > So, Roland's idea of sticking a date in RPM revision will work, won't it?
> 
> As long as you don't do two package builds on the same day.  That's why
> my script encodes both an increasing number and the date into the
> revision.
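
To make the numbering concrete, here is a minimal sketch of a counter-plus-date
revision. The function name and the counter-file argument are hypothetical; the
0.<count>.<date>git layout follows the RELEASE string that the quoted script
builds for daily snapshots.

```shell
# Hypothetical helper: bump a per-package build counter and combine it
# with the date, so two builds on the same day still get distinct,
# increasing revisions.
next_snapshot_release() {
	local counter_file=$1
	local stamp=${2:-$(date +%Y%m%d)}
	local count=0

	# Reuse the stored counter if there is one; otherwise start from 0.
	if [ -f "$counter_file" ]; then
		count=$(cat "$counter_file")
	fi
	count=$((count + 1))
	echo "$count" > "$counter_file"

	echo "0.${count}.${stamp}git"
}
```

Calling it twice on the same day yields 0.1.<date>git and then 0.2.<date>git,
which is what keeps rpm's version comparison happy.
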
> 
> For reference, I'll attach the updated script I made for spitting out a
> buildable tarball.
> 
> Hehehe...resending because the ofa list server ate my message due to the
> script attachment :-D  I'll inline it instead.
> 
> I guess I'll also mention that this script exists in my ~/repos/upstream
> directory, and also in that directory are all the git repos that I have
> cloned from ofa (as well as other places).  So, it's one level above all
> the various git clones and spits everything out into dist/.  The easiest
> way to use this script for any given package you want to create a daily
> snapshot of is to run ./make.dist repodir daily; scp
> dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads.  That
> simple action would (assuming you create a reasonable reponame.spec.in
> file in the repos that are missing one) spit out a tarball that can be
> passed directly to rpmbuild --rebuild reponame-git.tgz and rpm will spit
> out the packages, and the repodir-daily.HEAD file shows the HEAD of the
> git repo so you know exactly what state the tarball represents and you
> can always get to it in another more recent repo by just updating to
> that commit as head of tree.

Thanks for the script.
In OFED, since we control the upstream, I think we'll try to do
as much as possible at the package level, for example make sure
that each package has a reasonable spec file.
Some ideas on how we might want to do this below.

> #!/bin/bash
> 
> usage() {
> echo "$0 repo daily | release [ signed | <key-id> ]"
> echo
> echo "	You must specify the repo to make a distribution tarball in.  This"
> echo "script will not work with complex repos like the management repo that"
> echo "builds more than one package.  It expects a repo to be a single package"
> echo "repo where the directory name and the package name are the same, and"
> echo "where a properly formatted reponame.spec.in file exists."
> echo
> echo "	You must specify either release or daily in order for this script"
> echo "to make tarballs.  If this is a daily release, the tarballs will"
> echo "be named <component>-git.tgz and will overwrite existing tarballs."
> echo "If this is a release build, then the tarball will be named"
> echo "<component>-<version>.tgz and must be a new file.  In addition,"
> echo "the script will add a new set of symbolic tags to the git repo"
> echo "that correspond to the <component>-<version> of each tarball."
> echo
> echo "	If the script detects that the tag on any component already exists,"
> echo "it will abort the release and prompt you to update the version on"
> echo "the already tagged component.  This enforces the proper behavior of"
> echo "treating any released tarball as set in stone so that in the future"
> echo "you will always be able to get to any given release tarball by"
> echo "checking out the git tag and know with certainty that it is the same"
> echo "code as released before even if you no longer have the same tarball"
> echo "around."
> echo
> echo "	As part of this process, the script will parse the <target>.spec.in"
> echo "file and output a <target>.spec file.  Since this script isn't smart"
> echo "enough to deal with other random changes that should have their own" 
> echo "checkin the script will refuse to run if the current repo state is not"
> echo "clean."
> echo
> echo "	NOTE: the script has no clue if you are tagging on the right branch,"
> echo "it will however show you the git branch output so you can confirm it"
> echo "is on the right branch before proceeding with the release."
> echo
> echo "	In addition to just tagging the git repo, whenever creating a release"
> echo "there is an optional argument of either signed or a hex gpg key-id."
> echo "If you do not pass an argument to release, then the tag will be a"
> echo "simple git annotated tag.  If you pass signed as the argument, the"
> echo "git tag operation will use your default signing key to sign the tag."
> echo "Or you can pass an actual gpg key id in hex format and git will sign"
> echo "the tag with that key."
> echo 
> }
> 
> if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi
> 
> if [ ! -d "$1" ]; then usage; exit 1; fi
> 
> TMPDIR=dist
> if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi
> 
> if [ "$2" = "daily" -o "$2" = "release" ]; then
> 	if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
> 		touch $TMPDIR/$1-$2.HEAD
> 	fi
> 	NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
> else
> 	usage
> 	exit 1
> fi
> 
> cd "$1"
> echo "Updating git repo..."
> git pull
> RESULT=$?
> HEAD=`git log --pretty=oneline -1`
> 
> if [ "$RESULT" -ne 0 ]; then
> 	echo "Failed to update the git repo cleanly, manual intervention required"
> 	exit 1
> fi

pull really will merge your local modifications with upstream.
In OFED we really want just git clone, and use upstream code unmodified.
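
One way to get that effect, sketched under assumptions: clone into a fresh
directory on every run, so there is nothing local to merge. The function name
and its arguments are made up for this example; git clone and git rev-parse
are standard commands.

```shell
# Hypothetical helper: export pristine upstream code into a new directory.
# Refuses to reuse an existing directory, so no local changes can leak in.
fresh_upstream_checkout() {
	local url=$1 dest=$2

	if [ -e "$dest" ]; then
		echo "$dest already exists, refusing to reuse it" >&2
		return 1
	fi

	git clone -q "$url" "$dest" || return 1

	# Print the commit that was checked out, so the snapshot can be
	# traced back to an exact upstream state.
	(cd "$dest" && git rev-parse HEAD)
}
```

The printed commit id could be stored next to the tarball, much like the
.HEAD file the script already writes.
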

> if [ "$HEAD" = "$NEWHEAD" ]; then
> 	echo "No new commits since last tarball creation, nothing to do."
> 	cd ..
> 	exit 0
> fi
> 
> if [ "$2" = "release" ]; then
> 	# Is the repo clean?
> 	git status | grep modified > /dev/null 2>&1
> 	if [ $? = 0 ]; then
> 		echo "There are modified files in the repo.  Please check any"
> 		echo "changes in before proceeding."
> 		exit 4
> 	fi
> 	# Since we will be tagging things, make sure we are on the right
> 	# branch
> 	git branch
> 	echo -n "Is the active branch the right one to tag this release on [y/N]? "
> 	read answer
> 	if [ "$answer" = y -o "$answer" = Y ]; then
> 		echo "Proceeding..."
> 	else
> 		echo "Please check out the right branch and run make.dist again"
> 		exit 0
> 	fi

See below on what we should do in OFED IMO.

> 	# Check versions to make sure that we can proceed
> 	VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
> 	TARBALL=$1-$VERSION.tgz
> 	if [ -f ../$TMPDIR/$TARBALL ]; then
> 		echo "Target $TARBALL already exists, please update the version of"
> 		echo "$1"
> 		exit 2
> 	fi
> 	if [ ! -z "`git tag -l $1-$VERSION`" ]; then
> 		echo "A git tag already exists for $1-$VERSION.  Please change the version"
> 		echo "of $1 so a tag replacement won't occur."
> 		exit 3
> 	fi
> # On a real release, this resets the daily release starting point, on the
> # assumption that any new daily builds will have a version number that is
> # incrementally higher than the last officially released tarball.
> 	RELEASE=1
> 	echo $RELEASE > ../$TMPDIR/$1.release
> else
> 	DATE=`date +%Y%m%d`
> 	if [ -f ../$TMPDIR/$1.release ]; then
> 		RELEASE=`cat ../$TMPDIR/$1.release`
> 		RELEASE=`expr $RELEASE + 1`
> 	else
> 		RELEASE=1
> 	fi
> 	echo $RELEASE > ../$TMPDIR/$1.release
> 	RELEASE=0.${RELEASE}.${DATE}git
> 	TARBALL=$1-git.tgz
> fi
> 
> cd ..
> cp -a $1 $1-$VERSION
> [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec

This, I think, is the bit that we definitely want to reuse.
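
Pulled out on its own, the substitution step might look like the sketch below.
The function name is hypothetical; the @VERSION@, @RELEASE@ and @TARBALL@
placeholders are the ones the script already uses. It assumes the values
contain no '/' or '&', which sed would treat specially, and unlike the
original line it replaces every occurrence on a line, not just the first.

```shell
# Hypothetical helper: fill in a spec template read from stdin.
# Usage: fill_spec VERSION RELEASE TARBALL < pkg.spec.in > pkg.spec
fill_spec() {
	local version=$1 release=$2 tarball=$3

	sed -e "s/@VERSION@/${version}/g" \
	    -e "s/@RELEASE@/${release}/g" \
	    -e "s/@TARBALL@/${tarball}/g"
}
```
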

> if [ -f $1-$VERSION/autogen.sh ]; then
> 	cd $1-$VERSION
> 	./autogen.sh
> 	cd ..

I think we will want to call make dist too.

> fi
> echo "Creating $TMPDIR/$TARBALL"
> tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION
> rm -rf $1-$VERSION
> echo "$HEAD" > $TMPDIR/$1-$2.HEAD
> 
> if [ $2 = release ]; then
> 	echo "Tagging release."
> 	cd $1
> 	if [ ! -z "$3" ]; then
> 		if [ $3 = "signed" ]; then
> 			git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 		else
> 			git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 		fi
> 	else
> 		git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 	fi
> 	cd ..
> fi

This takes whatever's at git head and then tags that.
In OFED it is the other way around: maintainers tag
the appropriate bits, release script just packages that.
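
A rough sketch of that ordering, assuming the maintainer has already created
the tag: the script only checks that the tag exists and archives exactly that
tree. The function name and the output layout are hypothetical; git rev-parse
--verify and git archive are standard git commands.

```shell
# Hypothetical helper: package an already-tagged release.
# Must be run inside the repo; it never creates or moves a tag.
package_tag() {
	local tag=$1 outdir=$2

	# Fail if the maintainer has not tagged this release yet.
	if ! git rev-parse -q --verify "refs/tags/$tag" >/dev/null; then
		echo "tag $tag does not exist, nothing to package" >&2
		return 1
	fi

	mkdir -p "$outdir" || return 1
	# Archive the tagged tree only; uncommitted files are not included.
	git archive --format=tar --prefix="$tag/" "$tag" | gzip > "$outdir/$tag.tgz"
}
```

Because the tarball comes from the tag rather than the working tree, whatever
happens to be checked out at the time does not affect the result.
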

-- 
MST


From hal.rosenstock at gmail.com  Tue Jul 17 21:40:07 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Jul 2007 00:40:07 -0400
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adaabtuo0n9.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
Message-ID: <f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>

On 7/17/07, Roland Dreier <rdreier at cisco.com> wrote:
>
> > > But to be fair, it will be difficult to enable both QoS and local PR
> > > caching.  To me, this would be the strongest reason against using it.
> > > However, QoS places additional burden on the SA, which will make
> > > scaling even more challenging.
> >
> > my understanding is that the local sa does a path-query where all the
> > fields except for the SGID are wildcard-ed. This means we expect the
> > result to be a table of all the paths from this port to every other
> > port on the fabrics for every pkey which this port is a member of etc,
> > correct?
> >
> > How do you plug here the QoS concept of SID in the path query? are you
> > expecting the SA to realize what are all the services for which this
> > port is a "member"? does the proposed definition for QoS management at
> > the SA defines "services per gids" isn't it "what SL to user per Service"?
>
> Or, thanks for rescuing this post.
>
> I think this is an important question.  If we merge the local SA
> stuff, then are we creating a problem for dealing with QoS?  Are we
> going to have to revert the local SA stuff once the QoS stuff is
> available?  Or is there at least a sketch of a plan on how to handle
> this?


Is the worst case that local SA cache and QoS on an end node are mutually
exclusive ? I think there is a way to shut off the local SA cache.

-- Hal

- R.

From kliteyn at mellanox.co.il  Tue Jul 17 21:44:34 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 18 Jul 2007 07:44:34 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-18:normal completion
Message-ID: <MTLEXCH01PYzDabigPa0000180a@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From dledford at redhat.com  Tue Jul 17 21:56:59 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 18 Jul 2007 04:56:59 +0000
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070718043159.GA28541@mellanox.co.il>
References: <20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
	<20070718021854.GD19243@mellanox.co.il>
	<1184729415.5165.570.camel@firewall.xsintricity.com>
	<20070718043159.GA28541@mellanox.co.il>
Message-ID: <1184734619.5165.579.camel@firewall.xsintricity.com>

On Wed, 2007-07-18 at 07:31 +0300, Michael S. Tsirkin wrote:

> Thanks for the script.
> In OFED, since we control the upstream, I think we'll try to do
> as much as possible at the package level, for example make sure
> that each package has a reasonable spec file.
> Some ideas on how we might want to do this below.
> 
> > #!/bin/bash
> > 
> > usage() {
> > echo "$0 repo daily | release [ signed | <key-id> ]"
> > echo
> > echo "	You must specify the repo to make a distribution tarball in.  This"
> > echo "script will not work with complex repos like the management repo that"
> > echo "builds more than one package.  It expects a repo to be a single package"
> > echo "repo where the directory name and the package name are the same, and"
> > echo "where a properly formatted reponame.spec.in file exists."
> > echo
> > echo "	You must specify either release or daily in order for this script"
> > echo "to make tarballs.  If this is a daily release, the tarballs will"
> > echo "be named <component>-git.tgz and will overwrite existing tarballs."
> > echo "If this is a release build, then the tarball will be named"
> > echo "<component>-<version>.tgz and must be a new file.  In addition,"
> > echo "the script will add a new set of symbolic tags to the git repo"
> > echo "that correspond to the <component>-<version> of each tarball."
> > echo
> > echo "	If the script detects that the tag on any component already exists,"
> > echo "it will abort the release and prompt you to update the version on"
> > echo "the already tagged component.  This enforces the proper behavior of"
> > echo "treating any released tarball as set in stone so that in the future"
> > echo "you will always be able to get to any given release tarball by"
> > echo "checking out the git tag and know with certainty that it is the same"
> > echo "code as released before even if you no longer have the same tarball"
> > echo "around."
> > echo
> > echo "	As part of this process, the script will parse the <target>.spec.in"
> > echo "file and output a <target>.spec file.  Since this script isn't smart"
> > echo "enough to deal with other random changes that should have their own" 
> > echo "checkin the script will refuse to run if the current repo state is not"
> > echo "clean."
> > echo
> > echo "	NOTE: the script has no clue if you are tagging on the right branch,"
> > echo "it will however show you the git branch output so you can confirm it"
> > echo "is on the right branch before proceeding with the release."
> > echo
> > echo "	In addition to just tagging the git repo, whenever creating a release"
> > echo "there is an optional argument of either signed or a hex gpg key-id."
> > echo "If you do not pass an argument to release, then the tag will be a"
> > echo "simple git annotated tag.  If you pass signed as the argument, the"
> > echo "git tag operation will use your default signing key to sign the tag."
> > echo "Or you can pass an actual gpg key id in hex format and git will sign"
> > echo "the tag with that key."
> > echo 
> > }
> > 
> > if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi
> > 
> > if [ ! -d "$1" ]; then usage; exit 1; fi
> > 
> > TMPDIR=dist
> > if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi
> > 
> > if [ "$2" = "daily" -o "$2" = "release" ]; then
> > 	if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
> > 		touch $TMPDIR/$1-$2.HEAD
> > 	fi
> > 	NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
> > else
> > 	usage
> > 	exit 1
> > fi
> > 
> > cd "$1"
> > echo "Updating git repo..."
> > git pull
> > RESULT=$?
> > HEAD=`git log --pretty=oneline -1`
> > 
> > if [ "$RESULT" -ne 0 ]; then
> > 	echo "Failed to update the git repo cleanly, manual intervention required"
> > 	exit 1
> > fi
> 
> pull really will merge your local modifications with upstream.
> In OFED we really want just git clone, and use upstream code unmodified.

That depends on how you have your repos set up.  I keep separate repos
for tracking upstream and for doing local work.  You can run this on
either repo, and it will either give you a clean upstream copy or your
local copy merged up to date (assuming you are on a branch that merges
from upstream, otherwise if you are on a local branch, the pull will
update the repo, but not your checked out files).

This script is a little schizophrenic at the moment because it acts
like it's both a customer of the git repo and a master of the git repo.
In truth, the release part of the script was only ever intended to be
used by a maintainer in a clean master repo.  The daily part of the
script can be used by anyone who wants to spit out quick daily builds.
But, if you are a consumer of the repo instead of the maintainer, then
for the daily builds you need to update the repo.

So, in short, the daily part is usable by anyone tracking development of
any repo and will pull from the upstream repo to keep up to date.  The
release functionality should only be used by maintainers, and then only
in their master repo.  Make more sense that way?

> > if [ "$HEAD" = "$NEWHEAD" ]; then
> > 	echo "No new commits since last tarball creation, nothing to do."
> > 	cd ..
> > 	exit 0
> > fi
> > 
> > if [ "$2" = "release" ]; then
> > 	# Is the repo clean?
> > 	git status | grep modified > /dev/null 2>&1
> > 	if [ $? = 0 ]; then
> > 		echo "There are modified files in the repo.  Please check any"
> > 		echo "changes in before proceeding."
> > 		exit 4
> > 	fi
> > 	# Since we will be tagging things, make sure we are on the right
> > 	# branch
> > 	git branch
> > 	echo -n "Is the active branch the right one to tag this release on [y/N]? "
> > 	read answer
> > 	if [ "$answer" = y -o "$answer" = Y ]; then
> > 		echo "Proceeding..."
> > 	else
> > 		echo "Please check out the right branch and run make.dist again"
> > 		exit 0
> > 	fi
> 
> See below on what we should do in OFED IMO.
> 
> > 	# Check versions to make sure that we can proceed
> > 	VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
> > 	TARBALL=$1-$VERSION.tgz
> > 	if [ -f ../$TMPDIR/$TARBALL ]; then
> > 		echo "Target $TARBALL already exists, please update the version of"
> > 		echo "$1"
> > 		exit 2
> > 	fi
> > 	if [ ! -z "`git tag -l $1-$VERSION`" ]; then
> > 		echo "A git tag already exists for $1-$VERSION.  Please change the version"
> > 		echo "of $1 so a tag replacement won't occur."
> > 		exit 3
> > 	fi
> > # On a real release, this resets the daily release starting point, on the
> > # assumption that any new daily builds will have a version number that is
> > # incrementally higher than the last officially released tarball.
> > 	RELEASE=1
> > 	echo $RELEASE > ../$TMPDIR/$1.release
> > else
> > 	DATE=`date +%Y%m%d`
> > 	if [ -f ../$TMPDIR/$1.release ]; then
> > 		RELEASE=`cat ../$TMPDIR/$1.release`
> > 		RELEASE=`expr $RELEASE + 1`
> > 	else
> > 		RELEASE=1
> > 	fi
> > 	echo $RELEASE > ../$TMPDIR/$1.release
> > 	RELEASE=0.${RELEASE}.${DATE}git
> > 	TARBALL=$1-git.tgz
> > fi
> > 
> > cd ..
> > cp -a $1 $1-$VERSION
> > [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec
> 
> This, I think, is the bit that we definitely want to reuse.
> 
> > if [ -f $1-$VERSION/autogen.sh ]; then
> > 	cd $1-$VERSION
> > 	./autogen.sh
> > 	cd ..
> 
> I think we will want to call make dist too.

As long as make dist doesn't remove vital files, then sure.  Gonna have
to run configure to run make dist so you'll have to add both calls.

> > fi
> > echo "Creating $TMPDIR/$TARBALL"
> > tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION
> > rm -rf $1-$VERSION
> > echo "$HEAD" > $TMPDIR/$1-$2.HEAD
> > 
> > if [ $2 = release ]; then
> > 	echo "Tagging release."
> > 	cd $1
> > 	if [ ! -z "$3" ]; then
> > 		if [ $3 = "signed" ]; then
> > 			git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> > 		else
> > 			git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> > 		fi
> > 	else
> > 		git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> > 	fi
> > 	cd ..
> > fi
> 
> This takes whatever's at git head and then tags that.
> In OFED it is the other way around: maintainers tag
> the appropriate bits, release script just packages that.

Like I said earlier, the release operation is intended to only be used
by a maintainer, and in this case it just automates the process of
tagging and spitting out a tarball into the same action.  And for
clarity, it tags the head of whatever branch you are on.  So, if you've
been working in the ofed_1_2 branch, and made some changes, and are
ready for a release, it spits out the tarball and tags the ofed_1_2
branch head as the symbolic tag reponame-version.  If you want to work
on master and do a release there, then it works similarly.


-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From jgunthorpe at obsidianresearch.com  Tue Jul 17 22:09:28 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 17 Jul 2007 23:09:28 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
Message-ID: <20070718050928.GA3103@obsidianresearch.com>

On Wed, Jul 18, 2007 at 12:40:07AM -0400, Hal Rosenstock wrote:

>>      I think this is an important question.  If we merge the local SA
>>      stuff, then are we creating a problem for dealing with QoS?  Are we
>>      going to have to revert the local SA stuff once the QoS stuff is
>>      available?  Or is there at least a sketch of a plan on how to handle
>>      this?
>  
> Is the worst case that local SA cache and QoS on an end node are mutually
> exclusive ? I think there is a way to shut off the local SA cache.

IMHO, I still think that without some kind of SM/SA sourced
invalidation mechanism all client side caching (including the ipoib
stuff we have now) is a bad idea. There just isn't any way to maintain
coherence. I think QoS is just a specific case of why. Routers are
also likely to cause similar kinds of headaches. There are a bunch
of other corner cases even without those two.

It seems to me this would be a lot better as a patch set to let a user
space daemon have first dibs at responding to a PR lookup. Then the
labs could have a special daemon that worked with the SA in a
vendor-specific way to do replication and get some big speed
ups. This should be pretty easy if you use a shared filesystem to
distribute a routing database produced by opensm.

But I'm not working on this stuff ;)

Jason


From sean.hefty at intel.com  Tue Jul 17 22:26:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 22:26:35 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adaabtuo0n9.fsf@cisco.com>
Message-ID: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com>

>I think this is an important question.  If we merge the local SA
>stuff, then are we creating a problem for dealing with QoS?

Yes - I do believe that merging PR caching and QoS together will be difficult.
I don't think the problems are insurmountable, but I can't say that I have a
definite solution for how to deal with this. 

My current thoughts are that the purpose of the cache is to increase SA
scalability on large clusters.  We've seen issues running MPI, trying to
establish all-to-all connections, on our 256 node cluster.  (With 4 processes
per node, this results in about 500,000+ PR queries hitting the SA.)  The SA was
swamped with work, and it wasn't trying to enforce QoS requirements across the
cluster.
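
As a rough check of the query count above, here is a minimal sketch. It
assumes one PR query per unordered pair of processes, which the message does
not state explicitly; the function name all_to_all_pairs is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of unordered process pairs in an all-to-all job.
 * Assumption (not from the thread): each pair needs one PR query.
 */
uint64_t all_to_all_pairs(uint64_t nodes, uint64_t procs_per_node)
{
	uint64_t n;

	assert(nodes > 0 && procs_per_node > 0);
	n = nodes * procs_per_node;	/* total number of processes */
	return n * (n - 1) / 2;		/* n choose 2 */
}
```

For 256 nodes with 4 processes each, this gives 1024 * 1023 / 2 = 523,776
pairs, which is consistent with the "about 500,000+" figure.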

I just don't see how an SA that is already having trouble scaling to this number
of nodes will be able to perform the additional task of providing QoS across the
cluster.  It may be that, at least initially, an administrator may need to
select between enabling PR caching or QoS.

>Are we going to have to revert the local SA stuff once the QoS stuff is
>available?

In the best case, the local SA will need enhancements added to the base support.
In the worst case, a user would have to choose between QoS or PR caching.  If
all users choose QoS, then it would make sense to remove the local SA. 

>Or is there at least a sketch of a plan on how to handle this?

This is only a rough idea, and it depends on how the QoS is implemented.  The
idea is to create a local QoS module on each node.  The local QoS modules would
be programmed with basic QoS information.  For example, which types of queries
to handle locally, versus which ones to forward to the SA.  Locally handled
queries would return PRs based on some QoS mapping table.  (I haven't looked
into any details of this.)

Ideally, local QoS modules would be programmed by a QoS master.  This would
require a new vendor-specific protocol, but would allow for a simple distributed
QoS manager.

We will have a better idea of the issues and possible solutions once the QoS
spec is released, and we can hold discussions on it.  I will be working more
details on QoS enhancements starting in the next couple of weeks. 

- Sean


From rdreier at cisco.com  Tue Jul 17 22:39:11 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 22:39:11 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <20070718050928.GA3103@obsidianresearch.com> (Jason Gunthorpe's
	message of "Tue, 17 Jul 2007 23:09:28 -0600")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
	<20070718050928.GA3103@obsidianresearch.com>
Message-ID: <ada7ioymfsw.fsf@cisco.com>

 > IMHO, I still think that without some kind of SM/SA sourced
 > invalidation mechanism all client side caching (including the ipoib
 > stuff we have now) is a bad idea.

But for IPoIB at least doing a path lookup for every packet is
obviously not feasible.  And ARP table aging gives a way to recover
from stale cached data, eventually at least.

In fact this may be a good argument in favor of local SA caching -- by
analogy with IPoIB it makes sense to avoid going to the SA too often.
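
The aging idea can be sketched in user-space C as below. This is only an
illustration under stated assumptions, not the IPoIB or neighbour code: the
struct layout, the slot count, and the names cache_store and cache_lookup are
all hypothetical. An entry older than max_age is dropped on lookup, so the
caller has to query again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 8	/* hypothetical size, chosen for the example */

struct path_entry {
	bool     valid;
	uint16_t dlid;		/* destination LID, used as the lookup key */
	uint64_t stored_at;	/* time the entry was cached, in seconds */
};

struct path_cache {
	struct path_entry slot[CACHE_SLOTS];
	uint64_t max_age;	/* entries older than this are stale */
};

/* Store (or overwrite) the entry for dlid at time now. */
void cache_store(struct path_cache *c, uint16_t dlid, uint64_t now)
{
	struct path_entry *e;

	assert(c);
	e = &c->slot[dlid % CACHE_SLOTS];
	e->valid = true;
	e->dlid = dlid;
	e->stored_at = now;
}

/*
 * Return true on a fresh hit.  A stale entry is invalidated and
 * reported as a miss, so the caller re-queries the SA.
 */
bool cache_lookup(struct path_cache *c, uint16_t dlid, uint64_t now)
{
	struct path_entry *e;

	assert(c);
	e = &c->slot[dlid % CACHE_SLOTS];
	if (!e->valid || e->dlid != dlid)
		return false;
	if (now - e->stored_at > c->max_age) {
		e->valid = false;	/* aged out */
		return false;
	}
	return true;
}
```

With this scheme stale data survives for at most max_age seconds, which is
the "eventually" in the recovery argument above.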


From sean.hefty at intel.com  Tue Jul 17 23:04:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 17 Jul 2007 23:04:54 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <20070718050928.GA3103@obsidianresearch.com>
Message-ID: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>

>IMHO, I still think that without some kind of SM/SA sourced
>invalidation mechanism all client side caching (including the ipoib
>stuff we have now) is a bad idea.

These are not foolproof mechanisms, but the SA does have client re-registration
and GID in/out of service events that the local SA responds to.  Anything beyond
that becomes vendor specific.  The local SA exposes the ability for a user space
application to force an update of the cache, and leaves the refresh policy up to
the user.  In our use model, we force a refresh immediately before starting a
large MPI job.

Nothing precludes a user space daemon from updating the cache at timed
intervals, or from communicating with an SA in some vendor defined way to
maintain coherency.  I'm only trying to provide the kernel framework.  (We can
debate whether another framework would have been better, and I've held this
discussion on the list before...)  I do envision someone creating user space
applications to control refreshes and, with local SA extensions, allow
pre-loading of the cache, updates to specific paths, etc.

We can gain additional benefits by integrating the local SA tighter with the
stack.  For example, the CM could update the local SA on path migration events
or CM message timeouts.

For now, I want to start with a fairly simple framework that's useful and
extensible.  And, IMO, I don't believe that the cache coherency issues are
reason enough alone to prevent merging this patch.

- Sean


From ogerlitz at voltaire.com  Tue Jul 17 23:36:17 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:36:17 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <adatzs2pxmx.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com>	<adazm1urji8.fsf@cisco.com>
	<1184704987.7702.106.camel@hyperion> <adatzs2pxmx.fsf@cisco.com>
Message-ID: <469DB4E1.6080200@voltaire.com>

Roland Dreier wrote:
>  >      I would like to see these features moved upstream.  DOE funded this
>  > work as part of the items we see needing on our large scale IB
>  > deployment (both present and future).  So from at least one big customer
>  > perspective we see this as useful.  
> 
> Does your reference to "present deployment" mean you are running this
> code now?

Indeed, my understanding is that the DOE uses an Open MPI device (I 
think it's called PTE) which is implemented directly over libibverbs, and 
hence no path queries are issued at all. If this is indeed the case, for 
them it's more of a "for-the-future" thing.

Or.


From tziporet at dev.mellanox.co.il  Tue Jul 17 23:41:02 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 09:41:02 +0300
Subject: [ofa-general] OpenIB development help
In-Reply-To: <aday7hepxoa.fsf@cisco.com>
References: <20070717165553.GA10298@vt.edu>
	<ada7ioyrdgy.fsf@cisco.com>	<20070717203428.GA12927@vt.edu>
	<aday7hepxoa.fsf@cisco.com>
Message-ID: <469DB5FE.1010206@mellanox.co.il>

Roland Dreier wrote:
>  > Thanks for replying to mail. I have a some basic understanding of IB. I
>  > have gone through some of the example code in the example directory and
>  > OFED performance test. I noticed that every one of those examples used
>  > TCP to exchange information regarding lid, psn and qpn. My question is
>  > basically that is there any other way to exchange this information using
>  > only IB. Since no hardware supports RD, I have to bite the bullet and
>  > use RC.
>   
Also in the test rdma_lat (under  ~mst/perftest.git) there is an option 
-c that opens the connection using rdmacm
> Look at librdmacm (or libibcm).  They provide higher-level
> abstractions for connection establishment.
There is an example there too: rping, which opens an RC connection with librdmacm

Tziporet


From ogerlitz at voltaire.com  Tue Jul 17 23:43:52 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:43:52 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <ada7ioymfsw.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com>	<4696D1F3.2040507@ichips.intel.com>	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>	<adaabtuo0n9.fsf@cisco.com>	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>	<20070718050928.GA3103@obsidianresearch.com>
	<ada7ioymfsw.fsf@cisco.com>
Message-ID: <469DB6A8.1050107@voltaire.com>

Roland Dreier wrote:
>  > IMHO, I still think that without some kind of SM/SA sourced
>  > invalidation mechanism all client side caching (including the ipoib
>  > stuff we have now) is a bad idea.
> 
> But for IPoIB at least doing a path lookup for every packet is
> obviously not feasible.  And ARP table aging gives a way to recover
> from stale cached data, eventually at least.

> In fact this may be a good argument in favor of local SA caching -- by
> analogy with IPoIB it makes sense to avoid going to the SA too often.

For each neighbour, IPoIB-UD (*) keeps an IB UD Address Handle (AH), so 
the neighbouring subsystem GC mechanism, which does unicast ARP probes 
etc., actually --verifies-- that the cached AH is valid.

With the local SA, even though the network stack has invalidated the AH 
(neighbour), a new path query would not be initiated. If this is also the 
case with the current IPoIB code, it seems to me to be a bug. Actually 
I never managed to understand --why-- there's a need to keep the path 
record in ipoib (except for debugfs reasons) and not only the AH ?!

Or.

(*) for IPoIB-CM it's the same idea: the neighbour points to an IB 
connection and the probe is sent over the connection


From ogerlitz at voltaire.com  Tue Jul 17 23:53:51 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:53:51 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
References: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
Message-ID: <469DB8FF.20604@voltaire.com>

Sean Hefty wrote:

> The local SA exposes the ability for a user space
> application to force an update of the cache, and leaves the refresh policy up to
> the user.  In our use model, we force a refresh immediately before starting a
> large MPI job.

The last statement left me confused... if you refresh the cache before 
you use it (spawn a large MPI job), what does it buy you at all?!

Also how is the forced update mechanism being implemented?

Or.


From ogerlitz at voltaire.com  Wed Jul 18 00:08:29 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 10:08:29 +0300
Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name
In-Reply-To: <adafy3nt79x.fsf@cisco.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<adatzs4wasb.fsf@cisco.com>
	<469C923C.5000307@voltaire.com> <adafy3nt79x.fsf@cisco.com>
Message-ID: <469DBC6D.5030604@voltaire.com>

Roland Dreier wrote:
>  > the patch for itself only fixes a possible confusion created for the
>  > user as of two processes with the same name, however the discussion
>  > evolved to the question of how many threads should be used by the MAD
>  > and CM layers.
> 
> Is there any practical impact of two kernel threads with the same
> name, though?  I have tons of processes that all are "/bin/bash" on my
> box and it doesn't hurt too much.

yes.

When looking at the system for debugging purposes, since I know the mad 
layer uses a thread per device/port, when there is a problem (e.g. 
starvation, crash, deadlock, you name it), it's beneficial to know 
which traffic flow it's related to, so the duplicate name creates 
confusion, that's all.

> The simplest way to make sure all the threads have unique names would
> seem to be just a private counter in the mad module that counts up,
> rather than trying to do device or port number.  Sticking in the last
> character of the device name is obviously too ugly.

OK, this seems quite simple to implement.
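
A minimal user-space sketch of the private-counter idea follows. It is an
illustration only, not the ib_mad code: the function name
next_mad_thread_name and the "ib_mad%u" name format are assumptions, and a
plain static counter like this is not safe against concurrent callers, so
in-kernel code would likely need an atomic counter instead.

```c
#include <assert.h>
#include <stdio.h>

/* Private counter; every call hands out the next sequence number. */
static unsigned int mad_thread_seq;

/*
 * Write a unique thread name into buf.  Returns what snprintf()
 * returns: the length the full name has (excluding the NUL).
 * The "ib_mad%u" format is hypothetical.
 */
int next_mad_thread_name(char *buf, size_t len)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "ib_mad%u", mad_thread_seq++);
}
```

Two successive calls produce two different names regardless of which device
or port the thread serves, which is all the uniqueness requirement asks for.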

However, Michael has sent you a patch that changes the mad layer to use 
only one thread, and I have raised a question: with the current code the 
mad layer uses a thread per device/port and the cm uses a thread per 
cpu. Is this really needed? What is the correct path here?

Some discussion on that is going over this thread.

Or.


From mst at dev.mellanox.co.il  Wed Jul 18 00:28:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 10:28:41 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <ada7ioymfsw.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
	<20070718050928.GA3103@obsidianresearch.com>
	<ada7ioymfsw.fsf@cisco.com>
Message-ID: <20070718072841.GC1115@mellanox.co.il>

> And ARP table aging gives a way to recover
> from stale cached data, eventually at least.

Does it?

$ grep path_list drivers/infiniband/ulp/ipoib/*c
drivers/infiniband/ulp/ipoib/ipoib_main.c:      list_add_tail(&path->list, &priv->path_list);
drivers/infiniband/ulp/ipoib/ipoib_main.c:      list_splice(&priv->path_list, &remove_list);
drivers/infiniband/ulp/ipoib/ipoib_main.c:      INIT_LIST_HEAD(&priv->path_list);
drivers/infiniband/ulp/ipoib/ipoib_main.c:      INIT_LIST_HEAD(&priv->path_list);

In other words we add paths to ipoib specific cache, but we never seem
to *remove* individual paths from cache - we only know how to do
full cache invalidates on events such as port state change.

Right?
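
To make the distinction concrete, here is a small user-space sketch of a
path list that supports both operations. It is a hypothetical illustration,
not the ipoib_main.c code: the struct and the names path_add, path_remove
and path_flush are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct path {
	uint16_t dlid;		/* destination LID, used as the key */
	struct path *next;
};

/* Add a path at the head.  Returns 0 on success, -1 if malloc fails. */
int path_add(struct path **head, uint16_t dlid)
{
	struct path *p;

	assert(head);
	p = malloc(sizeof(*p));
	if (!p)
		return -1;
	p->dlid = dlid;
	p->next = *head;
	*head = p;
	return 0;
}

/* Remove one path.  Returns 1 if it was found and freed, else 0. */
int path_remove(struct path **head, uint16_t dlid)
{
	struct path **pp;

	assert(head);
	for (pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->dlid == dlid) {
			struct path *victim = *pp;

			*pp = victim->next;
			free(victim);
			return 1;
		}
	}
	return 0;
}

/* Full invalidate: free every path.  Returns the number freed. */
int path_flush(struct path **head)
{
	int n = 0;

	assert(head);
	while (*head) {
		struct path *victim = *head;

		*head = victim->next;
		free(victim);
		n++;
	}
	return n;
}
```

The grep output above corresponds to path_add and path_flush only; the
missing piece is something like path_remove for a single stale entry.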

-- 
MST


From tziporet at dev.mellanox.co.il  Wed Jul 18 00:34:52 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 10:34:52 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070717214417.GE17168@mellanox.co.il>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com>	<adazm1urji8.fsf@cisco.com>
	<20070717214417.GE17168@mellanox.co.il>
Message-ID: <469DC29C.3070205@mellanox.co.il>

Michael S. Tsirkin wrote:
> We have the patches applied in ofed 1.2.c with default module parameter set to
> caching disabled (ofed 1.2 had a different version of the patches, but caching
> is disabled by default there, too). At least in this configuration
> (caching disabled), all issues I've seen seem to be fixed now, and tests seem to
> be running smoothly.
>   
As far as I know Intel run with SA cache enabled on large clusters with 
Intel MPI

Tziporet


From mst at dev.mellanox.co.il  Wed Jul 18 00:38:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 10:38:31 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <469DC29C.3070205@mellanox.co.il>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<adazm1urji8.fsf@cisco.com> <20070717214417.GE17168@mellanox.co.il>
	<469DC29C.3070205@mellanox.co.il>
Message-ID: <20070718073831.GE1115@mellanox.co.il>

> Quoting Tziporet Koren <tziporet at dev.mellanox.co.il>:
> Subject: Re: [ofa-general] Re: Further 2.6.23 merge plans...
> 
> Michael S. Tsirkin wrote:
> >We have the patches applied in ofed 1.2.c with default module parameter set
> >to caching disabled (ofed 1.2 had a different version of the patches, but
> >caching is disabled by default there, too). At least
> >in this configuration (caching disabled), all issues I've seen seem to be
> >fixed now, and tests seem to be running smoothly.
>
> As far as I know Intel run with SA cache enabled on large clusters with 
> Intel MPI

With OFED 1.2 version of the code, right?

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 18 00:46:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 10:46:32 +0300
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To: <aday7hfrkpu.fsf@cisco.com>
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
Message-ID: <20070718074632.GF1115@mellanox.co.il>

> +		ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
> +					   DMA_FROM_DEVICE);
> +		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
> +						 wc->byte_len - IB_GRH_BYTES);
> +		ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
> +					      DMA_FROM_DEVICE);

BTW, why is ib_dma_sync_single_for_device necessary here?

-- 
MST


From tziporet at dev.mellanox.co.il  Wed Jul 18 01:48:58 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 11:48:58 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070718073831.GE1115@mellanox.co.il>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com>	<adazm1urji8.fsf@cisco.com>
	<20070717214417.GE17168@mellanox.co.il>	<469DC29C.3070205@mellanox.co.il>
	<20070718073831.GE1115@mellanox.co.il>
Message-ID: <469DD3FA.305@mellanox.co.il>

Michael S. Tsirkin wrote:
>> As far as I know Intel run with SA cache enabled on large clusters with
>> Intel MPI
>>     
>
> With OFED 1.2 version of the code, right?
>
>   
Yes.
But maybe they also used the new module - Sean?


From ogerlitz at voltaire.com  Wed Jul 18 01:50:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 11:50:27 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469CEC8F.4050106@ichips.intel.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>	<20070715094145.GA16231@mellanox.co.il>	<469B3286.3060902@voltaire.com>	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com>
	<469C9453.80905@voltaire.com> <469CEC8F.4050106@ichips.intel.com>
Message-ID: <469DD453.8010700@voltaire.com>

Sean Hefty wrote:
>> Can you explain why would not the IB CM use the thread context 
>> provided by the mad layer?

> You can end up with deadlock conditions when destroying cm_id's that 
> have outstanding MADs.  It also increases MAD processing time, which can 
> increase dropping MADs.

OK, thanks for the clarification.

>> Second, if the CM needs a different context why not use the system
>> threads? I understood from Michael's reply that the CM code relies on 
>> some thread/queue flushing at the time of CM ID destruction, is it an 
>> implementation issue that can change? if not, can't one dedicated 
>> thread do the job?

> The timing and use of the system threads is unknown.  When the ib_mad 
> module was created, it was suggested that the system threads not be 
> used.  (I think it was Roland who recommended this.)  We can change to 
> system threads, but it does open the possibility of complicated deadlock 
> conditions if other modules use the system threads as well.

I know that for reasons such as the timing and use which you mention, 
people tend not to use the system threads for their fast path tasks. 
As for the possibility of deadlock b/c of system threads usage, this is 
an argument which I like less (...), e.g. the network stack does well on 
its deadlock avoidance code without spawning dedicated threads.

Is it all about the net stack using softirqs where the ib stack 
needs threads for its control path (because of the usage of commands for 
IB resource create/modify/destroy)? If this is the case, do you think it 
justifies spawning a thread per CPU?

Or.


From vlad at lists.openfabrics.org  Wed Jul 18 01:51:45 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 18 Jul 2007 01:51:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070718-0100 daily build status
Message-ID: <20070718085145.4D6DDE60B76@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From ogerlitz at voltaire.com  Wed Jul 18 02:04:59 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 12:04:59 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070718072841.GC1115@mellanox.co.il>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com>	<4696D1F3.2040507@ichips.intel.com>	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>	<adaabtuo0n9.fsf@cisco.com>	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>	<20070718050928.GA3103@obsidianresearch.com>	<ada7ioymfsw.fsf@cisco.com>
	<20070718072841.GC1115@mellanox.co.il>
Message-ID: <469DD7BB.6060009@voltaire.com>

Michael S. Tsirkin wrote:
>> And ARP table aging gives a way to recover
>> from stale cached data, eventually at least.
> 
> Does it?
> 
> $ grep path_list drivers/infiniband/ulp/ipoib/*c
> drivers/infiniband/ulp/ipoib/ipoib_main.c:      list_add_tail(&path->list, &priv->path_list);
> drivers/infiniband/ulp/ipoib/ipoib_main.c:      list_splice(&priv->path_list, &remove_list);
> drivers/infiniband/ulp/ipoib/ipoib_main.c:      INIT_LIST_HEAD(&priv->path_list);
> drivers/infiniband/ulp/ipoib/ipoib_main.c:      INIT_LIST_HEAD(&priv->path_list);
> 
> In other words we add paths to ipoib specific cache, but we never seem
> to *remove* individual paths from cache - we only know how to do
> full cache invalidates on events such as port state change.
> 
> Right?

This seems like a bug: if the stack decided to delete OR change a 
neighbour, the path associated with it must not be re-used to create the 
address handle or to establish the connection. The same goes for multicast 
neighbours.

Or.


From vlad at lists.openfabrics.org  Wed Jul 18 02:45:39 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 18 Jul 2007 02:45:39 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070718-0200 daily build status
Message-ID: <20070718094539.DF301E60825@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22-rc7
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From mst at dev.mellanox.co.il  Wed Jul 18 02:55:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 12:55:31 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread
	name
In-Reply-To: <469DD453.8010700@voltaire.com>
References: <Pine.LNX.4.64.0707110840560.15887@zuben>
	<20070715094145.GA16231@mellanox.co.il>
	<469B3286.3060902@voltaire.com>
	<20070716115911.GA3379@mellanox.co.il>
	<469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com>
	<469C9453.80905@voltaire.com> <469CEC8F.4050106@ichips.intel.com>
	<469DD453.8010700@voltaire.com>
Message-ID: <20070718095531.GH1115@mellanox.co.il>

> Is it all about that the net stack uses softirqs where the ib stack 
> needs threads for its control path (as of the usage of commands for IB 
> resource create/modify/destroy).

Yes.

> If this is the case, do you think it 
> justifies spawning thread per CPU?

Thread per CPU is really the default. It's the single-threaded WQs
that need justification :)


-- 
MST


From sashak at voltaire.com  Wed Jul 18 03:31:11 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Jul 2007 13:31:11 +0300
Subject: [ofa-general] svn converted repos moved away
Message-ID: <20070718103111.GN31073@sashak.voltaire.com>

Hi,

I moved the repos where the original svn to git conversion was done to a
private place. I guess that nobody needs them anymore. Please let me know
if I'm wrong.

Sasha


From dotanb at dev.mellanox.co.il  Wed Jul 18 04:21:04 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Wed, 18 Jul 2007 14:21:04 +0300
Subject: [ofa-general] [PATCH] core/iwcm: Remove local write permission
	enable in QP access flags
Message-ID: <200707181421.04336.dotanb@dev.mellanox.co.il>

Remove local write permission enable in QP access flags
(this attribute is being used only for remote connections).

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
index 223b1aa..44b0a3d 100644
--- a/drivers/infiniband/core/iwcm.c
+++ b/drivers/infiniband/core/iwcm.c
@@ -941,8 +941,7 @@ static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv,
 	case IW_CM_STATE_CONN_RECV:
 	case IW_CM_STATE_ESTABLISHED:
 		*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
-		qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE |
-					   IB_ACCESS_REMOTE_WRITE|
+		qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE |
 					   IB_ACCESS_REMOTE_READ;
 		ret = 0;
 		break;


From tziporet at dev.mellanox.co.il  Wed Jul 18 04:50:27 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 14:50:27 +0300
Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today
In-Reply-To: <f0e08f230707171533p71a9a020ief253303facb51d8@mail.gmail.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
	<f0e08f230707171533p71a9a020ief253303facb51d8@mail.gmail.com>
Message-ID: <469DFE83.4030507@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Tziporet,
>
>
> What happened to ibsim  ?  I thought that was on the list I originally 
> sent.
It was, but Sasha told me it's not actually part of OFED. If it is, no 
problem to add it again.

Tziporet


From eli at mellanox.co.il  Wed Jul 18 05:25:52 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 18 Jul 2007 15:25:52 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <aday7hfrkpu.fsf@cisco.com>
References: <1183643723.25031.262.camel@mtls03>  <aday7hfrkpu.fsf@cisco.com>
Message-ID: <1184761552.3520.9.camel@mtls03>

I ran some experiments with iperf running in CM mode with TCP sockets. I
can see that there is no bad effect on BW (excel file attached). We did
see a slight improvement in packet loss in UDP mode with an application
supplied by a customer.


Copy small received packets to newly allocated SKBs just
big enough to contain the packet. This will relieve accounting
done on the socket so that a smaller size is used.

Signed-off-by: Eli Cohen <eli at mellanox.co.il>

---

Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-17 15:41:29.000000000 +0300
+++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-18 09:34:49.000000000 +0300
@@ -651,4 +651,7 @@
 
 #define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
 
+#define SKB_LEN_THOLD 256
+#define CM_SKB_LEN_THOLD min(SKB_LEN_THOLD, IPOIB_CM_HEAD_SIZE)
+
 #endif /* _IPOIB_H */
Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 15:41:29.000000000 +0300
+++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-18 10:46:54.000000000 +0300
@@ -452,26 +452,40 @@
 
 	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
 					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
-
-	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping);
-	if (unlikely(!newskb)) {
-		/*
-		 * If we can't allocate a new RX buffer, dump
-		 * this packet and reuse the old buffer.
-		 */
-		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
-		++priv->stats.rx_dropped;
-		goto repost;
+	if (wc->byte_len < CM_SKB_LEN_THOLD) {
+		newskb = dev_alloc_skb(wc->byte_len);
+		if (!newskb)
+			ipoib_warn(priv, "failed to allocate skb\n");
+
+		ib_dma_sync_single_for_cpu(priv->ca, priv->cm.srq_ring[wr_id].mapping[0],
+					   IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE);
+		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
+						 wc->byte_len - IB_GRH_BYTES);
+		ib_dma_sync_single_for_device(priv->ca, priv->cm.srq_ring[wr_id].mapping[0],
+					      IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE);
+
+		skb_put(newskb, wc->byte_len);
+		skb = newskb;
+	}
+	else {
+		newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping);
+		if (unlikely(!newskb)) {
+			/*
+			 * If we can't allocate a new RX buffer, dump
+			 * this packet and reuse the old buffer.
+			 */
+			ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+			++priv->stats.rx_dropped;
+			goto repost;
+		}
+		ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
+		memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
+		skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
 	}
 
-	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
-	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
-
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
-	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
-
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
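
For readers who want the idea without the kernel plumbing, the copy-break decision can be sketched in plain user-space C. This is only an illustration under assumptions: `rx_buf`, `rx_small_copy`, `rx_buf_free`, `COPY_THRESHOLD` and `RX_BUF_SIZE` are hypothetical names, `malloc` stands in for `dev_alloc_skb`, and the sketch simply returns NULL when allocation fails so that the caller falls back to the large buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COPY_THRESHOLD 256      /* hypothetical, plays the role of SKB_LEN_THOLD */
#define RX_BUF_SIZE    2048     /* hypothetical size of the posted receive buffer */

struct rx_buf {
	unsigned char *data;    /* start of the packet bytes */
	size_t len;             /* bytes of valid data */
	size_t truesize;        /* bytes allocated, i.e. what the socket is charged */
};

/*
 * Return a right-sized copy of a small packet, or NULL if the packet is
 * at or above the threshold or the allocation fails. On NULL the caller
 * keeps using the original large buffer.
 */
static struct rx_buf *rx_small_copy(const struct rx_buf *big)
{
	struct rx_buf *small;

	assert(big && big->len <= big->truesize);
	if (big->len >= COPY_THRESHOLD)
		return NULL;

	small = malloc(sizeof(*small));
	if (!small)
		return NULL;

	/* Avoid malloc(0), whose result is implementation-defined. */
	small->data = malloc(big->len ? big->len : 1);
	if (!small->data) {
		free(small);
		return NULL;
	}

	memcpy(small->data, big->data, big->len);
	small->len = big->len;
	small->truesize = big->len;     /* charge only what the packet needs */
	return small;
}

/* Release a buffer returned by rx_small_copy(); NULL is accepted. */
static void rx_buf_free(struct rx_buf *b)
{
	if (b) {
		free(b->data);
		free(b);
	}
}
```

With a 100-byte packet sitting in a 2048-byte receive buffer, the copy is charged 100 bytes instead of 2048, and the large buffer can be reposted unchanged.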


From eli at mellanox.co.il  Wed Jul 18 05:27:12 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Wed, 18 Jul 2007 15:27:12 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1184761552.3520.9.camel@mtls03>
References: <1183643723.25031.262.camel@mtls03>  <aday7hfrkpu.fsf@cisco.com>
	<1184761552.3520.9.camel@mtls03>
Message-ID: <1184761632.3520.11.camel@mtls03>

Attaching the file

-------------- next part --------------
A non-text attachment was scrubbed...
Name: skb_vs_noskb_patch.xls
Type: application/vnd.ms-excel
Size: 17408 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070718/5f24e572/attachment.xls>

From hal.rosenstock at gmail.com  Wed Jul 18 05:43:57 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Jul 2007 08:43:57 -0400
Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today
In-Reply-To: <469DFE83.4030507@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
	<f0e08f230707171533p71a9a020ief253303facb51d8@mail.gmail.com>
	<469DFE83.4030507@mellanox.co.il>
Message-ID: <f0e08f230707180543x31bf3748pba4304f98e68097f@mail.gmail.com>

On 7/18/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
>
> Hal Rosenstock wrote:
> > Hi Tziporet,
> >
> >
> > What happened to ibsim  ?  I thought that was on the list I originally
> > sent.
> It was but Sasha told me its not actually part of OFED.


That was the case for OFED 1.2 but it was proposed to add it for OFED 1.3.

> If it is no
> problem to add it again


Could this please be done? Thanks.

-- Hal

> Tziporet

From mst at dev.mellanox.co.il  Wed Jul 18 06:36:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 16:36:31 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <adabqek9i2g.fsf@cisco.com>
References: <adalkdpxopo.fsf@cisco.com> <20070709213913.GB20052@mellanox.co.il>
	<adamyy4vjo9.fsf@cisco.com> <20070710071547.GA3814@mellanox.co.il>
	<adabqekuvde.fsf@cisco.com> <20070710171142.GC11320@mellanox.co.il>
	<ada3azwb076.fsf@cisco.com> <20070710183006.GE11320@mellanox.co.il>
	<adabqek9i2g.fsf@cisco.com>
Message-ID: <20070718133630.GF17765@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: mthca use of dma_sync_single is bogus
> 
>  > Hmm. This means there's no way to sync a range within
>  > mapping created with map_sg?
> 
> It doesn't seem that there is one right now at least.
> 
>  > > It actually doesn't look too bad to replace our use of pci_map_sg()
>  > > with dma_map_single(), at least at first glance.  I'll try to write a
>  > > patch later.
>  > 
>  > Well, the reason map_sg is there is presumably because on some
>  > architectures it's worth it to try and make the region contiguous in DMA space.
>  > But I agree this seems the lesser evil at this point ...
> 
> Given that we're already trying to allocate big chunks of physically
> contiguous memory, I think that any virtual merging we get is likely
> to be of very small benefit.
> 
> It is kind of a shame to give this up though.

Did we reach any conclusion? Are you switching to map_single?

-- 
MST


From bramesh at vt.edu  Wed Jul 18 07:08:03 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Wed, 18 Jul 2007 10:08:03 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To: <469DB5FE.1010206@mellanox.co.il>
References: <20070717165553.GA10298@vt.edu> <ada7ioyrdgy.fsf@cisco.com>
	<20070717203428.GA12927@vt.edu> <aday7hepxoa.fsf@cisco.com>
	<469DB5FE.1010206@mellanox.co.il>
Message-ID: <20070718140803.GA31599@vt.edu>

* Tziporet Koren (tziporet at dev.mellanox.co.il) wrote:
> Roland Dreier wrote:
>>  > Thanks for replying to my mail. I have some basic understanding of IB. I
>>  > have gone through some of the example code in the example directory and
>>  > the OFED performance tests. I noticed that every one of those examples used
>>  > TCP to exchange information regarding lid, psn and qpn. My question is
>>  > basically whether there is any other way to exchange this information
>>  > using only IB. Since no hardware supports RD, I have to bite the bullet and
>>  > use RC.
>>   
> Also, the rdma_lat test (under ~mst/perftest.git) has a -c option
> that opens the connection using rdmacm.
>> Look at librdmacm (or libibcm).  They provide higher-level
>> abstractions for connection establishment.
> There is an example there too: rping, which opens an RC connection with
> librdmacm.
>
> Tziporet
>

Thanks for pointing out the various examples. Really appreciate it.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh


From rdreier at cisco.com  Wed Jul 18 08:12:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 08:12:18 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070718133630.GF17765@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 18 Jul 2007 16:36:31 +0300")
References: <adalkdpxopo.fsf@cisco.com>
	<20070709213913.GB20052@mellanox.co.il> <adamyy4vjo9.fsf@cisco.com>
	<20070710071547.GA3814@mellanox.co.il> <adabqekuvde.fsf@cisco.com>
	<20070710171142.GC11320@mellanox.co.il> <ada3azwb076.fsf@cisco.com>
	<20070710183006.GE11320@mellanox.co.il> <adabqek9i2g.fsf@cisco.com>
	<20070718133630.GF17765@mellanox.co.il>
Message-ID: <aday7hdlp9p.fsf@cisco.com>

 > Did we reach any conclusion? Are you switching to map_single?

I haven't had a chance to work on it yet, but I don't see a better alternative.


From rdreier at cisco.com  Wed Jul 18 08:10:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 08:10:47 -0700
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To: <20070718074632.GF1115@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 18 Jul 2007 10:46:32 +0300")
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
	<20070718074632.GF1115@mellanox.co.il>
Message-ID: <ada3azln3wo.fsf@cisco.com>

 > > +		ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
 > > +					   DMA_FROM_DEVICE);
 > > +		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
 > > +						 wc->byte_len - IB_GRH_BYTES);
 > > +		ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
 > > +					      DMA_FROM_DEVICE);
 > 
 > BTW, why is ib_dma_sync_single_for_device necessary here?

Not sure what you're asking exactly.  The sync for device is needed to
match the previous sync for the cpu obviously.  We need both syncs for
the same reason we need the unmap when we don't copy -- we're copying
data out of the skb we gave to the device earlier, so we need to make
sure the cpu sees the right data.
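
As a toy illustration of that pairing, here is a user-space model. It is not kernel code and calls none of the real DMA API; `dma_buf_model`, `model_sync_for_cpu`, `model_sync_for_device` and `model_copy_out` are hypothetical names that only track who is allowed to touch a mapped buffer at a given moment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum owner { OWNER_DEVICE, OWNER_CPU };

struct dma_buf_model {
	unsigned char data[64];
	enum owner owner;       /* who may touch data right now */
};

/* Hand the buffer to the CPU so that reads see what the device wrote. */
static void model_sync_for_cpu(struct dma_buf_model *b)
{
	assert(b->owner == OWNER_DEVICE);
	b->owner = OWNER_CPU;
}

/* Hand the buffer back so it can be reposted for the next receive. */
static void model_sync_for_device(struct dma_buf_model *b)
{
	assert(b->owner == OWNER_CPU);
	b->owner = OWNER_DEVICE;
}

/*
 * Copy len bytes out of the buffer. Only legal while the CPU owns it.
 * Returns 0 on success and -1 if the copy is not allowed.
 */
static int model_copy_out(const struct dma_buf_model *b, void *dst, size_t len)
{
	if (b->owner != OWNER_CPU || len > sizeof(b->data))
		return -1;
	memcpy(dst, b->data, len);
	return 0;
}
```

In this model the copy is bracketed by the two sync calls, and the buffer stays mapped the whole time, which is the difference from the unmap done on the non-copy path.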


From sean.hefty at intel.com  Wed Jul 18 09:16:52 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 09:16:52 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <469DD3FA.305@mellanox.co.il>
Message-ID: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com>

>> With OFED 1.2 version of the code, right?
>>
>>
>Yes.
>But maybe they also used the new module - Sean?

We actually use the OFED 1.2 version.  So, this feature is in use, but not this
specific implementation.

- Sean


From rdreier at cisco.com  Wed Jul 18 09:20:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 09:20:21 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com> (Sean Hefty's
	message of "Wed, 18 Jul 2007 09:16:52 -0700")
References: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com>
Message-ID: <adasl7llm4a.fsf@cisco.com>

 > We actually use the OFED 1.2 version.  So, this feature is in use, but not this
 > specific implementation.

Hmm... how much testing has the implementation being proposed for
merging actually had?

It might still be OK if the answer is that it hasn't been tested at
scale but that the basic code works and should behave the same as the
code that was tested because the underlying design is the same... is
at least that much true?


From eitan at mellanox.co.il  Wed Jul 18 09:28:04 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 18 Jul 2007 19:28:04 +0300
Subject: [ofa-general] [PATCH] opensm: Bug in coding of VL Arbitration tables
Message-ID: <86ir8h3cdn.fsf@sw053.lab.mtl.com>

Hi Sasha

I found a bug in the encoding of the VL Arbitration table "index".
According to the spec it should be:
 1 for low part of low table
 2 for high part of low table
 3 for low part of high table
 4 for high part of high table

The patch below fixes it:

Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index bbb1608..413e200 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -116,14 +116,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 		    p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
 						       &qcfg->vlarb_low[0],
-						       len, 0)) != IB_SUCCESS)
+						       len, 1)) != IB_SUCCESS)
 			return status;
 	}
 	if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
 		len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
 						       &qcfg->vlarb_low[1],
-						       len, 1)) != IB_SUCCESS)
+						       len, 2)) != IB_SUCCESS)
 			return status;
 	}
 	if (p_pi->vl_arb_high_cap > 0) {
@@ -131,14 +131,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 		    p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
 						       &qcfg->vlarb_high[0],
-						       len, 2)) != IB_SUCCESS)
+						       len, 3)) != IB_SUCCESS)
 			return status;
 	}
 	if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
 		len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
 						       &qcfg->vlarb_high[1],
-						       len, 3)) != IB_SUCCESS)
+						       len, 4)) != IB_SUCCESS)
 			return status;
 	}
 

From eitan at mellanox.co.il  Wed Jul 18 09:31:49 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 18 Jul 2007 19:31:49 +0300
Subject: [ofa-general] [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
Message-ID: <86hco13c7e.fsf@sw053.lab.mtl.com>

Hi Sasha

During QoS setup the code tried to send updates of vl_arb_high_limit
via a req_set of PORT_INFO carrying the new data. However, at that
stage the SM has not yet assigned LIDs to the ports, so the
PortInfo.base_lid that was sent was still zero. The specification does
not allow such LIDs (they are considered illegal).

The patch below fixes this by storing the calculated value and using it
later in the link and lid managers.

Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index 54ebcfc..5032b1b 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -117,6 +117,7 @@ typedef struct _osm_physp
   struct _osm_node		*p_node;
   struct _osm_physp		*p_remote_physp;
   boolean_t                      healthy;
+  uint8_t                vl_high_limit;
   osm_dr_path_t			 dr_path;
   osm_pkey_tbl_t                 pkeys;
   ib_vl_arb_table_t              vl_arb[4];
diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index bc3f8b3..ed76382 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
            ib_port_info_get_port_state(p_old_pi) )
         send_set = TRUE;
     }
+
+	 /* provide the vl_high_limit from the qos mgr */
+	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
+		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
+		 {
+			 send_set = TRUE;
+			 p_pi->vl_high_limit = p_physp->vl_high_limit;
+		 }
   }
   else
   {
diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index 25f0fc3..3781fd2 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -354,6 +354,15 @@ __osm_link_mgr_set_physp_pi(
       context.pi_context.active_transition = FALSE;
   }
 
+  /* provide the vl_high_limit from the qos mgr */
+  if (p_mgr->p_subn->opt.no_qos == FALSE) 
+	  if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
+	  {
+		  send_set = TRUE;
+		  p_pi->vl_high_limit = p_physp->vl_high_limit;
+	  }	  
+  
+  
   context.pi_context.node_guid = osm_node_get_node_guid( p_node );
   context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
   context.pi_context.set_method = TRUE;
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index bbb1608..413e200 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -216,42 +216,6 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 	return IB_SUCCESS;
 }
 
-static ib_api_status_t vl_high_limit_update(osm_req_t * p_req,
-					    osm_physp_t * p,
-					    const struct qos_config *qcfg)
-{
-	uint8_t payload[IB_SMP_DATA_SIZE];
-	osm_madw_context_t context;
-	ib_port_info_t *p_pi;
-
-	p_pi = &p->port_info;
-
-	if (p_pi->vl_high_limit == qcfg->vl_high_limit)
-		return IB_SUCCESS;
-
-	memset(payload, 0, IB_SMP_DATA_SIZE);
-	memcpy(payload, p_pi, sizeof(ib_port_info_t));
-
-	p_pi = (ib_port_info_t *) payload;
-	ib_port_info_set_state_no_change(p_pi);
-
-	p_pi->vl_high_limit = qcfg->vl_high_limit;
-
-	context.pi_context.node_guid =
-	    osm_node_get_node_guid(osm_physp_get_node_ptr(p));
-	context.pi_context.port_guid = osm_physp_get_port_guid(p);
-	context.pi_context.set_method = TRUE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
-	context.pi_context.ignore_errors = FALSE;
-	context.pi_context.light_sweep = FALSE;
-	context.pi_context.active_transition = FALSE;
-
-	return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p),
-			   payload, sizeof(payload), IB_MAD_ATTR_PORT_INFO,
-			   cl_hton32(osm_physp_get_port_num(p)),
-			   CL_DISP_MSGID_NONE, &context);
-}
-
 static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
 				       osm_port_t * p_port, osm_physp_t * p,
 				       uint8_t port_num,
@@ -261,16 +225,8 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
 
 	/* OpVLs should be ok at this moment - just use it */
 
-	/* setup VL high limit */
-	status = vl_high_limit_update(p_req, p, qcfg);
-	if (status != IB_SUCCESS) {
-		osm_log(p_log, OSM_LOG_ERROR,
-			"qos_physp_setup: ERR 6201 : "
-			"failed to update VLHighLimit "
-			"for port %" PRIx64 " #%d\n",
-			cl_ntoh64(p->port_guid), port_num);
-		return status;
-	}
+	/* setup VL high limit on the physp later to be updated by lid/link mgrs */
+	p->vl_high_limit = qcfg->vl_high_limit;
 
 	/* setup VLArbitration */
 	status = vlarb_update(p_req, p, port_num, qcfg);
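
The store-now, send-later pattern used by the patch can be sketched in isolation. This is a simplified model under assumptions, not OpenSM code: `port_info_model`, `physp_model`, `qos_cache_high_limit` and `lid_mgr_fill_pi` are hypothetical names, and only the two fields relevant here are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct port_info_model {
	uint16_t base_lid;
	uint8_t vl_high_limit;
};

struct physp_model {
	struct port_info_model port_info;  /* last PortInfo read from the port */
	uint8_t vl_high_limit;             /* value cached by the QoS pass */
};

/* QoS pass: only remember the wanted value, send nothing yet. */
static void qos_cache_high_limit(struct physp_model *p, uint8_t wanted)
{
	p->vl_high_limit = wanted;
}

/*
 * LID/link manager pass: build the outgoing PortInfo once a valid LID is
 * known. Returns true when a Set actually needs to be sent.
 */
static bool lid_mgr_fill_pi(const struct physp_model *p, uint16_t lid,
			    bool qos_enabled, struct port_info_model *out)
{
	bool send_set = false;

	assert(lid != 0);               /* a zero base LID must never be sent */
	*out = p->port_info;

	if (out->base_lid != lid) {
		out->base_lid = lid;
		send_set = true;
	}
	if (qos_enabled && p->vl_high_limit != p->port_info.vl_high_limit) {
		out->vl_high_limit = p->vl_high_limit;
		send_set = true;
	}
	return send_set;
}
```

The point of the design is that the only PortInfo Set carrying the new limit is the one that also carries a valid base LID, so no separate Set goes out before LID assignment.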


From hal.rosenstock at gmail.com  Wed Jul 18 09:35:49 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Jul 2007 09:35:49 -0700
Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding of VL Arbitration
	tables
In-Reply-To: <86ir8h3cdn.fsf@sw053.lab.mtl.com>
References: <86ir8h3cdn.fsf@sw053.lab.mtl.com>
Message-ID: <f0e08f230707180935m2634b12fud8ca77fc964ee432@mail.gmail.com>

Hi Eitan,

On 7/18/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Sasha
>
> Discovered a bug in coding of the VL Arbitration table "index".
> According to spec should be:
> 1 for low part of low table
> 2 for high part of low table
> 3 for low part of high table
> 4 for high part of high table
>
> the patch below fixes it:
>
> Eitan
>
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
>
> diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
> index bbb1608..413e200 100644
> --- a/opensm/opensm/osm_qos.c
> +++ b/opensm/opensm/osm_qos.c
> @@ -116,14 +116,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
>  		    p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
>  		if ((status = vlarb_update_table_block(p_req, p, port_num,
>  						       &qcfg->vlarb_low[0],
> -						       len, 0)) != IB_SUCCESS)
> +						       len, 1)) != IB_SUCCESS)
>  			return status;
>  	}
>  	if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
>  		len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
>  		if ((status = vlarb_update_table_block(p_req, p, port_num,
>  						       &qcfg->vlarb_low[1],
> -						       len, 1)) != IB_SUCCESS)
> +						       len, 2)) != IB_SUCCESS)
>  			return status;
>  	}
>  	if (p_pi->vl_arb_high_cap > 0) {
> @@ -131,14 +131,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
>  		    p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
>  		if ((status = vlarb_update_table_block(p_req, p, port_num,
>  						       &qcfg->vlarb_high[0],
> -						       len, 2)) != IB_SUCCESS)
> +						       len, 3)) != IB_SUCCESS)
>  			return status;
>  	}
>  	if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
>  		len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
>  		if ((status = vlarb_update_table_block(p_req, p, port_num,
>  						       &qcfg->vlarb_high[1],
> -						       len, 3)) != IB_SUCCESS)
> +						       len, 4)) != IB_SUCCESS)
>  			return status;
>  	}


Are you sure? It looks to me like this is already handled in
vlarb_update_table_block as follows:

        if (!memcmp(&p->vl_arb[block_num], &block,
                     block_length * sizeof(block.vl_entry[0])))
                return IB_SUCCESS;

but

        attr_mod = ((block_num + 1) << 16) | port_num;

        return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p),
                           (uint8_t *) & block, sizeof(block),
                           IB_MAD_ATTR_VL_ARBITRATION,
                           cl_hton32(attr_mod), CL_DISP_MSGID_NONE,
                           &context);
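
To make the arithmetic explicit, here is a small stand-alone sketch of the attribute modifier computed by the `attr_mod` line quoted above. `vlarb_attr_mod` is a hypothetical helper name used only for illustration; the expression itself is the one from the quoted code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build the VLArbitration attribute modifier the way the quoted
 * vlarb_update_table_block() line does: the caller passes a zero-based
 * block_num, the one-based block index lands in the upper 16 bits, and
 * the port number occupies the low bits.
 */
static uint32_t vlarb_attr_mod(uint8_t block_num, uint8_t port_num)
{
	assert(block_num < 4);  /* two low-table blocks, two high-table blocks */
	return ((uint32_t)(block_num + 1) << 16) | port_num;
}
```

Callers that pass 0 through 3 therefore already produce block indices 1 through 4 in the modifier, while the same zero-based value still indexes `vl_arb[]` correctly in the `memcmp`.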

-- Hal

From eitan at mellanox.co.il  Wed Jul 18 09:37:11 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 18 Jul 2007 19:37:11 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding of VL Arbitration
	tables 
References: <86ir8h3cdn.fsf@sw053.lab.mtl.com>
	<f0e08f230707180935m2634b12fud8ca77fc964ee432@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73E68@mtlexch01.mtl.com>

Thanks Hal. Good catch. Should have seen this.
Sorry
 
Eitan




From hal.rosenstock at gmail.com  Wed Jul 18 09:55:53 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 18 Jul 2007 09:55:53 -0700
Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
Message-ID: <f0e08f230707180955l15053f45s77620860ad098ee1@mail.gmail.com>

Hi again Eitan,

On 7/18/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Sasha
>
> When QoS setup is done the code was trying to send updates of
> vl_arb_high_limit by req_set of PORT_INFO with the new data.
> However, at that stage the SM still did not assign LIDs to the ports.
> So the sent PortInfo.base_lid was still zero. The specification does not
> allow for such LIDs (they are considered ilegal).


Doesn't that really depend on the PortState? The LID (and SMLID) need to
be set by ARMED/ACTIVE.

> the patch below fixes this by storing the calculated value and later
> using it in link and lid managers.


It's probably better to defer the setting as this patch appears to do.

-- Hal

> Eitan
>
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
>
> diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
> index 54ebcfc..5032b1b 100644
> --- a/opensm/include/opensm/osm_port.h
> +++ b/opensm/include/opensm/osm_port.h
> @@ -117,6 +117,7 @@ typedef struct _osm_physp
>    struct _osm_node             *p_node;
>    struct _osm_physp            *p_remote_physp;
>    boolean_t                      healthy;
> +  uint8_t                vl_high_limit;
>    osm_dr_path_t                         dr_path;
>    osm_pkey_tbl_t                 pkeys;
>    ib_vl_arb_table_t              vl_arb[4];
> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> index bc3f8b3..ed76382 100644
> --- a/opensm/opensm/osm_lid_mgr.c
> +++ b/opensm/opensm/osm_lid_mgr.c
> @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
>             ib_port_info_get_port_state(p_old_pi) )
>          send_set = TRUE;
>      }
> +
> +        /* provide the vl_high_limit from the qos mgr */
> +        if (p_mgr->p_subn->opt.no_qos == FALSE)
> +                if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> +                {
> +                        send_set = TRUE;
> +                        p_pi->vl_high_limit = p_physp->vl_high_limit;
> +                }
>    }
>    else
>    {
> diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
> index 25f0fc3..3781fd2 100644
> --- a/opensm/opensm/osm_link_mgr.c
> +++ b/opensm/opensm/osm_link_mgr.c
> @@ -354,6 +354,15 @@ __osm_link_mgr_set_physp_pi(
>        context.pi_context.active_transition = FALSE;
>    }
>
> +  /* provide the vl_high_limit from the qos mgr */
> +  if (p_mgr->p_subn->opt.no_qos == FALSE)
> +         if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> +         {
> +                 send_set = TRUE;
> +                 p_pi->vl_high_limit = p_physp->vl_high_limit;
> +         }
> +
> +
>    context.pi_context.node_guid = osm_node_get_node_guid( p_node );
>    context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
>    context.pi_context.set_method = TRUE;
> diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
> index bbb1608..413e200 100644
> --- a/opensm/opensm/osm_qos.c
> +++ b/opensm/opensm/osm_qos.c
> @@ -216,42 +216,6 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
>         return IB_SUCCESS;
> }
>
> -static ib_api_status_t vl_high_limit_update(osm_req_t * p_req,
> -                                           osm_physp_t * p,
> -                                           const struct qos_config *qcfg)
> -{
> -       uint8_t payload[IB_SMP_DATA_SIZE];
> -       osm_madw_context_t context;
> -       ib_port_info_t *p_pi;
> -
> -       p_pi = &p->port_info;
> -
> -       if (p_pi->vl_high_limit == qcfg->vl_high_limit)
> -               return IB_SUCCESS;
> -
> -       memset(payload, 0, IB_SMP_DATA_SIZE);
> -       memcpy(payload, p_pi, sizeof(ib_port_info_t));
> -
> -       p_pi = (ib_port_info_t *) payload;
> -       ib_port_info_set_state_no_change(p_pi);
> -
> -       p_pi->vl_high_limit = qcfg->vl_high_limit;
> -
> -       context.pi_context.node_guid =
> -           osm_node_get_node_guid(osm_physp_get_node_ptr(p));
> -       context.pi_context.port_guid = osm_physp_get_port_guid(p);
> -       context.pi_context.set_method = TRUE;
> -       context.pi_context.update_master_sm_base_lid = FALSE;
> -       context.pi_context.ignore_errors = FALSE;
> -       context.pi_context.light_sweep = FALSE;
> -       context.pi_context.active_transition = FALSE;
> -
> -       return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p),
> -                          payload, sizeof(payload), IB_MAD_ATTR_PORT_INFO,
> -                          cl_hton32(osm_physp_get_port_num(p)),
> -                          CL_DISP_MSGID_NONE, &context);
> -}
> -
> static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
>                                        osm_port_t * p_port, osm_physp_t * p,
>                                        uint8_t port_num,
> @@ -261,16 +225,8 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
>
>         /* OpVLs should be ok at this moment - just use it */
>
> -       /* setup VL high limit */
> -       status = vl_high_limit_update(p_req, p, qcfg);
> -       if (status != IB_SUCCESS) {
> -               osm_log(p_log, OSM_LOG_ERROR,
> -                       "qos_physp_setup: ERR 6201 : "
> -                       "failed to update VLHighLimit "
> -                       "for port %" PRIx64 " #%d\n",
> -                       cl_ntoh64(p->port_guid), port_num);
> -               return status;
> -       }
> +       /* setup VL high limit on the physp later to be updated by lid/link mgrs */
> +       p->vl_high_limit = qcfg->vl_high_limit;
>
>         /* setup VLArbitration */
>         status = vlarb_update(p_req, p, port_num, qcfg);
>
>

From mshefty at ichips.intel.com  Wed Jul 18 10:05:58 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 10:05:58 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com>
References: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com>
Message-ID: <469E4876.7020805@ichips.intel.com>

> We will have a better idea of the issues and possible solutions once the QoS
> spec is released, and we can hold discussions on it.  I will be working more
> details on QoS enhancements starting in the next couple of weeks. 

Based on discussions so far, maybe the best path forward from here is to
delay until 2.6.24.  This will let us add this version to OFED 1.3 for
more widespread testing, plus give us the time that we need to come up
with a plan to integrate QoS with the local SA.  I don't think we'll
have a final implementation for QoS support by that time, but at least
we'll have a better idea of the problems.

These patches are based on the same design used with OFED 1.2, but a
fair number of lines of code still changed, plus it added InformInfo 
registration.  I don't believe anyone other than me has tested these 
patches with the local SA enabled.  It's typically running on my 
systems, but because it automatically fails over to standard SA queries, 
it would be easy for me to miss problems.

- Sean


From pradeeps at linux.vnet.ibm.com  Wed Jul 18 10:23:46 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Jul 2007 10:23:46 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) [PATCH V7] patch resubmit
Message-ID: <469E4CA2.2040708@linux.vnet.ibm.com>

Resubmitting the 7th version of the patch. I changed the settings
in my mail client, so there should be no line wraps, and the
whitespace mangling has been fixed.

Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---
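A note on the work-request id layout that the NOSRQ path in the patch below relies on. The sketch is a standalone reduction for illustration only: the helper names (`nosrq_pack_id` and friends) are hypothetical, since the patch open-codes these expressions, and the flag is written with `1ull` here (the patch uses `1ul`) so the sketch behaves the same on 32-bit hosts. The low bits select the entry in `rx_index_table`, bit 30 marks a CM receive, and the upper 32 bits carry the slot in that connection's `rx_ring`.

```c
#include <stdint.h>

/* Constants as in the patch, except for the 1ull noted above. */
#define IPOIB_CM_OP_RECV        (1ull << 30)
#define NOSRQ_INDEX_TABLE_SIZE  128
#define NOSRQ_INDEX_MASK        (NOSRQ_INDEX_TABLE_SIZE - 1)

/* Hypothetical helper: combine ring slot and table index (i << 32 | index). */
uint64_t nosrq_pack_id(uint32_t slot, uint32_t index)
{
	return ((uint64_t)slot << 32) | (index & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: the value stored in rx_wr.wr_id when posting. */
uint64_t nosrq_post_id(uint64_t id)
{
	return id | IPOIB_CM_OP_RECV;
}

/* Hypothetical helper: ring slot recovered from a completion's wr_id. */
uint32_t nosrq_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}

/* Hypothetical helper: rx_index_table entry recovered from a wr_id. */
uint32_t nosrq_index(uint64_t wr_id)
{
	return (uint32_t)((wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK);
}
```

Because the index occupies bits 0 to 6 and the flag sits at bit 30, the three fields do not overlap, so ipoib_poll() can test the flag without disturbing the slot or the index.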

--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-17 19:21:46.000000000 -0400
@@ -95,11 +95,16 @@ enum {
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
 };
 
+#define CM_PACKET_SIZE (1ul << 16)
 #define	IPOIB_OP_RECV   (1ul << 31)
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
-#define	IPOIB_CM_OP_SRQ (1ul << 30)
+#define	IPOIB_CM_OP_RECV (1ul << 30)
+
+#define NOSRQ_INDEX_TABLE_SIZE 128
+#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_TABLE_SIZE -1)
+
 #else
-#define	IPOIB_CM_OP_SRQ (0)
+#define	IPOIB_CM_OP_RECV (0)
 #endif
 
 /* structs */
@@ -166,11 +171,14 @@ enum ipoib_cm_state {
 };
 
 struct ipoib_cm_rx {
-	struct ib_cm_id     *id;
-	struct ib_qp        *qp;
-	struct list_head     list;
-	struct net_device   *dev;
-	unsigned long        jiffies;
+	struct ib_cm_id     	*id;
+	struct ib_qp        	*qp;
+	struct ipoib_cm_rx_buf  *rx_ring; /* Used by NOSRQ only */
+	struct list_head     	 list;
+	struct net_device   	*dev;
+	unsigned long        	 jiffies;
+	u32                      index; /* wr_ids are distinguished by index
+					 * to identify the QP -NOSRQ only */
 	enum ipoib_cm_state  state;
 };
 
@@ -215,6 +223,8 @@ struct ipoib_cm_dev_priv {
 	struct ib_wc            ibwc[IPOIB_NUM_WC];
 	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
 	struct ib_recv_wr       rx_wr;
+	struct ipoib_cm_rx	**rx_index_table; /* See ipoib_cm_dev_init()
+						   *for usage of this element */
 };
 
 /*
@@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long
 	dev_kfree_skb_any(skb);
 }
 
-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 }
-
 #endif
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-10 17:02:33.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 20:53:19.000000000 -0400
@@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level,
 
 #include "ipoib.h"
 
+int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
+int max_recv_buf = 1024; /* Default is 1024 MB */
+
+module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
+MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
+
+module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
+MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
+
+atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */
+
 #define IPOIB_CM_IETF_ID 0x1000000000000000ULL
 
 #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ)
@@ -81,20 +92,20 @@ static void ipoib_cm_dma_unmap_rx(struct
 		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int post_receive_srq(struct net_device *dev, u64 id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
-	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
 		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 
 	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
 	if (unlikely(ret)) {
-		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
+		ipoib_warn(priv, "post srq failed for buf %llu (%d)\n", (unsigned long long)id, ret);
 		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
 				      priv->cm.srq_ring[id].mapping);
 		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
@@ -104,12 +115,47 @@ static int ipoib_cm_post_receive(struct 
 	return ret;
 }
 
-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int post_receive_nosrq(struct net_device *dev, u64 id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+	u32 index;
+	u32 wr_id;
+	struct ipoib_cm_rx *rx_ptr;
+
+	index = id  & NOSRQ_INDEX_MASK ;
+	wr_id = id >> 32;
+
+	rx_ptr = priv->cm.rx_index_table[index];
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i];
+
+	ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post recv failed for buf %u (%d)\n",
+			   wr_id, ret);
+		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+				      rx_ptr->rx_ring[wr_id].mapping);
+		dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb);
+		rx_ptr->rx_ring[wr_id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id,
+					     int frags,
 					     u64 mapping[IPOIB_CM_RX_SG])
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	int i;
+	struct ipoib_cm_rx *rx_ptr;
+	u32 index, wr_id;
 
 	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
 	if (unlikely(!skb))
@@ -141,7 +187,14 @@ static struct sk_buff *ipoib_cm_alloc_rx
 			goto partial_error;
 	}
 
-	priv->cm.srq_ring[id].skb = skb;
+	if (priv->cm.srq)
+		priv->cm.srq_ring[id].skb = skb;
+	else {
+		index = id  & NOSRQ_INDEX_MASK ;
+		wr_id = id >> 32;
+		rx_ptr = priv->cm.rx_index_table[index];
+		rx_ptr->rx_ring[wr_id].skb = skb;
+	}
 	return skb;
 
 partial_error:
@@ -198,16 +251,21 @@ static struct ib_qp *ipoib_cm_create_rx_
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* For drain WR */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = 1, /* For drain WR */
+		.cap.max_recv_wr = ipoib_recvq_size + 1,
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
 		.qp_context = p,
 	};
+	if (!priv->cm.srq) {
+		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
+		attr.event_handler = NULL;
+	} else
+		attr.event_handler = ipoib_cm_rx_event_handler;
 	return ib_create_qp(priv->pd, &attr);
 }
 
@@ -282,12 +340,129 @@ static int ipoib_cm_send_rep(struct net_
 	rep.flow_control = 0;
 	rep.rnr_retry_count = req->rnr_retry_count;
 	rep.target_ack_delay = 20; /* FIXME */
-	rep.srq = 1;
 	rep.qp_num = qp->qp_num;
 	rep.starting_psn = psn;
+	rep.srq	= !!priv->cm.srq;
 	return ib_send_cm_rep(cm_id, &rep);
 }
 
+static void init_context_and_add_list(struct ib_cm_id *cm_id,
+				    struct ipoib_cm_rx *p,
+				    struct ipoib_dev_priv *priv)
+{
+	cm_id->context = p;
+	p->jiffies = jiffies;
+	spin_lock_irq(&priv->lock);
+	if (list_empty(&priv->cm.passive_ids))
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
+	if (priv->cm.srq) {
+		/* Add this entry to passive ids list head, but do not re-add
+		 * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush
+		 * list.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+	}
+	spin_unlock_irq(&priv->lock);
+}
+
+static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id,
+					struct ipoib_cm_rx *p, unsigned psn)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+	u32 qp_num, index;
+	u64 i, recv_mem_used;
+
+	qp_num = p->qp->qp_num;
+
+	/* In the SRQ case there is a common rx buffer called the srq_ring.
+	 * However, for the NOSRQ we create an rx_ring for every
+	 * struct ipoib_cm_rx.
+	 */
+	p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL);
+	if (!p->rx_ring) {
+		printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n",
+		       qp_num);
+		return -ENOMEM;
+	}
+
+	spin_lock_irq(&priv->lock);
+	list_add(&p->list, &priv->cm.passive_ids);
+	spin_unlock_irq(&priv->lock);
+
+	init_context_and_add_list(cm_id, p, priv);
+	spin_lock_irq(&priv->lock);
+
+	for (index = 0; index < max_rc_qp; index++)
+		if (priv->cm.rx_index_table[index] == NULL)
+			break;
+
+	recv_mem_used = (u64)ipoib_recvq_size *
+			(u64)atomic_inc_return(&current_rc_qp)
+			* CM_PACKET_SIZE; /* packets are 64K */
+	if ((index == max_rc_qp) ||
+	( recv_mem_used >= max_recv_buf * (1ul << 20))) {
+		spin_unlock_irq(&priv->lock);
+		ipoib_warn(priv, "NOSRQ has reached the configurable limit "
+			   "of either %d RC QPs or, max recv buf size of "
+			   "0x%x MB\n", max_rc_qp, max_recv_buf);
+
+		/* We send a REJ to the remote side indicating that we
+		 * have no more free RC QPs and leave it to the remote side
+		 * to take appropriate action. This should leave the
+		 * current set of QPs unaffected and any subsequent REQs
+		 * will be able to use RC QPs if they are available.
+		 */
+		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
+		ret = -EINVAL;
+		goto err_alloc_and_post;
+	}
+
+	priv->cm.rx_index_table[index] = p;
+	spin_unlock_irq(&priv->lock);
+
+	/* We will subsequently use this stored pointer while freeing
+	 * resources in stale task
+	 */
+	p->index = index;
+
+	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
+		ipoib_cm_dev_cleanup(dev);
+		goto err_alloc_and_post;
+	}
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
+					   IPOIB_CM_RX_SG - 1,
+					   p->rx_ring[i].mapping)) {
+			ipoib_warn(priv, "failed to allocate receive "
+				   "buffer %llu\n", (unsigned long long)i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -ENOMEM;
+			goto err_alloc_and_post;
+		}
+
+		if (post_receive_nosrq(dev, i << 32 | index)) {
+			ipoib_warn(priv, "post_receive_nosrq "
+				   "failed for buf %llu\n", (unsigned long long)i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -EIO;
+			goto err_alloc_and_post;
+		}
+	}
+
+	return 0;
+
+err_alloc_and_post:
+	kfree(p->rx_ring);
+	return ret;
+}
+
 static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 {
 	struct net_device *dev = cm_id->context;
@@ -298,13 +473,13 @@ static int ipoib_cm_req_handler(struct i
 
 	ipoib_dbg(priv, "REQ arrived\n");
 	p = kzalloc(sizeof *p, GFP_KERNEL);
-	if (!p)
+	if (!p) {
+		printk(KERN_WARNING "Failed to allocate RX control block when "
+		       "REQ arrived\n");
 		return -ENOMEM;
+	}
 	p->dev = dev;
 	p->id = cm_id;
-	cm_id->context = p;
-	p->state = IPOIB_CM_RX_LIVE;
-	p->jiffies = jiffies;
 	INIT_LIST_HEAD(&p->list);
 
 	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -314,19 +489,21 @@ static int ipoib_cm_req_handler(struct i
 	}
 
 	psn = random32() & 0xffffff;
-	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
-	if (ret)
-		goto err_modify;
+	if (!priv->cm.srq) {
+		ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn);
+		if (ret)
+			goto err_post_nosrq;
+	} else {
+		p->rx_ring = NULL;
+		ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+		if (ret)
+			goto err_modify;
+	}
 
-	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
-			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
-	/* Add this entry to passive ids list head, but do not re-add it
-	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
-	p->jiffies = jiffies;
-	if (p->state == IPOIB_CM_RX_LIVE)
-		list_move(&p->list, &priv->cm.passive_ids);
-	spin_unlock_irq(&priv->lock);
+	if (priv->cm.srq) {
+		p->state = IPOIB_CM_RX_LIVE;
+		init_context_and_add_list(cm_id, p, priv);
+	}
 
 	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn);
 	if (ret) {
@@ -336,6 +513,9 @@ static int ipoib_cm_req_handler(struct i
 	}
 	return 0;
 
+err_post_nosrq:
+	list_del_init(&p->list);
+	atomic_dec(&current_rc_qp);
 err_modify:
 	ib_destroy_qp(p->qp);
 err_qp:
@@ -399,29 +579,60 @@ static void skb_put_frags(struct sk_buff
 	}
 }
 
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed. */
+		if (!list_empty(&p->list))
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
+	u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV;
 	struct sk_buff *skb, *newskb;
 	struct ipoib_cm_rx *p;
 	unsigned long flags;
 	u64 mapping[IPOIB_CM_RX_SG];
-	int frags;
+	int frags, ret;
 
 	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
 		       wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) {
+		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
 			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
-			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-				   wr_id, ipoib_recvq_size);
+			ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> 0x%x)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -429,23 +640,15 @@ void ipoib_cm_handle_rx_wc(struct net_de
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		ipoib_dbg(priv, "cm recv error "
-			   "(status=%d, wrid=%d vend_err %x)\n",
-			   wc->status, wr_id, wc->vendor_err);
+			   "(status=%d, wrid=0x%llx vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
 		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_srq;
 	}
 
 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
 		p = wc->qp->qp_context;
-		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
-			spin_lock_irqsave(&priv->lock, flags);
-			p->jiffies = jiffies;
-			/* Move this entry to list head, but do not re-add it
-			 * if it has been moved out of list. */
-			if (p->state == IPOIB_CM_RX_LIVE)
-				list_move(&p->list, &priv->cm.passive_ids);
-			spin_unlock_irqrestore(&priv->lock, flags);
-		}
+		timer_check_srq(priv, p);
 	}
 
 	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
@@ -457,13 +660,111 @@ void ipoib_cm_handle_rx_wc(struct net_de
 		 * If we can't allocate a new RX buffer, dump
 		 * this packet and reuse the old buffer.
 		 */
-		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+		ipoib_dbg(priv, "failed to allocate receive buffer %llu\n", (unsigned long long)wr_id);
+		++priv->stats.rx_dropped;
+		goto repost_srq;
+	}
+
+	ipoib_cm_dma_unmap_rx(priv, frags,
+			      priv->cm.srq_ring[wr_id].mapping);
+	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
+
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb_reset_mac_header(skb);
+	skb_pull(skb, IPOIB_ENCAP_LEN);
+
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
+
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_receive_skb(skb);
+
+repost_srq:
+	ret = post_receive_srq(dev, wr_id);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_srq failed for buf %llu\n",
+			   (unsigned long long)wr_id);
+
+}
+
+static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb, *newskb;
+	u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32;
+	u32 index;
+	struct ipoib_cm_rx *rx_ptr;
+	int frags, ret;
+
+
+	ipoib_dbg_data(priv, "cm recv completion: id %llu, status: %d\n",
+		       (unsigned long long)wr_id, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> %d)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ;
+
+	/* This is the only place where rx_ptr could be a NULL - could
+	 * have just received a packet from a connection that has become
+	 * stale and so is going away. We will simply drop the packet and
+	 * let the hardware (it's IB_QPT_RC) handle the dropped packet.
+	 * In the timer_check_nosrq() function below, p->jiffies is updated and
+	 * hence the connection will not be stale after that.
+	 */
+	rx_ptr = priv->cm.rx_index_table[index];
+	if (unlikely(!rx_ptr)) {
+		ipoib_warn(priv, "Received packet from a connection "
+			   "that is going away. Hardware will handle it.\n");
+		return;
+	}
+
+	skb = rx_ptr->rx_ring[wr_id].skb;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ipoib_dbg(priv, "cm recv error "
+			   "(status=%d, wrid=0x%llx vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
+		++priv->stats.rx_dropped;
+		goto repost_nosrq;
+	}
+
+	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+		/* There are no guarantees that wc->qp is not NULL for HCAs
+		 * that do not support SRQ. */
+		timer_check_nosrq(priv, rx_ptr);
+	}
+
+	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
+					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
+
+	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags,
+				       mapping);
+	if (unlikely(!newskb)) {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		ipoib_dbg(priv, "failed to allocate receive buffer %llu\n", (unsigned long long)wr_id);
 		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_nosrq;
 	}
 
-	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
-	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
+	ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping);
+	memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
 
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
@@ -483,10 +784,22 @@ void ipoib_cm_handle_rx_wc(struct net_de
 	skb->pkt_type = PACKET_HOST;
 	netif_receive_skb(skb);
 
-repost:
-	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
-		ipoib_warn(priv, "ipoib_cm_post_receive failed "
-			   "for buf %d\n", wr_id);
+repost_nosrq:
+	ret = post_receive_nosrq(dev, wr_id << 32 | index);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_nosrq failed for buf %llu\n",
+			   (unsigned long long)wr_id);
+}
+
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->cm.srq)
+		handle_rx_wc_srq(dev, wc);
+	else
+		handle_rx_wc_nosrq(dev, wc);
 }
 
 static inline int post_send(struct ipoib_dev_priv *priv,
@@ -678,6 +991,42 @@ err_cm:
 	return ret;
 }
 
+static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	int i;
+
+	for (i = 0; i < ipoib_recvq_size; ++i)
+		if (p->rx_ring[i].skb) {
+			ipoib_cm_dma_unmap_rx(priv,
+					 IPOIB_CM_RX_SG - 1,
+					 p->rx_ring[i].mapping);
+			dev_kfree_skb_any(p->rx_ring[i].skb);
+			p->rx_ring[i].skb = NULL;
+		}
+	kfree(p->rx_ring);
+}
+
+void dev_stop_nosrq(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_cm_rx *p;
+
+	spin_lock_irq(&priv->lock);
+	while (!list_empty(&priv->cm.passive_ids)) {
+		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		free_resources_nosrq(priv, p);
+		list_del(&p->list);
+		spin_unlock_irq(&priv->lock);
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		atomic_dec(&current_rc_qp);
+		kfree(p);
+		spin_lock_irq(&priv->lock);
+	}
+	spin_unlock_irq(&priv->lock);
+
+	cancel_delayed_work(&priv->cm.stale_task);
+}
+
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -692,6 +1041,11 @@ void ipoib_cm_dev_stop(struct net_device
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
 
+	if (!priv->cm.srq) {
+		dev_stop_nosrq(priv);
+		return;
+	}
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
@@ -737,6 +1091,7 @@ void ipoib_cm_dev_stop(struct net_device
 		kfree(p);
 	}
 
+
 	cancel_delayed_work(&priv->cm.stale_task);
 }
 
@@ -815,7 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
 	attr.recv_cq = priv->cq;
 	attr.srq = priv->cm.srq;
 	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_recv_wr = 1;
 	attr.cap.max_send_sge = 1;
+	attr.cap.max_recv_sge = 1;
 	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
 	attr.qp_type = IB_QPT_RC;
 	attr.send_cq = cq;
@@ -855,7 +1212,7 @@ static int ipoib_cm_send_req(struct net_
 	req.retry_count 	      = 0; /* RFC draft warns against retries */
 	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
 	req.max_cm_retries 	      = 15;
-	req.srq 	              = 1;
+	req.srq			      = !!priv->cm.srq;
 	return ib_send_cm_req(id, &req);
 }
 
@@ -1200,6 +1557,9 @@ static void ipoib_cm_rx_reap(struct work
 	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
+		if (!priv->cm.srq) {
+			atomic_dec(&current_rc_qp);
+		}
 		kfree(p);
 	}
 }
@@ -1218,12 +1578,19 @@ static void ipoib_cm_stale_task(struct w
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_move(&p->list, &priv->cm.rx_error_list);
-		p->state = IPOIB_CM_RX_ERROR;
-		spin_unlock_irq(&priv->lock);
-		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
-		if (ret)
-			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		if (!priv->cm.srq) {
+			free_resources_nosrq(priv, p);
+			list_del_init(&p->list);
+			priv->cm.rx_index_table[p->index] = NULL;
+			spin_unlock_irq(&priv->lock);
+		} else {
+			list_move(&p->list, &priv->cm.rx_error_list);
+			p->state = IPOIB_CM_RX_ERROR;
+			spin_unlock_irq(&priv->lock);
+			ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+			if (ret)
+				ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		}
 		spin_lock_irq(&priv->lock);
 	}
 
@@ -1277,16 +1644,40 @@ int ipoib_cm_add_mode_attr(struct net_de
 	return device_create_file(&dev->dev, &dev_attr_mode);
 }
 
+static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv)
+{
+	struct ib_srq_init_attr srq_init_attr;
+	int ret;
+
+	srq_init_attr.attr.max_wr = ipoib_recvq_size;
+	srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG;
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size *
+				    sizeof *priv->cm.srq_ring,
+				    GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring "
+		       "(%d entries)\n",
+			priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 int ipoib_cm_dev_init(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_srq_init_attr srq_init_attr = {
-		.attr = {
-			.max_wr  = ipoib_recvq_size,
-			.max_sge = IPOIB_CM_RX_SG
-		}
-	};
 	int ret, i;
+	struct ib_device_attr attr;
 
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
@@ -1303,20 +1694,34 @@ int ipoib_cm_dev_init(struct net_device 
 
 	skb_queue_head_init(&priv->cm.skb_queue);
 
-	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
-	if (IS_ERR(priv->cm.srq)) {
-		ret = PTR_ERR(priv->cm.srq);
-		priv->cm.srq = NULL;
+	ret = ib_query_device(priv->ca, &attr);
+	if (ret)
 		return ret;
-	}
 
-	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
-				    GFP_KERNEL);
-	if (!priv->cm.srq_ring) {
-		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
-		       priv->ca->name, ipoib_recvq_size);
-		ipoib_cm_dev_cleanup(dev);
-		return -ENOMEM;
+	if (attr.max_srq) {
+		/* This device supports SRQ */
+		ret = create_srq(dev, priv);
+		if (ret)
+			return ret;
+		priv->cm.rx_index_table = NULL;
+	} else {
+		priv->cm.srq = NULL;
+		priv->cm.srq_ring = NULL;
+
+		/* Every new REQ that arrives creates a struct ipoib_cm_rx.
+		 * These structures form a linked list starting with the
+		 * passive_ids. For quick and easy access we maintain a table
+		 * of pointers to struct ipoib_cm_rx called the rx_index_table
+		 */
+		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
+					 sizeof *priv->cm.rx_index_table,
+					 GFP_KERNEL);
+		if (!priv->cm.rx_index_table) {
+			printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n");
+			return -ENOMEM;
+		}
+
+		atomic_set(&current_rc_qp, 0);
 	}
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -1329,17 +1734,24 @@ int ipoib_cm_dev_init(struct net_device 
 	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
 	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;
 
-	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
+	/* One can post receive buffers even before the RX QP is created
+	 * only in the SRQ case. Therefore for NOSRQ we skip the rest of init
+	 * and do that in ipoib_cm_req_handler()
+	 */
+
+	if (priv->cm.srq) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
 					   priv->cm.srq_ring[i].mapping)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -ENOMEM;
-		}
-		if (ipoib_cm_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -EIO;
+				ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -ENOMEM;
+			}
+			if (post_receive_srq(dev, i)) {
+				ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -EIO;
+			}
 		}
 	}
 
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-10 18:30:10.000000000 -0400
@@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i
 		for (i = 0; i < n; ++i) {
 			struct ib_wc *wc = priv->ibwc + i;
 
-			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+			if (wc->wr_id & IPOIB_CM_OP_RECV) {
 				++done;
 				--max;
 				ipoib_cm_handle_rx_wc(dev, wc);
@@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d
 	do {
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
-			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV)
 				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
 				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-17 20:09:25.000000000 -0400
@@ -175,6 +175,15 @@ int ipoib_transport_dev_init(struct net_
 	if (!ret)
 		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
+	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
+	 * overflow. Every new REQ creates a new RX QP and each QP has an
+	 * RX ring associated with it. Therefore we could have
+	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
+	 * in a CQ.
+	 */
+	if (!priv->cm.srq)
+		size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size;
+
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
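
For reference, the CQ sizing in the hunk above can be worked through in isolation. This is only a sketch: the helper name cq_size() and its parameters are made up for illustration, and it assumes that size starts out as ipoib_sendq_size before the lines shown, which the hunk itself does not show.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the arithmetic in the hunk above.
 * Assumption: "size" starts at the send queue size before these lines.
 */
static int cq_size(int sendq_size, int recvq_size, int cm_init_ok,
		   int have_srq, int index_table_size)
{
	int size = sendq_size;

	assert(index_table_size >= 1);

	if (cm_init_ok)
		size += recvq_size + 1;	/* 1 extra for rx_drain_qp */

	/* NOSRQ: one RX ring per QP, up to index_table_size QPs */
	if (!have_srq)
		size += (index_table_size - 1) * recvq_size;

	return size;
}
```

With made-up values of 64 send entries, 128 receive entries and a 128-entry index table, the SRQ case gives 64 + 128 + 1 = 193 CQEs, while the NOSRQ case gives 64 + 128 * 128 + 1 = 16449, i.e. the figure in the comment plus the one extra entry for rx_drain_qp.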


From pradeeps at linux.vnet.ibm.com  Wed Jul 18 10:26:12 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Jul 2007 10:26:12 -0700
Subject: [ofa-general] IPOIB CM (NOSRQ) extension [PATCH V2] patch resubmit
Message-ID: <469E4D34.10903@linux.vnet.ibm.com>

Resubmitting the 2nd version of the patch. I changed the settings
in my mail client, so there should be no more line wraps. The
white space mangling has also been fixed.

Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---

--- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 21:08:38.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-18 12:49:06.000000000 -0400
@@ -1372,8 +1372,18 @@ static int ipoib_cm_tx_handler(struct ib
 			ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
 				       NULL, 0, NULL, 0);
 		break;
-	case IB_CM_REQ_ERROR:
 	case IB_CM_REJ_RECEIVED:
+		ipoib_warn(priv, "REJ received\n");
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+		spin_unlock(&priv->lock);
+
+		if ((neigh) && (event->param.rej_rcvd.reason ==
+		   IB_CM_REJ_NO_QP)) {
+			clear_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags);
+			break;
+		}
+	case IB_CM_REQ_ERROR:
 	case IB_CM_TIMEWAIT_EXIT:
 		ipoib_dbg(priv, "CM error %d.\n", event->event);
 		spin_lock_irq(&priv->tx_lock);
--- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-18 12:50:05.000000000 -0400
@@ -679,11 +679,10 @@ static int ipoib_start_xmit(struct sk_bu
 
 		neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (ipoib_cm_get(neigh)) {
-			if (ipoib_cm_up(neigh)) {
+		if (ipoib_cm_get(neigh) &&  ipoib_cm_up(neigh) &&
+			test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) {
 				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
 				goto out;
-			}
 		} else if (neigh->ah) {
 			if (unlikely(memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,


From clark.tucker at gmail.com  Wed Jul 18 10:42:11 2007
From: clark.tucker at gmail.com (Clark Tucker)
Date: Wed, 18 Jul 2007 11:42:11 -0600
Subject: [ofa-general] rping / librdmacm deadlock question
Message-ID: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>

Hello all,

First, the background: I am writing a linux device driver to provide IWarp
device support for our hardware.  I'm currently running kernel 2.6.20-rc4 and
OFED-1.2-rc2.  I realize these are somewhat old, but I have examined newer
source, and haven't found any changes that seem immediately relevant.

I am experiencing the following behavior:

rping -s ....  (server starts fine, loads proper user-space library, etc)

rping -c ... (client starts fine, ... connects to server, and exchanges data
successfully)
So far so good.

If I interrupt the rping client with CTRL-C, then the client hangs hard.

I have, I believe, traced this to a deadlock between ib_destroy_qp() and
ucma_close(). It looks like librdmacm has a ((destructor)) function defined
that results in a call to ibv_device_close() and ultimately in
<device>::destroy_qp().   That seems reasonable, and it all happens as the
OS unloads the application.

However, it is (I believe) happening before the "rdma_cm" device file
descriptor is 'closed' by the OS as the application terminates.
[rdma_destroy_event_channel() would normally do this, but it doesn't get
called when the application is interrupted by SIGINT.]

Our driver (as do all drivers I've seen) performs an
atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'.
Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet
been called), a cm_id still has an active reference to the qp, and the
wait_event() will end up 'wait'ing.

So, the application cleanup process is blocked, essentially waiting for
kernel::ucma_close() to be called ... which won't happen because the
application unload code is blocked in destroy_qp()  ==> deadlock.

First, does my analysis make sense?

Perhaps my device driver should do additional work in ib_destroy_qp() that
will trigger the destruction of the cm_id... [but that doesn't seem
consistent with other drivers I've seen.]

Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
device is closed before calling ibv_device_close()?

I'm just not sure if this is a driver issue, an application issue, or
something in between.
Also, I don't have access to any other IWarp hardware, so I can't test this
scenario in a different environment...

Any help/advice would be greatly appreciated!

Thanks for your time,
--Clark Tucker

From rdreier at cisco.com  Wed Jul 18 10:58:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 10:58:35 -0700
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	(Clark Tucker's message of "Wed, 18 Jul 2007 11:42:11 -0600")
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
Message-ID: <adak5sxlhkk.fsf@cisco.com>

 > Our driver (as do all drivers I've seen) performs an
 > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'.
 > Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet
 > been called), a cm_id still has an active reference to the qp, and the
 > wait_event() will end up 'wait'ing.

In the other drivers I know well (basically mthca and mlx4, since I
wrote them), the qp->refcount being waited for is an internal driver
refcount, and is used to make sure that the destroy QP operation waits
until any active interrupt handlers are done with the QP.  So I think
the problem is that you are letting a cm_id bump the QP's reference
count somehow.

 > Perhaps my device driver should do additional work in ib_destroy_qp() that
 > will trigger the destruction of the cm_id... [but that doesn't seem
 > consistent with other drivers I've seen.]

That doesn't make sense.  I think it's OK if upper layers are left
with a stale pointer to your QP -- let them worry about it.  Maybe
it's an iWARP thing that I don't really understand (I'm much more
familiar with the IB driver interface) but I don't think that the
cxgb3 driver runs into this issue.

 > Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
 > device is closed before calling ibv_device_close()?

No, because then some other (possibly malicious) app could still cause
the deadlock and potentially create a bunch of unkillable processes.

 - R.


From clark.tucker at gmail.com  Wed Jul 18 11:18:10 2007
From: clark.tucker at gmail.com (Clark Tucker)
Date: Wed, 18 Jul 2007 12:18:10 -0600
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <adak5sxlhkk.fsf@cisco.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<adak5sxlhkk.fsf@cisco.com>
Message-ID: <b8fd36840707181118q3f8916f3s1fb20875e912ab19@mail.gmail.com>

Thanks for the quick reply.  Comments below.

On 7/18/07, Roland Dreier <rdreier at cisco.com> wrote:
>
> > Our driver (as do all drivers I've seen) performs an
> > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'.
> > Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet
> > been called), a cm_id still has an active reference to the qp, and the
> > wait_event() will end up 'wait'ing.
>
> In the other drivers I know well (basically mthca and mlx4, since I
> wrote them), the qp->refcount being waited for is an internal driver
> refcount, and is used to make sure that the destroy QP operation waits
> until any active interrupt handlers are done with the QP.  So I think
> the problem is that you are letting a cm_id bump the QP's reference
> count somehow.


I guess this really is relevant only for IWarp.  Other IWarp drivers I've
seen do an atomic_inc(&qp->refcount) in <device>::qp_add_ref(), which is,
I believe, called via cm_id->device->iwcm->add_ref() [for example, see
iwcm.c::iw_cm_connect()].
This reference is removed by a call to cm_id->device->iwcm->rem_ref() [for
example, see iwcm.c::destroy_cm_id()].

And, to avoid a deadlock, I still believe that this must happen _before_
ib_uverbs_close() [ and ultimately ib_destroy_qp()] is called.

> Perhaps my device driver should do additional work in ib_destroy_qp() that
> > will trigger the destruction of the cm_id... [but that doesn't seem
> > consistent with other drivers I've seen.]
>
> That doesn't make sense.  I think it's OK if upper layers are left
> with a stale pointer to your QP -- let them worry about it.  Maybe
> it's an iWARP thing that I don't really understand (I'm much more
> familiar with the IB driver interface) but I don't think that the
> cxgb3 driver runs into this issue.
>
> > Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm"
> > device is closed before calling ibv_device_close()?
>
> No, because then some other (possibly malicious) app could still cause
> the deadlock and potentially create a bunch of unkillable processes.


Very true...good point.

- R.
>

From mshefty at ichips.intel.com  Wed Jul 18 11:49:30 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 11:49:30 -0700
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
Message-ID: <469E60BA.3070703@ichips.intel.com>

> I have, I believe, traced this to a deadlock between ib_destroy_qp() and 
> ucma_close(). It looks like librdmacm has a ((destructor)) function 
> defined that results in a call to ibv_device_close() and ultimately in 
> <device>::destroy_qp().   That seems reasonable, and it all happens as 
> the OS unloads the application. 
> 
> However, it is (I believe) happening before the "rdma_cm" device file 
> descriptor is 'closed' by the OS as the application terminates.  
> [rdma_destroy_event_channel() would normally do this, but it doesn't get 
> called when the application is interrupted by SIGINT.]

This seems like an iWarp specific issue caused by the following code in 
iw_cm_connect():

	/* Get the ib_qp given the QPN */
	qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn);
	if (!qp) {
		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
		return -EINVAL;
	}
	cm_id->device->iwcm->add_ref(qp);

I think the reference is normally removed in cm_close_handler:

	if (cm_id_priv->qp) {
		cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
		cm_id_priv->qp = NULL;
	}


The upstream iWarp drivers must already be able to handle this 
situation, or I'm sure we would have seen the problem before.  I'm just 
not familiar enough with the iWarp drivers to see what they do to handle 
  it.  I'll continue reading through the code, but maybe Steve can 
explain how to avoid the problem.

I wonder if it would be better if the iWarp CM acquired/released the QP 
reference on a per call basis, rather than holding a reference 
throughout the entire connection.

- Sean


From swise at opengridcomputing.com  Wed Jul 18 12:03:20 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 18 Jul 2007 14:03:20 -0500
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <469E60BA.3070703@ichips.intel.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<469E60BA.3070703@ichips.intel.com>
Message-ID: <469E63F8.90209@opengridcomputing.com>

Sean Hefty wrote:
>> I have, I believe, traced this to a deadlock between ib_destroy_qp() 
>> and ucma_close(). It looks like librdmacm has a ((destructor)) 
>> function defined that results in a call to ibv_device_close() and 
>> ultimately in <device>::destroy_qp().   That seems reasonable, and it 
>> all happens as the OS unloads the application.
>> However, it is (I believe) happening before the "rdma_cm" device file 
>> descriptor is 'closed' by the OS as the application terminates.  
>> [rdma_destroy_event_channel() would normally do this, but it doesn't 
>> get called when the application is interrupted by SIGINT.]
> 
> This seems like an iWarp specific issue caused by the following code in 
> iw_cm_connect():
> 
>     /* Get the ib_qp given the QPN */
>     qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn);
>     if (!qp) {
>         spin_unlock_irqrestore(&cm_id_priv->lock, flags);
>         return -EINVAL;
>     }
>     cm_id->device->iwcm->add_ref(qp);
> 
> I think the reference is normally removed in cm_close_handler:
> 
>     if (cm_id_priv->qp) {
>         cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>         cm_id_priv->qp = NULL;
>     }
> 
> 
> The upstream iWarp drivers must already be able to handle this 
> situation, or I'm sure we would have seen the problem before.  I'm just 
> not familiar enough with the iWarp drivers to see what they do to handle 
>  it.  I'll continue reading through the code, but maybe Steve can 
> explain how to avoid the problem.
> 
> I wonder if it would be better if the iWarp CM acquired/released the QP 
> reference on a per call basis, rather than holding a reference 
> throughout the entire connection.
> 

The design assumes the iwcm can hold this reference and cache the qp ptr.
In the iwarp design, the cm_id (connection) and qp are tightly bound
once the connection is transitioned into rdma mode.  This is different
from infiniband.

I still don't see the deadlock?


Steve.


From tziporet at dev.mellanox.co.il  Wed Jul 18 12:06:55 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 22:06:55 +0300
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma
	reads	per qp
In-Reply-To: <adafy3mrdsd.fsf@cisco.com>
References: <200707171311.43680.jackm@dev.mellanox.co.il>	<adak5syri4s.fsf@cisco.com>
	<469D17C2.3040403@mellanox.co.il> <adafy3mrdsd.fsf@cisco.com>
Message-ID: <469E64CF.8000607@mellanox.co.il>

Roland Dreier wrote:
> Have you tested this to know whether it matters?  Increasing the limit
> uses more memory per QP...
>   
It gives some benefit, but not as substantial as in ConnectX, so I guess
we do not need this after all.
> Does the rdma read latency test in OFED queue up enough work requests
> to measure this?
>   
yes

Tziporet


From sashak at voltaire.com  Wed Jul 18 12:22:18 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 18 Jul 2007 22:22:18 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
Message-ID: <20070718192217.GE27878@sashak.voltaire.com>

Hi Eitan,

On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> Hi Sasha
> 
> When QoS setup is done the code was trying to send updates of
> vl_arb_high_limit by req_set of PORT_INFO with the new data.
> However, at that stage the SM still did not assign LIDs to the ports.
> So the sent PortInfo.base_lid was still zero. The specification does not
> allow for such LIDs (they are considered illegal).
> 
> the patch below fixes this by storing the calculated value and later 
> using it in link and lid managers.

Good, Thanks (and this also saves one PortInfo update MAD). One question below:


> 
> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> 

[snip...]

> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> index bc3f8b3..ed76382 100644
> --- a/opensm/opensm/osm_lid_mgr.c
> +++ b/opensm/opensm/osm_lid_mgr.c
> @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
>             ib_port_info_get_port_state(p_old_pi) )
>          send_set = TRUE;
>      }
> +
> +	 /* provide the vl_high_limit from the qos mgr */
> +	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
> +		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> +		 {
> +			 send_set = TRUE;
> +			 p_pi->vl_high_limit = p_physp->vl_high_limit;
> +		 }

This part of the code is for port_num != 0, so VLHighLimit setup will be
skipped for switch enhanced port 0. Is that expected? If so, why?

Sasha


From jgunthorpe at obsidianresearch.com  Wed Jul 18 12:27:45 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 18 Jul 2007 13:27:45 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <ada7ioymfsw.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
	<20070718050928.GA3103@obsidianresearch.com>
	<ada7ioymfsw.fsf@cisco.com>
Message-ID: <20070718192745.GY13618@obsidianresearch.com>

On Tue, Jul 17, 2007 at 10:39:11PM -0700, Roland Dreier wrote:
>  > IMHO, I still think that without some kind of SM/SA sourced
>  > invalidation mechanism all client side caching (including the ipoib
>  > stuff we have now) is a bad idea.
> 
> But for IPoIB at least doing a path lookup for every packet is
> obviously not feasible.  And ARP table aging gives a way to recover
> from stale cached data, eventually at least.

Well, aside from Michael's points about the current implementation,
even a perfect version relying only on ARP will still have annoying
failure modes. ARP in ethernet has a built-in means to revoke a bad
MAC, and IB will also be able to revoke a bad GID - but since the path
information in incoming ARP LRHs isn't used, it doesn't fix changes in
the network caused by the SM. ARP entry aging helps, but IIRC there
are cases where aging can be slowed if the right packets are Rx'd.

Also, I think I meant 'bad idea' ==> 'has annoying and subtle failure
modes' - UD ipoib definitely needs to cache LRH data with ARP entries.

Jason


From jgunthorpe at obsidianresearch.com  Wed Jul 18 12:27:50 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 18 Jul 2007 13:27:50 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
References: <20070718050928.GA3103@obsidianresearch.com>
	<000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
Message-ID: <20070718192750.GA8931@obsidianresearch.com>

On Tue, Jul 17, 2007 at 11:04:54PM -0700, Sean Hefty wrote:
> >IMHO, I still think that without some kind of SM/SA sourced
> >invalidation mechanism all client side caching (including the ipoib
> >stuff we have now) is a bad idea.
 
> Nothing precludes a user space daemon from updating the cache at
> timed intervals, or from communicating with an SA in some vendor
> defined way to maintain coherency.  I'm only trying to provide the
> kernel framework.  (We can debate whether another framework would
> have been better, and I've held this discussion on the list
> before...)  I do envision someone creating user space applications
> to control refreshes and, with local SA extensions, allow
> pre-loading of the cache, updates to specific paths, etc.

So, my main concern is with the role of kernel caching and especially with
how control is exported to user space.

Clearly the kernel needs a fast lookup cache for things like ipoib and
others. I don't think a kernel module needs or wants a full on
distributed SA.

I personally think a simple in-kernel (small) fast lookup cache merged
with the ipoib cache that has a netlink interface to userspace to
add/delete/flush entries is a very good solution that will keep being
useful in future. netlink would also carry cache miss queries to
userspace. In the absence of a daemon the kernel could query on its own
but cache very conservatively. A userspace version of the very
aggressive cache you have now could also be created right away.
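
To make the shape of that proposal concrete, here is a minimal userspace sketch of a small lookup cache with add/delete/flush operations. It is not taken from any existing code: every name here (path_entry, cache_add(), cache_delete(), cache_flush(), CACHE_SLOTS) is hypothetical, the key/value layout is an assumption, and the netlink transport is left out entirely. A lookup miss simply returns NULL at the point where a real design would forward the query to userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 16	/* hypothetical: deliberately small */

struct path_entry {
	unsigned char dgid[16];	/* lookup key (assumed: destination GID) */
	unsigned short dlid;	/* cached result (assumed: destination LID) */
	int valid;
};

static struct path_entry cache[CACHE_SLOTS];

/* Fast path: linear scan is fine for a table this small. */
static struct path_entry *cache_lookup(const unsigned char dgid[16])
{
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].valid && !memcmp(cache[i].dgid, dgid, 16))
			return &cache[i];
	return NULL;	/* miss: a real design would query userspace here */
}

/* Add or update an entry; returns -1 when the table is full. */
static int cache_add(const unsigned char dgid[16], unsigned short dlid)
{
	struct path_entry *e = cache_lookup(dgid);

	for (size_t i = 0; !e && i < CACHE_SLOTS; i++)
		if (!cache[i].valid)
			e = &cache[i];
	if (!e)
		return -1;

	memcpy(e->dgid, dgid, 16);
	e->dlid = dlid;
	e->valid = 1;
	return 0;
}

/* Invalidate one entry, e.g. after an SM-sourced change notification. */
static int cache_delete(const unsigned char dgid[16])
{
	struct path_entry *e = cache_lookup(dgid);

	if (!e)
		return -1;
	e->valid = 0;
	return 0;
}

/* Drop everything, e.g. on an SM change. */
static void cache_flush(void)
{
	memset(cache, 0, sizeof(cache));
}
```

The point of keeping the table small and the operations explicit is that coherency policy stays outside the kernel: whoever drives add/delete/flush decides how aggressive the caching is.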

This is because I firmly do not believe in caching as a solution to the
scalability problems. It must be solved with some level of replication
and distribution of the SA data and algorithms. With that view,
pre-loading a giant kernel cache is exactly the wrong kind of
user<->kernel interface.

Maybe you could summarise how the user/kernel interface works?  The
last I saw was something based on MADs that looked very inefficient
compared with netlink.

Jason


From mshefty at ichips.intel.com  Wed Jul 18 12:28:02 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 12:28:02 -0700
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <469E63F8.90209@opengridcomputing.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<469E60BA.3070703@ichips.intel.com>
	<469E63F8.90209@opengridcomputing.com>
Message-ID: <469E69C2.6070404@ichips.intel.com>

>> I wonder if it would be better if the iWarp CM acquired/released the 
>> QP reference on a per call basis, rather than holding a reference 
>> throughout the entire connection.
>>
> 
> The design assumes the iwcm can hold this reference and cache the qp ptr.
> In the iwarp design, the cm_id (connection) and qp are tightly bound
> once the connection is transitioned into rdma mode.  This is different
> from infiniband.

I don't know if this tight binding is necessary in the implementation. 
The cm_id could store the qpn, rather than a pointer to the structure. 
When necessary, the qp pointer could be acquired using the qpn, then 
released at the end of the function call.  I don't think we need to hold 
the reference on the qp structure for the entire connection.

I'm just tossing this out as an idea.  I'm not familiar enough with the 
details to claim that it's a better approach over what's currently done.
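
A rough userspace model of the per-call idea, for illustration only. None of this is iwcm code: qp_table, qp_get(), qp_put(), qp_create(), qp_destroy() and conn_do_op() are hypothetical names, and the refcount is a plain int because the model is single-threaded. The connection stores only the QPN, takes a reference for the duration of one operation, and copes with the QP having gone away.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QPS 8	/* hypothetical table size */

struct qp {
	unsigned int qpn;
	int refcount;	/* plain int: single-threaded model only */
	int in_use;
};

static struct qp qp_table[MAX_QPS];

static struct qp *qp_create(unsigned int qpn)
{
	for (size_t i = 0; i < MAX_QPS; i++) {
		if (!qp_table[i].in_use) {
			qp_table[i].qpn = qpn;
			qp_table[i].refcount = 1;	/* owner's reference */
			qp_table[i].in_use = 1;
			return &qp_table[i];
		}
	}
	return NULL;
}

/* Look up a QP by number and take a short-lived reference. */
static struct qp *qp_get(unsigned int qpn)
{
	for (size_t i = 0; i < MAX_QPS; i++) {
		if (qp_table[i].in_use && qp_table[i].qpn == qpn) {
			qp_table[i].refcount++;
			return &qp_table[i];
		}
	}
	return NULL;	/* QP already destroyed */
}

static void qp_put(struct qp *qp)
{
	assert(qp->refcount > 0);
	qp->refcount--;
}

/* Succeeds only when nobody but the owner holds a reference. */
static int qp_destroy(unsigned int qpn)
{
	for (size_t i = 0; i < MAX_QPS; i++) {
		if (qp_table[i].in_use && qp_table[i].qpn == qpn) {
			if (qp_table[i].refcount != 1)
				return -1;
			qp_table[i].refcount = 0;
			qp_table[i].in_use = 0;
			return 0;
		}
	}
	return -1;
}

struct conn {
	unsigned int qpn;	/* stored instead of a struct qp pointer */
};

/* A CM operation that holds the QP only for the duration of the call. */
static int conn_do_op(struct conn *c)
{
	struct qp *qp = qp_get(c->qpn);

	if (!qp)
		return -1;
	/* ... operate on qp here ... */
	qp_put(qp);
	return 0;
}
```

In this model, destroying the QP between two CM calls never has to wait on the connection; the next call just fails the lookup. The cost is a lookup per call, and every CM path has to handle the QP having disappeared.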

> I still don't see the deadlock?

What happens if a user calls destroy qp immediately after connecting it?

- Sean


From swise at opengridcomputing.com  Wed Jul 18 12:33:33 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 18 Jul 2007 14:33:33 -0500
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
Message-ID: <469E6B0D.9080107@opengridcomputing.com>

Clark Tucker wrote:
> Hello all,
> 
> First, the background: I am writing a linux device driver to provide 
> IWarp device support for our hardware.  I'm currently running kernel 
> 2.6.20-rc4 and OFED-1.2-rc2.  I realize these are somewhat old, but I 
> have examined newer source, and haven't found any changes that seem 
> immediately relevant.
> 
> I am experiencing the following behavior:
> 
> rping -s ....  (server starts fine, loads proper user-space library, etc)
> 
> rping -c ... (client starts fine, ... connects to server, and exchanges 
> data successfully)
> So far so good.
> 
> If I interrupt the rping client with CTRL-C, then the client hangs hard.
> 
> I have, I believe, traced this to a deadlock between ib_destroy_qp() and 
> ucma_close(). It looks like librdmacm has a ((destructor)) function 
> defined that results in a call to ibv_device_close() and ultimately in 
> <device>::destroy_qp().   That seems reasonable, and it all happens as 
> the OS unloads the application. 
> 
> However, it is (I believe) happening before the "rdma_cm" device file 
> descriptor is 'closed' by the OS as the application terminates.  
> [rdma_destroy_event_channel() would normally do this, but it doesn't get 
> called when the application is interrupted by SIGINT.]
> 
> Our driver (as do all drivers I've seen) performs an 
> atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 
> 'destroy_qp()'.   Because the rdma_cm device hasn't been closed (i.e., 
> ucma_close() hasn't yet been called), a cm_id still has an active 
> reference to the qp, and the wait_event() will end up 'wait'ing.
> 

Your destroy_qp() method must destroy the active rdma connection which 
will force the iwcm to release the reference on the qp.  If you look at 
the chelsio driver, you'll see this is done before waiting on the refcnt 
to go to zero:

from iwch_destroy_qp():

>         attrs.next_state = IWCH_QP_STATE_ERROR;
>         iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
>         wait_event(qhp->wait, !qhp->ep);


Once the qhp->ep handle has been disassociated from the qp, the driver 
knows the iwcm has been given the CLOSE event and removed its reference 
on the qp.  Here is the iwcm close event handler.  Note it removes the
ref:

 From cm_close_handler():

>         if (cm_id_priv->qp) {
>                 cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>                 cm_id_priv->qp = NULL;
>         }


It then can wait for any further references from interrupt handlers:

> 
>         atomic_dec(&qhp->refcnt);
>         wait_event(qhp->wait, !atomic_read(&qhp->refcnt));


> Perhaps my device driver should do additional work in ib_destroy_qp() 
> that will trigger the destruction of the cm_id... [but that doesn't seem 
> consistent with other drivers I've seen.]
>

Are you looking at the chelsio or ammaso iwarp drivers?  This code is 
all iwarp specific...
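
The ordering above can be modelled in a few lines of plain C. This is a single-threaded sketch, not driver code: model_qp, model_connect(), model_close_connection() and model_destroy_qp() are hypothetical stand-ins, and where a real driver would block in wait_event() the model just returns -1.

```c
#include <assert.h>
#include <stddef.h>

struct model_qp {
	int refcnt;	/* 1 for the driver, +1 while the CM holds the QP */
	int has_ep;	/* nonzero while a connection endpoint is attached */
};

/* The CM takes its reference when the connection is established. */
static void model_connect(struct model_qp *qp)
{
	assert(!qp->has_ep);
	qp->has_ep = 1;
	qp->refcnt++;
}

/* Stand-in for moving the QP to ERROR and delivering CLOSE to the CM. */
static void model_close_connection(struct model_qp *qp)
{
	if (qp->has_ep) {
		qp->has_ep = 0;
		qp->refcnt--;	/* CM's close handler drops its reference */
	}
}

/*
 * Returns 0 if destroy can complete, -1 where the real code would
 * sleep forever waiting for the refcount to reach zero.
 */
static int model_destroy_qp(struct model_qp *qp, int close_first)
{
	if (close_first)
		model_close_connection(qp);

	qp->refcnt--;	/* drop the driver's own reference */
	return qp->refcnt == 0 ? 0 : -1;
}
```

Calling model_destroy_qp() with close_first set returns 0 on a connected QP; with it cleared, the CM's reference is still outstanding and the function returns -1, which corresponds to the hang Clark described.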


Hope this helps...


Steve


From swise at opengridcomputing.com  Wed Jul 18 12:34:17 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 18 Jul 2007 14:34:17 -0500
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <469E63F8.90209@opengridcomputing.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<469E60BA.3070703@ichips.intel.com>
	<469E63F8.90209@opengridcomputing.com>
Message-ID: <469E6B39.1000206@opengridcomputing.com>

Steve Wise wrote:
> Sean Hefty wrote:
>>> I have, I believe, traced this to a deadlock between ib_destroy_qp() 
>>> and ucma_close(). It looks like librdmacm has a ((destructor)) 
>>> function defined that results in a call to ibv_device_close() and 
>>> ultimately in <device>::destroy_qp().   That seems reasonable, and it 
>>> all happens as the OS unloads the application.
>>> However, it is (I believe) happening before the "rdma_cm" device file 
>>> descriptor is 'closed' by the OS as the application terminates.  
>>> [rdma_destroy_event_channel() would normally do this, but it doesn't 
>>> get called when the application is interrupted by SIGINT.]
>>
>> This seems like an iWarp specific issue caused by the following code 
>> in iw_cm_connect():
>>
>>     /* Get the ib_qp given the QPN */
>>     qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn);
>>     if (!qp) {
>>         spin_unlock_irqrestore(&cm_id_priv->lock, flags);
>>         return -EINVAL;
>>     }
>>     cm_id->device->iwcm->add_ref(qp);
>>
>> I think the reference is normally removed in cm_close_handler:
>>
>>     if (cm_id_priv->qp) {
>>         cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>>         cm_id_priv->qp = NULL;
>>     }
>>
>>
>> The upstream iWarp drivers must already be able to handle this 
>> situation, or I'm sure we would have seen the problem before.  I'm 
>> just not familiar enough with the iWarp drivers to see what they do to 
>> handle  it.  I'll continue reading through the code, but maybe Steve 
>> can explain how to avoid the problem.
>>
>> I wonder if it would be better if the iWarp CM acquired/released the 
>> QP reference on a per call basis, rather than holding a reference 
>> throughout the entire connection.
>>
> 
> The design assumes the iwcm can hold this reference and cache the qp ptr.
> In the iwarp design, the cm_id (connection) and qp are tightly bound
> once the connection is transitioned into rdma mode.  This is different
> from infiniband.
> 
> I still don't see the deadlock?
> 

I've re-read this thread and I think I've posted the answers for Clark...

Steve.

> 
> Steve.
> 


From swise at opengridcomputing.com  Wed Jul 18 12:36:10 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 18 Jul 2007 14:36:10 -0500
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <469E69C2.6070404@ichips.intel.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<469E60BA.3070703@ichips.intel.com>
	<469E63F8.90209@opengridcomputing.com>
	<469E69C2.6070404@ichips.intel.com>
Message-ID: <469E6BAA.8020204@opengridcomputing.com>

Sean Hefty wrote:
>>> I wonder if it would be better if the iWarp CM acquired/released the 
>>> QP reference on a per call basis, rather than holding a reference 
>>> throughout the entire connection.
>>>
>>
>> The design assumes the iwcm can hold this reference and cache the qp
>> ptr.  In the iwarp design, the cm_id (connection) and qp are tightly
>> bound once the connection is transitioned into rdma mode.  This is
>> different from infiniband.
> 
> I don't know if this tight binding is necessary in the implementation. 
> The cm_id could store the qpn, rather than a pointer to the structure. 
> When necessary, the qp pointer could be acquired using the qpn, then 
> released at the end of the function call.  I don't think we need to hold 
> the reference on the qp structure for the entire connection.
> 
> I'm just tossing this out as an idea.  I'm not familiar enough with the 
> details to claim that it's a better approach over what's currently done.

Maybe, but I'm not gonna change this code now.  It was too painful to 
get working... ;-)

> 
>> I still don't see the deadlock?
> 
> What happens if a user calls destroy qp immediately after connecting it?
> 
> 

See my reply to Clark.  The iwarp provider _must_ disassociate the 
endpoint/cm_id from the qp in destroy_qp()...  This involves aborting or 
closing the connection and passing a CLOSE event to the iwcm which 
removes its reference.


Steve.


From clark.tucker at gmail.com  Wed Jul 18 12:57:54 2007
From: clark.tucker at gmail.com (Clark Tucker)
Date: Wed, 18 Jul 2007 13:57:54 -0600
Subject: [ofa-general] rping / librdmacm deadlock question
In-Reply-To: <469E6B0D.9080107@opengridcomputing.com>
References: <b8fd36840707181042t55cdf02rd57619d53057d4bb@mail.gmail.com>
	<469E6B0D.9080107@opengridcomputing.com>
Message-ID: <b8fd36840707181257x6496e34fu468d46ec15e5251f@mail.gmail.com>

Steve,

Thank you.  Looks like this was my problem.  I should have looked more
closely at the chelsio driver.
Sorry for the interruption, and thanks again for your help.

--clark

On 7/18/07, Steve Wise <swise at opengridcomputing.com> wrote:

>
> Your destroy_qp() method must destroy the active rdma connection which
> will force the iwcm to release the reference on the qp.  If you look at
> the chelsio driver, you'll see this is done before waiting on the refcnt
> to go to zero:


....

Hope this helps...
>
>
> Steve
>

From Jonathan.Robertson at 3leafnetworks.com  Wed Jul 18 13:19:27 2007
From: Jonathan.Robertson at 3leafnetworks.com (Jonathan Robertson)
Date: Wed, 18 Jul 2007 13:19:27 -0700
Subject: [ofa-general] libsdp in OFED 1.1
Message-ID: <7C1D552561AF0544ACC7CF6F10E4966ECB5353@chronus.3leafnetworks.corp>

Hello,

I have been using libsdp by preloading it with the application. I
would like to have it preloaded automatically, but I am concerned about
some error messages that appear to be harmless. I don't want our
client to use ld.so.preload if these messages are going to appear.

I see the following when I run a simple 'ls':

# ls
Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found
.
..
#

Any suggestions?

I have the following in libsdp.conf:

Log min-level 9 destination syslog
Use both server netserver *:*
Use both client netperf *:*

Our client is interested in having WebLogic communicate with the Oracle
DB using SDP, with the interfaces to Oracle and WebLogic also accessible
via TCP/IP over Ethernet.

Thanks!

Jonathan


From mshefty at ichips.intel.com  Wed Jul 18 13:52:25 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 13:52:25 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <20070718192750.GA8931@obsidianresearch.com>
References: <20070718050928.GA3103@obsidianresearch.com>	<000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
	<20070718192750.GA8931@obsidianresearch.com>
Message-ID: <469E7D89.7040809@ichips.intel.com>

> So, my main concern is with the role of kernel caching and especially with
> how control is exported to user space.

The only control currently exported by the local SA is a module 
parameter that allows a user to force a refresh of the entire cache.  I 
do not want to extend this until we can get at least some basic PR 
caching functionality merged.

I want something small that we can build on, and the local_sa patch is 
already 1300 lines of code, with another 1000 lines of code to support 
informinfo registration.

> Clearly the kernel needs a fast lookup cache for things like ipoib and
> others. I don't think a kernel module needs or wants a full on
> distributed SA.

We are talking about PR caching only at this point, with possible extensions
to support QoS.  Other SA information is not cached or needed.

For all to all connections, current code does something like the following:

1. Resolves IP addresses to DGIDs using ARP.  This results in IPoIB 
querying the SA and caching 1 PR per DGID.
2. Apps query the SA for PRs, with 1 PR query per DGID.  Eventually 
we'll get back the same set of PRs that IPoIB already had cached.
3. Establish the connections.  The IB CM stores the PR information with 
each connection in order to set the QP attributes properly.

We end up with redundant queries and the PR being cached in multiple 
places.  One optimization is to replace the N PR queries with a single, 
more efficient GetTable query.  A second optimization is to centralize 
the PR caching.  The local SA does the first, and starts us down the 
road of the second.

> I personally think a simple in-kernel (small) fast lookup cache merged
> with the ipoib cache that has a netlink interface to userspace to
> add/delete/flush entries is a very good solution that will keep being
> useful in future. netlink would also carry cache miss queries to
> userspace. In the absence of a daemon the kernel could query on its own
> but cache very conservatively. A userspace version of the very
> aggressive cache you have now could also be created right away.

I believe that the PR caching should be done outside of IPoIB.  Other 
paths may exist that IPoIB does not use.

> This is because I firmly do not believe in caching as a solution to the
> scalability problems. It must be solved with some level of replication
> and distribution of the SA data and algorithms.

PR caching *is* replication of the SA data.  The local SA works with all 
existing SAs.  It is not tied to one vendor, nor does it require changes
to the SAs.  Sure, we can define vendor specific protocols to assist
with/optimize synchronization, but I don't believe it is necessary in an 
initial submission.  (In fact I think it's undesirable at this point, 
since it would require changes to the SA.)

> Maybe you could summarise how the user/kernel interface works?  The
> last I saw was something based on MADs that looked very inefficient
> compared with netlink.

I suggested a MAD interface to the local SA as being the most 
extensible.  It allows interacting with the cache from a local or remote 
node in a very IB fashion.  The local SA is located over QP1, and any 
new protocols can re-use the existing SA MAD format.

For example, the cache could be loaded using a 'SetTable PR' MAD.  It 
doesn't matter if the MAD is sent from a local user space daemon, some 
distributed SA agent, or the master SA.  Paths can be invalidated by 
sending 'Delete PR' MADs.
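
To make the idea concrete, here is a minimal sketch of a cache driven by
those two requests. It is a Python toy, not the local SA implementation:
the PathRecordCache class, its method names and the dict-based record
layout are assumptions made for illustration, and no MAD encoding or QP1
transport is modelled.

```python
class PathRecordCache:
    """Toy path record cache keyed by DGID (hypothetical layout)."""

    def __init__(self):
        self._paths = {}         # dgid -> path record (a plain dict here)

    def __len__(self):
        return len(self._paths)

    def set_table(self, records):
        """Model of a 'SetTable PR' request: load many records at once."""
        for rec in records:
            self._paths[rec["dgid"]] = rec

    def delete(self, dgid):
        """Model of a 'Delete PR' request: invalidate a single path.

        Returns True if a cached record was removed.
        """
        return self._paths.pop(dgid, None) is not None

    def lookup(self, dgid):
        """Return the cached record, or None on a cache miss."""
        return self._paths.get(dgid)
```

The sender of the request does not appear anywhere in the model, which
matches the point above: the cache behaves the same whether it is loaded
by a local daemon, a distributed agent, or the master SA.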

It may also be possible to extend such an interface for QoS purposes.

- Sean


From mshefty at ichips.intel.com  Wed Jul 18 13:54:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 13:54:35 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <469E4876.7020805@ichips.intel.com>
References: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com>
	<469E4876.7020805@ichips.intel.com>
Message-ID: <469E7E0B.7040703@ichips.intel.com>

> Based on discussions so far, maybe the best path forward from here is to
> delay until 2.6.24.  This will let us add this version to OFED 1.3 for
> more widespread testing, plus give us the time that we need to come up
> with a plan to integrate QoS with the local SA.

I spoke with Matt on this, and he agreed with this plan.

- Sean


From sashak at voltaire.com  Wed Jul 18 14:00:46 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 00:00:46 +0300
Subject: [ofa-general] management master git repository
Message-ID: <20070718210046.GH27878@sashak.voltaire.com>

Hi All,

Please note that, due to the maintainership transfer, the "master" of the OFA
management userspace tree (OpenSM, infiniband-diags) is now located at:

  git://git.openfabrics.org/~sashak/management

All OpenSM, Diags, libibumad and libibmad upstream changes will be
committed into this repo.

An 'ofed_1_2' branch exists too, but since there have been no changes in the
last few days it is identical to the one in ~halr/management.

I updated the OFA wiki pages accordingly.

Sasha


From jgunthorpe at obsidianresearch.com  Wed Jul 18 14:32:43 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 18 Jul 2007 15:32:43 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <469E7D89.7040809@ichips.intel.com>
References: <20070718050928.GA3103@obsidianresearch.com>
	<000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
	<20070718192750.GA8931@obsidianresearch.com>
	<469E7D89.7040809@ichips.intel.com>
Message-ID: <20070718213243.GZ13618@obsidianresearch.com>

On Wed, Jul 18, 2007 at 01:52:25PM -0700, Sean Hefty wrote:

> 1. Resolves IP addresses to DGIDs using ARP.  This results in IPoIB 
> querying the SA and caching 1 PR per DGID.
> 2. Apps query the SA for PRs, with 1 PR query per DGID.  Eventually 
> we'll get back the same set of PRs that IPoIB already had cached.
> 3. Establish the connections.  The IB CM stores the PR information with 
> each connection in order to set the QP attributes properly.

So, since you flush the cache for your MPI jobs the gain you see is
basically by re-using the data collected by ipoib?

If this is the case, do you get the same first-order benefit by
essentially using the ipoib cache for all PR queries?

> >I personally think a simple in-kernel (small) fast lookup cache merged
> >with the ipoib cache that has a netlink interface to userspace to
> >add/delete/flush entries is a very good solution that will keep being
> >useful in future. netlink would also carry cache miss queries to
> >userspace. In the absence of a daemon the kernel could query on its own
> >but cache very conservatively. A userspace version of the very
> >aggressive cache you have now could also be created right away.
> 
> I believe that the PR caching should be done outside of IPoIB.  Other 
> paths may exist that IPoIB does not use.

When I said merged, I was thinking of eliminating the ipoib cache
component and using your new module. There doesn't seem to be much sense in
caching twice, especially since ipoib already lacks anything to keep
the cache coherent with the SA - and that is where the main work
is. One PR cache in the kernel, and it would be in roughly the
same architectural spot as your local sa module.

> >This is because I firmly do not believe in caching as a solution to the
> >scalability problems. It must be solved with some level of replication
> >and distribution of the SA data and algorithms.
> 
> PR caching *is* replication of the SA data.  The local SA works with all 
> existing SAs.  It is not tied to one vendor, nor does it require changes 
> to the SAs.  Sure, we can define vendor specific protocols to assist 
> with/optimize synchronization, but I don't believe it is necessary in an 
> initial submission.  (In fact I think it's undesirable at this point, 
> since it would require changes to the SA.)

Ok, so I draw a distinction between caching _some_ final end products (i.e.
PRs) without any coherency to the original source data, and coherently
replicating enough data to compute _any_ query on demand.

Some vs All and Pull vs Push.

I'm trying to say, I think a simple kernel cache itself is fine, but
there should be only 1 cache (get rid of ipoib) and it should have a
really good interface to userspace so that the really hard problems
can be solved through user space code.

I'm not suggesting cache-to-SA vendor-specific hooks or anything like
that, just a well-defined kernel module that lets user space
co-opt path resolution, which needs to include a kernel cache
component due to ipoib.

Jason


From rdreier at cisco.com  Wed Jul 18 15:52:57 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 15:52:57 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adaodi9iat2.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get another batch of changes for 2.6.23, including the
beginnings of cleaning up the work request posting code in mthca and
mlx4:

Dotan Barak (2):
      IB/mlx4: Take sizeof the correct pointer in call to memset()
      RDMA/cma: Remove local write permission from QP access flags

Hoang-Nam Nguyen (7):
      IB/ehca: Fix memory leak in error path of ehca_get_dma_mr()
      IB/ehca: Use common error code mapping instead of specific ones
      IB/ehca: Use #define for "pages per register_rpage" instead of hardcoded value
      IB/ehca: Use macro to calculate number of chunks in a mem block
      IB/ehca: MR/MW structure refactoring
      IB/ehca: Restructure ehca_set_pagebuf()
      IB/ehca: Fix warnings issued by checkpatch.pl

Jack Morgenstein (4):
      IB/mlx4: Fix flow label returned from query QP
      IB/mlx4: Fix port returned from query QP for QPs in INIT state
      mlx4_core: Reset device when internal error is detected
      IB/mlx4: Increase max outstanding RDMA reads as target

Joachim Fenkes (1):
      IB/ehca: Fix HW level autodetection

Roland Dreier (14):
      IB/mthca: Schedule MSI support for removal
      IB/mthca: Fix printk format used for firmware version in warning
      IB/iser: Make a couple of functions static
      IB/ipath: Make a few functions static
      IB/ipath: Remove ipath_get_user_pages_nocopy()
      IB/cm: Make internal function cm_get_ack_delay() static
      IB/mthca: Use uninitialized_var() for f0
      IB/mlx4: Return receive queue sizes for userspace QPs from query QP
      IB/mthca: Factor out setting WQE data segment entries
      IB/mlx4: Factor out setting WQE data segment entries
      IB/mlx4: Factor out setting other WQE segments
      IB/mthca: Factor out setting WQE remote address and atomic segment entries
      IB/mthca: Factor out setting WQE UD segment entries
      IB/mthca: Simplify use of size0 in work request posting

Steve Wise (1):
      RDMA/cxgb3: Remove cm_id reference on listen failures

 Documentation/feature-removal-schedule.txt        |   10 +
 drivers/infiniband/core/cm.c                      |    2 +-
 drivers/infiniband/core/cma.c                     |    2 +-
 drivers/infiniband/hw/cxgb3/iwch_cm.c             |    1 +
 drivers/infiniband/hw/ehca/ehca_av.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_classes.h         |   54 +-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |  156 ++--
 drivers/infiniband/hw/ehca/ehca_cq.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_eq.c              |    3 +-
 drivers/infiniband/hw/ehca/ehca_hca.c             |   28 +-
 drivers/infiniband/hw/ehca/ehca_irq.c             |   56 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |    7 +-
 drivers/infiniband/hw/ehca/ehca_main.c            |   50 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c            | 1087 ++++++++-------------
 drivers/infiniband/hw/ehca/ehca_mrmw.h            |   21 +-
 drivers/infiniband/hw/ehca/ehca_qes.h             |   22 +-
 drivers/infiniband/hw/ehca/ehca_qp.c              |   39 +-
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   15 +-
 drivers/infiniband/hw/ehca/ehca_tools.h           |   31 +-
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |   10 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |    8 +-
 drivers/infiniband/hw/ehca/hcp_phyp.c             |    2 +-
 drivers/infiniband/hw/ehca/hipz_fns_core.h        |    4 +-
 drivers/infiniband/hw/ehca/hipz_hw.h              |   24 +-
 drivers/infiniband/hw/ehca/ipz_pt_fn.c            |    2 +-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h            |    4 +-
 drivers/infiniband/hw/ipath/ipath_driver.c        |    2 +-
 drivers/infiniband/hw/ipath/ipath_eeprom.c        |    4 +-
 drivers/infiniband/hw/ipath/ipath_intr.c          |    2 +-
 drivers/infiniband/hw/ipath/ipath_kernel.h        |    2 -
 drivers/infiniband/hw/ipath/ipath_ruc.c           |    2 +-
 drivers/infiniband/hw/ipath/ipath_user_pages.c    |   26 -
 drivers/infiniband/hw/ipath/ipath_verbs.c         |    2 +-
 drivers/infiniband/hw/ipath/ipath_verbs.h         |    4 -
 drivers/infiniband/hw/mlx4/qp.c                   |  115 ++-
 drivers/infiniband/hw/mthca/mthca_main.c          |   22 +-
 drivers/infiniband/hw/mthca/mthca_qp.c            |  221 ++---
 drivers/infiniband/hw/mthca/mthca_srq.c           |   28 +-
 drivers/infiniband/hw/mthca/mthca_wqe.h           |   15 +
 drivers/infiniband/ulp/iser/iscsi_iser.h          |    5 -
 drivers/infiniband/ulp/iser/iser_memory.c         |    4 +-
 drivers/infiniband/ulp/iser/iser_verbs.c          |   47 +-
 drivers/net/mlx4/catas.c                          |  106 ++-
 drivers/net/mlx4/eq.c                             |   56 +-
 drivers/net/mlx4/intf.c                           |    2 +
 drivers/net/mlx4/main.c                           |   26 +-
 drivers/net/mlx4/mlx4.h                           |   13 +-
 47 files changed, 1055 insertions(+), 1291 deletions(-)


From sean.hefty at intel.com  Wed Jul 18 15:53:36 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 15:53:36 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <20070718213243.GZ13618@obsidianresearch.com>
Message-ID: <000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com>

>So, since you flush the cache for your MPI jobs the gain you see is
>basically by re-using the data collected by ipoib?

I need to correct what I said before about our MPI jobs.  On our production
clusters, we're using the local SA in OFED 1.2, which updates automatically on a
timer.  This patch removes the timer updates and instead gives control of the
update policy to a user space app.  The local SA sits beneath the existing ib_sa
interface, and would have the PR data available when ipoib requests it.

>If this is the case, do you get the same first-order benefit by
>essentially using the ipoib cache for all PR queries?

There are a couple of benefits.  The number of PR queries is reduced from O(n^2)
to O(n).  The queries can also be done once up front, even started at different
times if needed, rather than all at once at job startup.  The jobs are also able
to make progress even if the SA dies or is unreachable.
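
As a rough illustration of the scaling difference, assume an all-to-all
job on n nodes where, without the cache, every node sends one PR query
per remote node, and with the local SA every node sends a single
GetTable-style query. The function names below are made up for this
sketch (Python), and the model ignores retries and any duplicate queries
from ipoib.

```python
def pr_queries_per_dgid(n_nodes):
    """Total SA queries when every node queries once per remote node."""
    return n_nodes * (n_nodes - 1)


def pr_queries_get_table(n_nodes):
    """Total SA queries when every node issues one table query."""
    return n_nodes


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, pr_queries_per_dgid(n), pr_queries_get_table(n))
```

For 1024 nodes this is 1024 * 1023 = 1047552 individual queries versus
1024 table queries under these assumptions.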

>I'm trying to say, I think a simple kernel cache itself is fine, but
>there should be only 1 cache (get rid of ipoib) and it should have a
>really good interface to userspace so that the really hard problems
>can be solved through user space code.

I don't disagree, but (for now anyway) I believe that the natural interface for
communicating with an SA related agent is a MAD interface based on the SA
management class for the reasons I mentioned earlier.  But this is really
talking about extensions to the local SA patch, rather than addressing anything
fundamentally wrong with the current patch set.

- Sean


From rdreier at cisco.com  Wed Jul 18 16:11:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 16:11:15 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <469E4CA2.2040708@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Wed, 18 Jul 2007 10:23:46 -0700")
References: <469E4CA2.2040708@linux.vnet.ibm.com>
Message-ID: <adahco1i9yk.fsf@cisco.com>

There are still some rather obvious problems with this patch.  I think it
would really help if you read over your patch again... anyway:

 > +#define CM_PACKET_SIZE (1ul << 16)

This duplicates IPOIB_CM_MTU I think... certainly it needs to be kept
in sync with it somehow.

 > @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long
 >  	dev_kfree_skb_any(skb);
 >  }
 >  
 > -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)

Why is this change here?  (This is in the CONFIG_INFINIBAND_IPOIB_CM=n
part of ipoib.h)

 >  }
 > -
 >  #endif

Please try to avoid adding extraneous noise to your patch... it makes
it harder to focus on the real content.

 > +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
 > +int max_recv_buf = 1024; /* Default is 1024 MB */
 > +
 > +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
 > +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
 > +
 > +module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
 > +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
 > +
 > +atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */

everything here can be static I think ("make namespacecheck" might be
worth running).  And you can use ATOMIC_INIT() instead of putting the
initialization into code.

 > -		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
 > +		ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret);

extra noise here (and still wrong -- id might be long long on some
architectures).

 > -		.event_handler = ipoib_cm_rx_event_handler,

why?  seems harmless to just leave this alone for all QPs even if an
SRQ isn't attached.

 > +	recv_mem_used = (u64)ipoib_recvq_size *
 > +			(u64)atomic_inc_return(&current_rc_qp)
 > +			* CM_PACKET_SIZE; /* packets are 64K */

packets might not always be 64K ... just let CM_PACKET_SIZE document
itself (or pick a better name if you think it needs to be clearer).

 > +	if ((index == max_rc_qp) ||
 > +	( recv_mem_used >= max_recv_buf * (1ul << 20))) {

formatting went awry here...
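
For reference, the arithmetic behind that limit check can be sketched as
follows. This is a Python illustration, not the patch itself:
CM_PACKET_SIZE (1 << 16) and the 1024 MB default are taken from the
quoted hunks, while the receive queue size of 128 used in the comments is
only an assumed example value.

```python
CM_PACKET_SIZE = 1 << 16         # from the quoted #define


def recv_mem_used(recvq_size, active_qps, packet_size=CM_PACKET_SIZE):
    """Bytes of receive buffer needed for the given number of RC QPs."""
    return recvq_size * active_qps * packet_size


def over_limit(recvq_size, active_qps, max_recv_buf_mb,
               packet_size=CM_PACKET_SIZE):
    """Mirror of the patch's test: used >= max_recv_buf * (1 << 20)."""
    used = recv_mem_used(recvq_size, active_qps, packet_size)
    return used >= max_recv_buf_mb * (1 << 20)


# With an assumed recvq_size of 128, each QP needs 128 * 64 KiB = 8 MiB,
# so a 1024 MB limit is hit when the 128th QP is counted.
```

Passing the packet size as a parameter keeps the calculation correct if
packets are not 64K, which is the reason for letting the constant
document itself.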

 > +		spin_unlock_irq(&priv->lock);
 > +		ipoib_warn(priv, "NOSRQ has reached the configurable limit "
 > +			   "of either %d RC QPs or, max recv buf size of "
 > +			   "0x%x MB\n", max_rc_qp, max_recv_buf);

 > +		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
 > +		ret = -EINVAL;
 > +		goto err_alloc_and_post;

there's a bug here... you never undo the atomic_inc() of the number of
RC QPs even though you exit without creating a new connection.
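
The usual shape of the fix is to undo the increment on every path that
rejects the connection. A minimal sketch of that pattern, written in
Python with a lock standing in for the kernel's atomic_t; the QpCounter
class and its method names are hypothetical, and the simple "count above
limit" test is a simplification of the patch's checks:

```python
import threading


class QpCounter:
    """Counts active RC QPs and enforces an upper limit."""

    def __init__(self, limit):
        self._limit = limit
        self._count = 0
        self._lock = threading.Lock()

    @property
    def count(self):
        with self._lock:
            return self._count

    def try_acquire(self):
        """Increment first, then roll back if the limit is exceeded."""
        with self._lock:
            self._count += 1
            if self._count > self._limit:
                self._count -= 1     # undo the increment on the reject path
                return False
            return True

    def release(self):
        """Called when a connection that was accepted goes away."""
        with self._lock:
            self._count -= 1
```

Without the rollback, every rejected REQ would permanently consume one
slot and the limit would eventually reject all new connections.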

 > -	if (!p)
 > +	if (!p) {
 > +		printk(KERN_WARNING "Failed to allocate RX control block when "
 > +		       "REQ arrived\n");
 >  		return -ENOMEM;
 > +	}

more unrelated changes... (feel free to send these as separate
patches)

 >  		kfree(p);
 >  	}
 >  
 > +
 >  	cancel_delayed_work(&priv->cm.stale_task);
 >  }

extra noise in the patch

 > +		if (!priv->cm.srq) {
 > +			atomic_dec(&current_rc_qp);
 > +		}

no need for { } here

 > +	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
 > +	 * overflow. Every new REQ creates a new RX QP and each QP has an
 > +	 * RX ring associated with it. Therefore we could have
 > +	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
 > +	 * in a CQ.
 > +	 */
 > +	if (!priv->cm.srq)
 > +		size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size;

only need to do this if CM is enabled

space after - here please too.

that's just from a quick skim of the patch...


From akepner at sgi.com  Wed Jul 18 16:22:32 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 18 Jul 2007 16:22:32 -0700
Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <adair8kwajz.fsf@cisco.com>
References: <20070715212445.GG6921@sgi.com> <adair8kwajz.fsf@cisco.com>
Message-ID: <20070718232232.GQ16538@sgi.com>

On Mon, Jul 16, 2007 at 09:57:52AM -0700, Roland Dreier wrote:

> Looks reasonable but I would prefer to see explicit tests of the abi
> version so that we use the old register MR ABI for old kernels rather
> than unconditionally passing the extra parameter.

How about the following?

This is somewhat untidy, in that the abi_version is exposed to 
verbs.c, but it seemed the best way to go.
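
In outline, the version handling is meant to work like the sketch below.
It is a Python model for illustration only: the constants come from the
patch, but the function names and the dict standing in for the extended
command are assumptions, and the real code selects a command size for
ibv_cmd_reg_mr() rather than returning a dict.

```python
MTHCA_UVERBS_MIN_ABI_VERSION = 1
MTHCA_UVERBS_MAX_ABI_VERSION = 2
MTHCA_MR_DMAFLUSH = 0x1


def abi_supported(abi_version):
    """True if the kernel ABI version is within the supported range."""
    return (MTHCA_UVERBS_MIN_ABI_VERSION <= abi_version
            <= MTHCA_UVERBS_MAX_ABI_VERSION)


def reg_mr_extra_fields(abi_version, dmaflush):
    """Return the driver-specific fields for a register MR command.

    Old kernels (ABI version 1) get None, meaning only the generic
    command is sent; newer kernels get the extra attribute word.
    """
    if not abi_supported(abi_version):
        raise ValueError("unsupported ABI version %d" % abi_version)
    if abi_version > 1:
        # Both fields are set explicitly so nothing is left uninitialized.
        return {"mr_attrs": MTHCA_MR_DMAFLUSH if dmaflush else 0,
                "reserved": 0}
    return None
```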

 mthca-abi.h |   11 ++++++++++-
 mthca.c     |   19 +++++++++++++------
 verbs.c     |   29 ++++++++++++++++++++---------
 3 files changed, 43 insertions(+), 16 deletions(-)
-- 

diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h
--- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h	2007-06-23 02:00:34.000000000 -0700
+++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h	2007-07-18 10:58:07.903823741 -0700
@@ -36,7 +36,8 @@
 
 #include <infiniband/kern-abi.h>
 
-#define MTHCA_UVERBS_ABI_VERSION	1
+#define MTHCA_UVERBS_MIN_ABI_VERSION	1
+#define MTHCA_UVERBS_MAX_ABI_VERSION	2
 
 struct mthca_alloc_ucontext_resp {
 	struct ibv_get_context_resp	ibv_resp;
@@ -50,6 +51,14 @@ struct mthca_alloc_pd_resp {
 	__u32				reserved;
 };
 
+struct mthca_reg_mr_abi_ver_2 {
+	struct ibv_reg_mr		ibv_cmd;
+	__u32				mr_attrs;
+#define MTHCA_MR_DMAFLUSH		0x1 
+/* flush in-flight DMA on a write to memory region (IA64_SGI_SN2 only) */
+	__u32				reserved;
+};
+
 struct mthca_create_cq {
 	struct ibv_create_cq		ibv_cmd;
 	__u32				lkey;
diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca.c
--- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca.c	2007-06-23 02:00:34.000000000 -0700
+++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca.c	2007-07-18 15:50:07.174842760 -0700
@@ -56,6 +56,8 @@
 #include "mthca.h"
 #include "mthca-abi.h"
 
+int abi_ver = 0;
+
 #ifndef PCI_VENDOR_ID_MELLANOX
 #define PCI_VENDOR_ID_MELLANOX			0x15b3
 #endif
@@ -282,11 +284,16 @@ static struct ibv_device *mthca_driver_i
 	return NULL;
 
 found:
-	if (abi_version > MTHCA_UVERBS_ABI_VERSION) {
-		fprintf(stderr, PFX "Fatal: ABI version %d of %s is too new (expected %d)\n",
-			abi_version, uverbs_sys_path, MTHCA_UVERBS_ABI_VERSION);
+	if (abi_version < MTHCA_UVERBS_MIN_ABI_VERSION ||
+	    abi_version > MTHCA_UVERBS_MAX_ABI_VERSION) {
+		fprintf(stderr, PFX "Fatal: ABI version %d of %s is not supported "
+			"(min supported %d, max supported %d)\n",
+			abi_version, uverbs_sys_path, 
+			MTHCA_UVERBS_MIN_ABI_VERSION, 
+			MTHCA_UVERBS_MAX_ABI_VERSION);
 		return NULL;
 	}
+	abi_ver = abi_version;
 
 	dev = malloc(sizeof *dev);
 	if (!dev) {
@@ -314,13 +321,13 @@ static __attribute__((constructor)) void
  */
 struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev)
 {
-	int abi_ver = 0;
+	int abi_version = 0;
 	char value[8];
 
 	if (ibv_read_sysfs_file(sysdev->path, "abi_version",
 				value, sizeof value) > 0)
-		abi_ver = strtol(value, NULL, 10);
+		abi_version = strtol(value, NULL, 10);
 
-	return mthca_driver_init(sysdev->path, abi_ver);
+	return mthca_driver_init(sysdev->path, abi_version);
 }
 #endif /* HAVE_IBV_REGISTER_DRIVER */
diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c
--- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c	2007-06-23 02:00:34.000000000 -0700
+++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c	2007-07-18 15:43:13.230506881 -0700
@@ -45,6 +45,8 @@
 #include "mthca.h"
 #include "mthca-abi.h"
 
+extern int abi_ver;
+
 int mthca_query_device(struct ibv_context *context, struct ibv_device_attr *attr)
 {
 	struct ibv_query_device cmd;
@@ -117,26 +119,35 @@ int mthca_free_pd(struct ibv_pd *pd)
 
 static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr,
 				     size_t length, uint64_t hca_va,
-				     enum ibv_access_flags access)
+				     enum ibv_access_flags access, 
+				     int dmaflush)
 {
 	struct ibv_mr *mr;
-	struct ibv_reg_mr cmd;
+	struct mthca_reg_mr_abi_ver_2 cmd;
+	size_t cmd_size;
 	int ret;
 
 	mr = malloc(sizeof *mr);
 	if (!mr)
 		return NULL;
 
+	if (abi_ver > 1) {
+		cmd.mr_attrs = dmaflush ? MTHCA_MR_DMAFLUSH : 0;
+		cmd_size = sizeof(struct mthca_reg_mr_abi_ver_2);
+	} else 
+		cmd_size = sizeof(struct ibv_reg_mr);
+
 #ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS
 	{
 		struct ibv_reg_mr_resp resp;
 
 		ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-				     &cmd, sizeof cmd, &resp, sizeof resp);
+				     &cmd.ibv_cmd, cmd_size, &resp, 
+				     sizeof resp);
 	}
 #else
 	ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr,
-			     &cmd, sizeof cmd);
+			     &cmd.ibv_cmd, cmd_size);
 #endif
 	if (ret) {
 		free(mr);
@@ -149,7 +160,7 @@ static struct ibv_mr *__mthca_reg_mr(str
 struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr,
 			    size_t length, enum ibv_access_flags access)
 {
-	return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access);
+	return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0);
 }
 
 int mthca_dereg_mr(struct ibv_mr *mr)
@@ -202,7 +213,7 @@ struct ibv_cq *mthca_create_cq(struct ib
 
 	cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf,
 				cqe * MTHCA_CQ_ENTRY_SIZE,
-				0, IBV_ACCESS_LOCAL_WRITE);
+				0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!cq->mr)
 		goto err_buf;
 
@@ -294,7 +305,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq,
 
 	mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf,
 			    cqe * MTHCA_CQ_ENTRY_SIZE,
-			    0, IBV_ACCESS_LOCAL_WRITE);
+			    0, IBV_ACCESS_LOCAL_WRITE, 1);
 	if (!mr) {
 		mthca_free_buf(&buf);
 		ret = ENOMEM;
@@ -402,7 +413,7 @@ struct ibv_srq *mthca_create_srq(struct 
 	if (mthca_alloc_srq_buf(pd, &attr->attr, srq))
 		goto err;
 
-	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0);
+	srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0);
 	if (!srq->mr)
 		goto err_free;
 
@@ -520,7 +531,7 @@ struct ibv_qp *mthca_create_qp(struct ib
 	    pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE))
 		goto err_free;
 
-	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0);
+	qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0);
 	if (!qp->mr)
 		goto err_free;
 
-- 
Arthur


From sashak at voltaire.com  Wed Jul 18 16:29:47 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 02:29:47 +0300
Subject: [ewg] Re: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184729415.5165.570.camel@firewall.xsintricity.com>
References: <20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
	<20070717174526.GE7479@mellanox.co.il>
	<1184697799.5165.536.camel@firewall.xsintricity.com>
	<20070717202730.GA15990@mellanox.co.il> <ada3azmrcap.fsf@cisco.com>
	<20070717210935.GA17168@mellanox.co.il>
	<1184713907.5165.549.camel@firewall.xsintricity.com>
	<20070718021854.GD19243@mellanox.co.il>
	<1184729415.5165.570.camel@firewall.xsintricity.com>
Message-ID: <20070718232947.GM27878@sashak.voltaire.com>

Hi Doug,

On 23:30 Tue 17 Jul     , Doug Ledford wrote:
> 
> For reference, I'll attach the updated script I made for spitting out a
> buildable tarball.

Small comment about the script.

> 
> Hehehe...resending because the ofa list server ate my message due to the
> script attachment :-D  I'll inline it instead.
> 
> I guess I'll also mention that this script exists in my ~/repos/upstream
> directory, and also in that directory are all the git repos that I have
> cloned from ofa (as well as other places).  So, it's one level above all
> the various git clones and spits everything out into dist/.  The easiest
> way to use this script for any given package you want to create a daily
> snapshot of is to run ./make.dist repodir daily; scp
> dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads.  That
> simple action would (assuming you create a reasonable reponame.spec.in
> file in the repos that are missing one) spit out a tarball that can be
> passed directly to rpmbuild --rebuild reponame-git.tgz and rpm will spit
> out the packages, and the repodir-daily.HEAD file shows the HEAD of the
> git repo so you know exactly what state the tarball represents and you
> can always get to it in another more recent repo by just updating to
> that commit as head of tree.
> 
> #!/bin/bash
> 
> usage() {
> echo "$0 repo daily | release [ signed | <key-id> ]"
> echo
> echo "	You must specify the repo to make a distribution tarball in.  This"
> echo "script will not work with complex repos like the management repo that"
> echo "builds more than one package.  It expects a repo to be a single package"
> echo "repo where the directory name and the package name are the same, and"
> echo "where a properly formatted reponame.spec.in file exists."
> echo
> echo "	You must specify either release or daily in order for this script"
> echo "to make tarballs.  If this is a daily release, the tarballs will"
> echo "be named <component>-git.tgz and will overwrite existing tarballs."
> echo "If this is a release build, then the tarball will be named"
> echo "<component>-<version>.tgz and must be a new file.  In addition,"
> echo "the script will add a new set of symbolic tags to the git repo"
> echo "that correspond to the <component>-<version> of each tarball."
> echo
> echo "	If the script detects that the tag on any component already exists,"
> echo "it will abort the release and prompt you to update the version on"
> echo "the already tagged component.  This enforces the proper behavior of"
> echo "treating any released tarball as set in stone so that in the future"
> echo "you will always be able to get to any given release tarball by"
> echo "checking out the git tag and know with certainty that it is the same"
> echo "code as released before even if you no longer have the same tarball"
> echo "around."
> echo
> echo "	As part of this process, the script will parse the <target>.spec.in"
> echo "file and output a <target>.spec file.  Since this script isn't smart"
> echo "enough to deal with other random changes that should have their own" 
> echo "checkin the script will refuse to run if the current repo state is not"
> echo "clean."
> echo
> echo "	NOTE: the script has no clue if you are tagging on the right branch,"
> echo "it will however show you the git branch output so you can confirm it"
> echo "is on the right branch before proceeding with the release."
> echo
> echo "	In addition to just tagging the git repo, whenever creating a release"
> echo "there is an optional argument of either signed or a hex gpg key-id."
> echo "If you do not pass an argument to release, then the tag will be a"
> echo "simple git annotated tag.  If you pass signed as the argument, the"
> echo "git tag operation will use your default signing key to sign the tag."
> echo "Or you can pass an actual gpg key id in hex format and git will sign"
> echo "the tag with that key."
> echo 
> }
> 
> if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi
> 
> if [ ! -d "$1" ]; then usage; exit 1; fi
> 
> TMPDIR=dist
> if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi
> 
> if [ "$2" = "daily" -o "$2" = "release" ]; then
> 	if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
> 		touch $TMPDIR/$1-$2.HEAD
> 	fi
> 	NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
> else
> 	usage
> 	exit 1
> fi
> 
> cd "$1"
> echo "Updating git repo..."
> git pull
> RESULT=$?
> HEAD=`git log --pretty=oneline -1`
> 
> if [ "$RESULT" -ne 0 ]; then
> 	echo "Failed to update the git repo cleanly, manual intervention required"
> 	exit 1
> fi
> 
> if [ "$HEAD" = "$NEWHEAD" ]; then
> 	echo "No new commits since last tarball creation, nothing to do."
> 	cd ..
> 	exit 0
> fi
> 
> if [ "$2" = "release" ]; then
> 	# Is the repo clean?
> 	git status | grep modified > /dev/null 2>&1
> 	if [ $? = 0 ]; then
> 		echo "There are modified files in the repo.  Please check any"
> 		echo "changes in before proceeding."
> 		exit 4
> 	fi
> 	# Since we will be tagging things, make sure we are on the right
> 	# branch
> 	git branch
> 	echo -n "Is the active branch the right one to tag this release on [y/N]? "
> 	read answer
> 	if [ "$answer" = y -o "$answer" = Y ]; then
> 		echo "Proceeding..."
> 	else
> 		echo "Please check out the right branch and run make.dist again"
> 		exit 0
> 	fi
> 	# Check versions to make sure that we can proceed
> 	VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
> 	TARBALL=$1-$VERSION.tgz
> 	if [ -f ../$TMPDIR/$TARBALL ]; then
> 		echo "Target $TARBALL already exists, please update the version of"
> 		echo "$1"
> 		exit 2
> 	fi
> 	if [ ! -z "`git tag -l $1-$VERSION`" ]; then
> 		echo "A git tag already exists for $1-$VERSION.  Please change the version"
> 		echo "of $1 so a tag replacement won't occur."
> 		exit 3
> 	fi
> # On a real release, this resets the daily release starting point, on the
> # assumption that any new daily builds will have a version number that is
> # incrementally higher than the last officially released tarball.
> 	RELEASE=1
> 	echo $RELEASE > ../$TMPDIR/$1.release
> else
> 	DATE=`date +%Y%m%d`
> 	if [ -f ../$TMPDIR/$1.release ]; then
> 		RELEASE=`cat ../$TMPDIR/$1.release`
> 		RELEASE=`expr $RELEASE + 1`
> 	else
> 		RELEASE=1
> 	fi
> 	echo $RELEASE > ../$TMPDIR/$1.release
> 	RELEASE=0.${RELEASE}.${DATE}git
> 	TARBALL=$1-git.tgz
> fi
> 
> cd ..
> cp -a $1 $1-$VERSION

Instead of copying, git-archive could be used. Something like this:

  GIT_DIR=$1/.git git-archive --format=tar --prefix=$1-$VERSION/ HEAD | tar xf -

The advantage is that the tree does not need to be clean, and files
generated by a previous build will not be part of the tarball (without
resorting to aggressive git-clean modes). Local modifications to source
files will be ignored as well.

I think this could be useful when the tarball is generated by a
maintainer from his/her working tree.

Sasha

> [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec
> if [ -f $1-$VERSION/autogen.sh ]; then
> 	cd $1-$VERSION
> 	./autogen.sh
> 	cd ..
> fi
> echo "Creating $TMPDIR/$TARBALL"
> tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION
> rm -rf $1-$VERSION
> echo "$HEAD" > $TMPDIR/$1-$2.HEAD
> 
> if [ $2 = release ]; then
> 	echo "Tagging release."
> 	cd $1
> 	if [ ! -z "$3" ]; then
> 		if [ $3 = "signed" ]; then
> 			git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 		else
> 			git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 		fi
> 	else
> 		git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
> 	fi
> 	cd ..
> fi
> 
> -- 
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
> 
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband




From sashak at voltaire.com  Wed Jul 18 16:42:03 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 02:42:03 +0300
Subject: [ofa-general] management.git spec files and ./autogen.sh
Message-ID: <20070718234203.GN27878@sashak.voltaire.com>

Hi Doug,

For all management tarballs, ./autogen.sh is called during generation (by
make.dist). Is there any reason to call ./autogen.sh again in the %build
section of the spec file (this is common to all *.spec.in files)? And, as
a result, to keep autoconf and automake in the BuildRequires: list?

Sasha


From jgunthorpe at obsidianresearch.com  Wed Jul 18 16:40:38 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 18 Jul 2007 17:40:38 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com>
References: <20070718213243.GZ13618@obsidianresearch.com>
	<000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com>
Message-ID: <20070718234038.GB13618@obsidianresearch.com>

On Wed, Jul 18, 2007 at 03:53:36PM -0700, Sean Hefty wrote:

> There are a couple of benefits.  The number of PR queries is reduced
> from O(n^2) to O(n).  The queries can also be done once up front,
> even started at different times if needed, rather than all at once
> at job startup.  The jobs are also able to make progress even if the
> SA dies or is unreachable.

Do you mean each node changes from O(local_cpus*nodes) -> O(nodes) ?
Globally, from cold cache start you should still be O(n^2)?

> >I'm trying to say, I think a simple kernel cache itself is fine, but
> >there should be only 1 cache (get rid of ipoib) and it should have a
> >really good interface to userspace so that the really hard problems
> >can be solved through user space code.
> 
> I don't disagree, but (for now anyway) I believe that the natural
> interface for communicating with an SA related agent is a MAD
> interface based on the SA management class for the reasons I
> mentioned earlier.  But this is really talking about extensions to
> the local SA patch, rather than addressing anything fundamentally
> wrong with the current patch set.

OK - that's fine then. When you get around to doing the user space side
I'll argue for netlink :) Having written both netlink user space code
and mad code, I can say netlink is way better!

The only other thing I'd say is that, for the cache to be on by default
(i.e. included by default in distro kernels), it really needs a short
default lifetime for cached entries as a workaround in place of a
coherence protocol.

Jason


From sean.hefty at intel.com  Wed Jul 18 17:12:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 18 Jul 2007 17:12:35 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <20070718234038.GB13618@obsidianresearch.com>
Message-ID: <000201c7c999$823243e0$ff0da8c0@amr.corp.intel.com>

>> There are a couple of benefits.  The number of PR queries is reduced
>> from O(n^2) to O(n).  The queries can also be done once up front,
>> even started at different times if needed, rather than all at once
>> at job startup.  The jobs are also able to make progress even if the
>> SA dies or is unreachable.
>
>Do you mean each node changes from O(local_cpus*nodes) -> O(nodes) ?
>Globally, from cold cache start you should still be O(n^2)?

Each node goes from O(processes * nodes) -> O(1).  The local SA does a single
GetTable query to obtain all PRs, whereas applications do one PR query for
each connection.
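
As a rough illustration of those counts, here is a minimal sketch in C. The
function names are made up for this example, and the model assumes every
process opens one connection to each of the nodes in the job; neither is
taken from the local SA patch itself.

```c
#include <stdint.h>

/* Assumed model: each process on a node opens one connection per peer
 * node, and each connection costs one PR query to the SA. */
uint64_t demo_pr_queries_without_cache(uint64_t processes, uint64_t nodes)
{
	return processes * nodes;
}

/* With a local SA the node issues one GetTable query, regardless of
 * how many processes or connections it hosts. */
uint64_t demo_pr_queries_with_local_sa(void)
{
	return 1;
}

/* Fabric-wide query count: the per-node count times the node count. */
uint64_t demo_fabric_total(uint64_t per_node, uint64_t nodes)
{
	return per_node * nodes;
}
```

Under that model, 1000 nodes with 8 processes each would issue 8000 queries
per node without the local SA, against one query per node with it.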

>OK - thats fine then. When you get around to doing the user space side
>I'll argue for netlink :) Having written both netlink user space code
>and mad code, I can say netlink is way better!

We can thumb wrestle.  (I would never argue that the IB MAD interface is great.)
I'm suggesting that we want an interface that allows an application running on a
remote node to control local SA policy, and that the message format should be
similar to SA MADs.

My hope is that we can create an interface that will be usable for QoS purposes
as well.  I will start an open thread on this once the QoS is released, and I've
had time to think about more of the details.

- Sean 


From pradeeps at linux.vnet.ibm.com  Wed Jul 18 17:55:48 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Jul 2007 17:55:48 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <adahco1i9yk.fsf@cisco.com>
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
Message-ID: <469EB694.7040408@linux.vnet.ibm.com>

Roland Dreier wrote:
> There's still some rather obvious problems with this patch.  It would
> really help if you would read over your patch again I think... anyway:
> 
>  > +#define CM_PACKET_SIZE (1ul << 16)
> 
> This duplicates IPOIB_CM_MTU I think... certainly it needs to be kept
> in sync with it somehow.

They are not quite the same. How about:
#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE))

This should keep the two in sync.
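
To make the arithmetic concrete, here is a small userspace sketch of the
round-up that ALIGN() performs. The helper name and both constants are
assumptions for illustration (an MTU of 0x10000 - 0x10 and 4 KiB pages); the
real values come from ipoib.h and the architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; both are assumptions for this sketch. */
#define DEMO_IPOIB_CM_MTU 65520u /* assumed: 0x10000 - 0x10 */
#define DEMO_PAGE_SIZE    4096u  /* assumed: 4 KiB pages */

/* Round x up to the next multiple of a. As with the kernel's ALIGN(),
 * a must be a power of two. */
size_t demo_align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Packet size derived from the MTU, so the two cannot drift apart. */
size_t demo_cm_packet_size(void)
{
	return demo_align_up(DEMO_IPOIB_CM_MTU, DEMO_PAGE_SIZE);
}
```

With these assumed values the result is 65536, the same number the original
(1ul << 16) definition produced.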

> 
>  > @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long
>  >  	dev_kfree_skb_any(skb);
>  >  }
>  >  
>  > -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>  > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
> 
> Why is this change here?  (This is in the CONFIG_INFINIBAND_IPOIB_CM=n
> part of ipoib.h)
> 
>  >  }
>  > -
>  >  #endif

Will do

> 
>  > -		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
>  > +		ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret);
> 
> extra noise here (and still wrong -- id might be long long on some
> architectures).

Correct, it should have been %lld
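
One common way to keep such a format string correct on every architecture
is to cast the value explicitly to a long long type. The sketch below is a
userspace stand-in, not the driver code; the function name is hypothetical
and snprintf() takes the place of ipoib_warn().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit buffer id portably. The cast guarantees that the
 * argument matches the %llu conversion whatever underlying type the
 * platform uses for uint64_t. */
int demo_format_post_failure(char *buf, size_t len, uint64_t id, int ret)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "post srq failed for buf %llu (%d)",
			(unsigned long long)id, ret);
}
```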

> 
>  > -		.event_handler = ipoib_cm_rx_event_handler,
> 
> why?  seems harmless to just leave this alone for all QPs even if an
> SRQ isn't attached.
> 

If memory serves me right, I tried that and ran into some inexplicable problems.
Maybe it was a hang, or no traffic went through; I don't recollect exactly what
it was. After this change the problem went away.


> 
>  > +		spin_unlock_irq(&priv->lock);
>  > +		ipoib_warn(priv, "NOSRQ has reached the configurable limit "
>  > +			   "of either %d RC QPs or, max recv buf size of "
>  > +			   "0x%x MB\n", max_rc_qp, max_recv_buf);
> 
>  > +		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
>  > +		ret = -EINVAL;
>  > +		goto err_alloc_and_post;
> 
> there's a bug here... you never undo the atomic_inc() of the number of
> RC QPs even though you exit without creating a new connection.

The atomic_dec() does happen, but that is in ipoib_cm_req_handler(). There are
several places where allocate_and_post_rbuf_nosrq() could return an error after 
the atomic_inc(). So, there is an atomic_dec() in the calling routine. On the
other hand I could move that to allocate_and_post_rbuf_nosrq() itself.
 
> 
>  > -	if (!p)
>  > +	if (!p) {
>  > +		printk(KERN_WARNING "Failed to allocate RX control block when "
>  > +		       "REQ arrived\n");
>  >  		return -ENOMEM;
>  > +	}
> 
> more unrelated changes... (feel free to send these as separate
> patches)
> 

OK

>  >  		kfree(p);
>  >  	}
>  >  
>  > +
>  >  	cancel_delayed_work(&priv->cm.stale_task);
>  >  }
> 
> extra noise in the patch
> 
>  > +		if (!priv->cm.srq) {
>  > +			atomic_dec(&current_rc_qp);
>  > +		}
> 
> no need for { } here

OK

> 
>  > +	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
>  > +	 * overflow. Every new REQ creates a new RX QP and each QP has an
>  > +	 * RX ring associated with it. Therefore we could have
>  > +	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
>  > +	 * in a CQ.
>  > +	 */
>  > +	if (!priv->cm.srq)
>  > +		size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size;
> 
> only need to do this if CM is enabled
> 
> space after - here please too.
> 
> that's just from a quick skim of the patch...
>

OK


Pradeep


From kliteyn at mellanox.co.il  Wed Jul 18 21:45:53 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 19 Jul 2007 07:45:53 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-19:normal completion
Message-ID: <MTLEXCH01wSimD3JkUw00001a66@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From eitan at mellanox.co.il  Wed Jul 18 21:51:30 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 19 Jul 2007 07:51:30 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>

Oh, you're right. The Enh0 should get an update.
I thought I got it right. Do you want me to provide an updated patch?

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Wednesday, July 18, 2007 10:22 PM
> To: Eitan Zahavi
> Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> Subject: Re: [PATCH] opensm: Bug in coding trying to set 
> vl_arb_high_limit
> 
> Hi Eitan,
> 
> On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> > Hi Sasha
> > 
> > When QoS setup is done the code was trying to send updates of 
> > vl_arb_high_limit by req_set of PORT_INFO with the new data.
> > However, at that stage the SM still did not assign LIDs to 
> the ports.
> > So the sent PortInfo.base_lid was still zero. The 
> specification does 
> > not allow for such LIDs (they are considered ilegal).
> > 
> > the patch below fixes this by storing the calculated value 
> and later 
> > using it in link and lid managers.
> 
> Good, Thanks (and this also saves one PortInfo update MAD). 
> One question below:
> 
> 
> > 
> > Eitan
> > 
> > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > 
> 
> [snip...]
> 
> > diff --git a/opensm/opensm/osm_lid_mgr.c 
> b/opensm/opensm/osm_lid_mgr.c 
> > index bc3f8b3..ed76382 100644
> > --- a/opensm/opensm/osm_lid_mgr.c
> > +++ b/opensm/opensm/osm_lid_mgr.c
> > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
> >             ib_port_info_get_port_state(p_old_pi) )
> >          send_set = TRUE;
> >      }
> > +
> > +	 /* provide the vl_high_limit from the qos mgr */
> > +	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
> > +		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> > +		 {
> > +			 send_set = TRUE;
> > +			 p_pi->vl_high_limit = p_physp->vl_high_limit;
> > +		 }
> 
> This part of code is for port_num != 0, so VLHighLimit setup 
> will be skipped for switch enhanced port 0. Is it something 
> expected? If so why?
> 
> Sasha
> 


From mst at dev.mellanox.co.il  Wed Jul 18 21:58:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 07:58:41 +0300
Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070718232232.GQ16538@sgi.com>
References: <20070715212445.GG6921@sgi.com> <adair8kwajz.fsf@cisco.com>
	<20070718232232.GQ16538@sgi.com>
Message-ID: <20070719045840.GB30983@mellanox.co.il>

> Quoting akepner at sgi.com <akepner at sgi.com>:
> Subject: Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
>
> ...
>
> @@ -50,6 +51,14 @@ struct mthca_alloc_pd_resp {
>  	__u32				reserved;
>  };
>  
> +struct mthca_reg_mr_abi_ver_2 {
> +	struct ibv_reg_mr		ibv_cmd;
> +	__u32				mr_attrs;
> +#define MTHCA_MR_DMAFLUSH		0x1 
> +/* flush in-flight DMA on a write to memory region (IA64_SGI_SN2 only) */
> +	__u32				reserved;
> +};
> +
>  struct mthca_create_cq {
>  	struct ibv_create_cq		ibv_cmd;
>  	__u32				lkey;

Aren't there some unused bits in mr_attrs that we can use instead of
breaking the ABI?

-- 
MST


From pradeeps at linux.vnet.ibm.com  Wed Jul 18 22:15:58 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Jul 2007 22:15:58 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <469EB694.7040408@linux.vnet.ibm.com>
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
	<469EB694.7040408@linux.vnet.ibm.com>
Message-ID: <469EF38E.8000203@linux.vnet.ibm.com>


> 
>>  > +	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
>>  > +	 * overflow. Every new REQ creates a new RX QP and each QP has an
>>  > +	 * RX ring associated with it. Therefore we could have
>>  > +	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
>>  > +	 * in a CQ.
>>  > +	 */
>>  > +	if (!priv->cm.srq)
>>  > +		size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size;
>>
>> only need to do this if CM is enabled

This happens during init in ipoib_transport_dev_init(). However, at this
point IPOIB_FLAG_ADMIN_CM is not even set. So, it is not possible to do
this conditionally only if CM is enabled. Any suggestions?

Pradeep


From rdreier at cisco.com  Wed Jul 18 22:23:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 22:23:49 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <469EF38E.8000203@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Wed, 18 Jul 2007 22:15:58 -0700")
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
	<469EB694.7040408@linux.vnet.ibm.com>
	<469EF38E.8000203@linux.vnet.ibm.com>
Message-ID: <aday7hdge56.fsf@cisco.com>

 > This happens during init  in ipoib_transport_dev_init(). However, at this 
 > point IPOIB_FLAG_ADMIN_CM is not even set. So, it is not possible to do
 > this conditionally only if CM is enabled. Any suggestions?

I meant only do it if CONFIG_INFINIBAND_IPOIB_CM is set.
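
A compile-time guard along those lines might look like the sketch below.
Everything prefixed DEMO_ or demo_ is a hypothetical stand-in: DEMO_IPOIB_CM
plays the role of CONFIG_INFINIBAND_IPOIB_CM, and the table size of 128 is
an assumed value, not the one from the patch.

```c
/* Hypothetical stand-ins; set DEMO_IPOIB_CM to 0 to model a kernel
 * built without connected mode. */
#define DEMO_IPOIB_CM 1
#define DEMO_NOSRQ_INDEX_TABLE_SIZE 128 /* assumed value */

/* The base size covers one send ring and one receive ring. The extra
 * per-QP receive rings are counted only when CM support is compiled in
 * and no SRQ is available. */
int demo_cq_size(int sendq_size, int recvq_size, int have_srq)
{
	int size = sendq_size + recvq_size;

#if DEMO_IPOIB_CM
	if (!have_srq)
		size += (DEMO_NOSRQ_INDEX_TABLE_SIZE - 1) * recvq_size;
#else
	(void)have_srq;
#endif
	return size;
}
```

With the guard compiled out, the CQ keeps its base size no matter what the
runtime SRQ state is.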


From rdreier at cisco.com  Wed Jul 18 22:24:51 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 22:24:51 -0700
Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070719045840.GB30983@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 19 Jul 2007 07:58:41 +0300")
References: <20070715212445.GG6921@sgi.com> <adair8kwajz.fsf@cisco.com>
	<20070718232232.GQ16538@sgi.com>
	<20070719045840.GB30983@mellanox.co.il>
Message-ID: <adatzs1ge3g.fsf@cisco.com>

 > Aren't there some unused bits in mr_attrs that we can use instead of
 > breaking the ABI?

That seems pretty fragile to me.  Although maybe we could reserve a
block of bits for provider-private use or something...
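
Purely as a sketch of what reserving a block of bits could mean, the
fragment below splits a 32-bit attribute word into a core range and a
provider-private range. The 24/8 split, the macro names and the helpers are
all assumptions made up for this example; nothing like this exists in
libibverbs or libmthca.

```c
#include <stdint.h>

/* Hypothetical layout: the low 24 bits belong to the core, the top
 * 8 bits are left for the provider library to define. */
#define DEMO_MR_ATTR_PROVIDER_SHIFT 24
#define DEMO_MR_ATTR_PROVIDER_MASK  (UINT32_C(0xff) << DEMO_MR_ATTR_PROVIDER_SHIFT)
#define DEMO_MR_ATTR_CORE_MASK      (~DEMO_MR_ATTR_PROVIDER_MASK)

/* Example provider-private flag, such as a DMA-flush request. */
#define DEMO_MR_DMAFLUSH (UINT32_C(1) << DEMO_MR_ATTR_PROVIDER_SHIFT)

/* The core checks only its own range for unknown bits and passes the
 * provider range through untouched. */
int demo_core_attrs_valid(uint32_t attrs, uint32_t known_core_bits)
{
	return (attrs & DEMO_MR_ATTR_CORE_MASK & ~known_core_bits) == 0;
}

/* Extract the bits that the provider is free to interpret. */
uint32_t demo_provider_attrs(uint32_t attrs)
{
	return attrs & DEMO_MR_ATTR_PROVIDER_MASK;
}
```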


From rdreier at cisco.com  Wed Jul 18 22:28:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 18 Jul 2007 22:28:10 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <469EB694.7040408@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Wed, 18 Jul 2007 17:55:48 -0700")
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
	<469EB694.7040408@linux.vnet.ibm.com>
Message-ID: <adaps2pgdxx.fsf@cisco.com>

 > They are not quite the same. How about:
 > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE))

That makes sense.

 > >  > -		.event_handler = ipoib_cm_rx_event_handler,
 > > 
 > > why?  seems harmless to just leave this alone for all QPs even if an
 > > SRQ isn't attached.
 > 
 > If memory serves me right, I tried that and ran into some inexplicable problems.
 > Maybe it was hang or no traffic went through -don't exactly recollect what it was.
 > After this change the problem went away.

Umm... I would like to get to the root cause of that.  Because as far
as I can see there is no problem if the event handler is called for a
non-SRQ QP.  The event will never be "last WQE reached" (since only a
QP attached to an SRQ can generate that) and so the event handler will
just return immediately and do nothing.
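
The shape of handler described above can be sketched as follows. The enum,
the context struct and the function are hypothetical userspace stand-ins;
the real event codes and the ib_event structure live in the IB verbs
headers.

```c
/* Hypothetical event codes for this sketch only. */
enum demo_event_type {
	DEMO_EVENT_COMM_EST,
	DEMO_EVENT_QP_LAST_WQE_REACHED,
};

struct demo_rx_ctx {
	int drain_requests; /* times the drain handling was triggered */
};

/* Only "last WQE reached" needs any work; every other event returns at
 * once. A QP with no SRQ never raises that event, so for such a QP the
 * handler is effectively a no-op. */
void demo_rx_event_handler(enum demo_event_type type, struct demo_rx_ctx *ctx)
{
	if (type != DEMO_EVENT_QP_LAST_WQE_REACHED)
		return;

	ctx->drain_requests++;
}
```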

 > The atomic_dec() does happen, but that is in ipoib_cm_req_handler(). There are
 > several places where allocate_and_post_rbuf_nosrq() could return an error after 
 > the atomic_inc(). So, there is an atomic_dec() in the calling routine. On the
 > other hand I could move that to allocate_and_post_rbuf_nosrq() itself.

Got it.  I guess that's OK although it does seem like it would be
clearer if allocate_and_post_rbuf_nosrq() unwound everything on error.
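
For illustration, a function that unwinds its own counter on every error
path could be structured like this. It is a userspace sketch using C11
atomics in place of the kernel's atomic_t; the names, the limit of 2 and the
ring element type are assumptions, not the patch's code.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the patch's QP counter and limit. */
static atomic_int demo_current_rc_qp;
enum { DEMO_MAX_RC_QP = 2 };

/* Reserve a QP slot and allocate its receive ring. On any failure the
 * slot is released here, so the caller has nothing to undo. */
int demo_alloc_rx_ring(size_t entries, void **ring_out)
{
	void *ring;
	int ret;

	if (atomic_fetch_add(&demo_current_rc_qp, 1) >= DEMO_MAX_RC_QP) {
		ret = -EINVAL;
		goto err_dec;
	}

	ring = calloc(entries, sizeof(unsigned long long));
	if (!ring) {
		ret = -ENOMEM;
		goto err_dec;
	}

	*ring_out = ring;
	return 0;

err_dec:
	atomic_fetch_sub(&demo_current_rc_qp, 1);
	return ret;
}

/* Release a ring and the slot that was reserved for it. */
void demo_free_rx_ring(void *ring)
{
	free(ring);
	atomic_fetch_sub(&demo_current_rc_qp, 1);
}

int demo_rc_qp_count(void)
{
	return atomic_load(&demo_current_rc_qp);
}
```

With the decrement on the function's own error label, a caller that sees a
non-zero return never has to touch the counter.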

 - R.


From pradeeps at linux.vnet.ibm.com  Wed Jul 18 22:55:33 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Wed, 18 Jul 2007 22:55:33 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <adaps2pgdxx.fsf@cisco.com>
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
	<469EB694.7040408@linux.vnet.ibm.com> <adaps2pgdxx.fsf@cisco.com>
Message-ID: <469EFCD5.5050800@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > They are not quite the same. How about:
>  > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE))
> 
> That makes sense.
> 
>  > >  > -		.event_handler = ipoib_cm_rx_event_handler,
>  > > 
>  > > why?  seems harmless to just leave this alone for all QPs even if an
>  > > SRQ isn't attached.
>  > 
>  > If memory serves me right, I tried that and ran into some inexplicable problems.
>  > Maybe it was hang or no traffic went through -don't exactly recollect what it was.
>  > After this change the problem went away.
> 
> Umm... I would like to get to the root cause of that.  Because as far
> as I can see there is no problem if the event handler is called for a
> non-SRQ QP.  The event will never be "last WQE reached" (since only a
> QP attached to an SRQ can generate that) and so the event handler will
> just return immediately and do nothing.

Since I do not recollect what the issue was, it might require some investigation,
and we have only a short window for the merge. Would it be okay if I submit a
patch without this for the merge? Subsequently I will submit a patch to address this issue.

Pradeep


From erezz at voltaire.com  Thu Jul 19 01:31:41 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 19 Jul 2007 11:31:41 +0300
Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support
In-Reply-To: <adar6n8z7za.fsf@cisco.com>
References: <11845791213043-git-send-email-jens.axboe@oracle.com><1184579123437-git-send-email-jens.axboe@oracle.com>
	<adar6n8z7za.fsf@cisco.com>
Message-ID: <469F216D.3060306@voltaire.com>

Roland Dreier wrote:

> [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland]
>

I would like to test that on iSER. Where can I download all 33 patches from?

Thanks,
Erez


From jens.axboe at oracle.com  Thu Jul 19 01:39:39 2007
From: jens.axboe at oracle.com (Jens Axboe)
Date: Thu, 19 Jul 2007 10:39:39 +0200
Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support
In-Reply-To: <469F216D.3060306@voltaire.com>
References: <adar6n8z7za.fsf@cisco.com> <469F216D.3060306@voltaire.com>
Message-ID: <20070719083939.GC11657@kernel.dk>

On Thu, Jul 19 2007, Erez Zilber wrote:
> Roland Dreier wrote:
> 
> > [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland]
> >
> 
> I would like to test that on iSER. Where can I download all 33 patches from?

I can provide a rolled-up patch for you; right now the patchset has been
split into a series of 3 (core -> drivers -> arch bits are separate).
Here's one for current -git as of this morning:

http://brick.kernel.dk/sglist-chain-all-2.6.22-git-20070719

-- 
Jens Axboe


From vlad at lists.openfabrics.org  Thu Jul 19 01:45:32 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 19 Jul 2007 01:45:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070719-0100 daily build status
Message-ID: <20070719084532.67CC6E60858@openfabrics.org>

This email was generated automatically, please do not reply


Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7


From mst at dev.mellanox.co.il  Thu Jul 19 01:47:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 11:47:51 +0300
Subject: [ofa-general] oops on mlx4 modprobe
Message-ID: <20070719084751.GC24018@mellanox.co.il>

I got the following when loading mlx4_ib on git
589f1e81bde732dd0b1bc5d01b6bddd4bcb4527b


[ 1350.668590] Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
[ 1350.674068]  [<ffffffff8027b373>] __kmalloc+0x51/0xaf
[ 1350.682159] PGD 0
[ 1350.684378] Oops: 0000 [1] SMP
[ 1350.687735] CPU 3
[ 1350.689950] Modules linked in: ib_ipoib ib_cm ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core piix ata_piix
[ 1350.701777] Pid: 5391, comm: ipoib Not tainted 2.6.22-x86_64-git #119
[ 1350.708400] RIP: 0010:[<ffffffff8027b373>]  [<ffffffff8027b373>] __kmalloc+0x51/0xaf
[ 1350.716536] RSP: 0018:ffff81007c655ba0  EFLAGS: 00010046
[ 1350.722034] RAX: 0000000000000003 RBX: 0000000000000246 RCX: 0000000000000040
[ 1350.729352] RDX: ffff81007ed15000 RSI: 00000000000000d0 RDI: 0000000000000000
[ 1350.736669] RBP: ffff81007c655bc0 R08: 00000000fffffff0 R09: ffff810075779d80
[ 1350.743985] R10: 0000000000000001 R11: 0000000005b8d800 R12: 00000000000000d0
[ 1350.751302] R13: 0000000000000010 R14: ffff81007ed7cc78 R15: ffff81007dbad800
[ 1350.758620] FS:  0000000000000000(0000) GS:ffff81007ff2b340(0000) knlGS:0000000000000000
[ 1350.767089] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1350.773021] CR2: 0000000000000028 CR3: 0000000075ca6000 CR4: 00000000000006e0
[ 1350.780338] Process ipoib (pid: 5391, threadinfo ffff81007c654000, task ffff81007c5d8040)
[ 1350.788895] Stack:  ffff81007ed7cc00 0000000000000000 ffff81007ed7cc00 ffff81007ed7cd20
[ 1350.797331]  ffff81007c655c40 ffffffff88063cb6 ffff81006ae20b80 000000006ae20c30
[ 1350.805151]  ffff81007c655df0 ffff81007e3ba380 00000000000000d0 ffff81007ffa7c80
[ 1350.812587] Call Trace:
[ 1350.815619]  [<ffffffff88063cb6>] :mlx4_ib:create_qp_common+0x558/0x736
[ 1350.822421]  [<ffffffff88064c2e>] :mlx4_ib:mlx4_ib_create_qp+0x62/0x11f
[ 1350.829223]  [<ffffffff880999d2>] :ib_ipoib:ipoib_cm_tx_completion+0x0/0x2bb
[ 1350.836461]  [<ffffffff8800eca9>] :ib_core:ib_create_qp+0x18/0x94
[ 1350.842743]  [<ffffffff8809a281>] :ib_ipoib:ipoib_cm_tx_start+0x216/0x651
[ 1350.849714]  [<ffffffff80244382>] queue_work+0x3f/0x4a
[ 1350.855043]  [<ffffffff88080e63>] :ib_sa:ib_sa_join_multicast+0x292/0x2df
[ 1350.862030]  [<ffffffff8809a06b>] :ib_ipoib:ipoib_cm_tx_start+0x0/0x651
[ 1350.868829]  [<ffffffff80243cd4>] run_workqueue+0x85/0x10f
[ 1350.874501]  [<ffffffff80244695>] worker_thread+0x0/0xe7
[ 1350.880000]  [<ffffffff80244771>] worker_thread+0xdc/0xe7
[ 1350.885585]  [<ffffffff80247747>] autoremove_wake_function+0x0/0x38
[ 1350.892036]  [<ffffffff80247622>] kthread+0x49/0x77
[ 1350.897102]  [<ffffffff8020caa8>] child_rip+0xa/0x12
[ 1350.902254]  [<ffffffff802475d9>] kthread+0x0/0x77
[ 1350.907231]  [<ffffffff8020ca9e>] child_rip+0x0/0x12
[ 1350.912384]
[ 1350.914068]
[ 1350.914068] Code: 49 8b 54 c5 00 83 3a 00 74 16 8b 02 c7 42 0c 01 00 00 00 ff
[ 1350.923599] RIP  [<ffffffff8027b373>] __kmalloc+0x51/0xaf
[ 1350.929195]  RSP <ffff81007c655ba0>
[ 1350.932873] CR2: 0000000000000028


-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 01:49:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 11:49:27 +0300
Subject: [ofa-general] Re: ofa_1_2_kernel 20070719-0100 daily build status
In-Reply-To: <20070719084532.67CC6E60858@openfabrics.org>
References: <20070719084532.67CC6E60858@openfabrics.org>
Message-ID: <20070719084927.GD24018@mellanox.co.il>

> 
> Failed:
> Build failed on i686 with linux-2.6.22-rc7

Why is it still failing?
And shouldn't we switch to 2.6.22?

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 02:40:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 12:40:39 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: fix oops in qp allocation for srq
	case
Message-ID: <20070719094039.GF24018@mellanox.co.il>

Don't pass a zero size to kmalloc when qp->rq.wqe_cnt == 0 (e.g. when the QP
is attached to an SRQ).

Note: initializing sq.wrid and rq.wrid to NULL at the top keeps error handling
simple, and also fixes what seems to be a bug in create_qp_common error
handling: if srq is set for a userspace QP, the code at err_wrid would call
kfree on the wrid arrays even though they were never initialized.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

This patch fixes the oops I reported earlier.

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index b5a24fb..79e50e5 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -315,6 +315,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	qp->rq.tail	    = 0;
 	qp->sq.head	    = 0;
 	qp->sq.tail	    = 0;
+	qp->sq.wrid         = NULL;
+	qp->rq.wrid         = NULL;
 
 	err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp);
 	if (err)
@@ -385,13 +387,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
-		qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
-
-		if (!qp->sq.wrid || !qp->rq.wrid) {
+		qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+		if (!qp->sq.wrid) {
 			err = -ENOMEM;
 			goto err_wrid;
 		}
+		if (qp->rq.wqe_cnt) {
+			qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+			if (!qp->rq.wrid) {
+				err = -ENOMEM;
+				goto err_wrid;
+			}
+		}
 	}
 
 	err = mlx4_qp_alloc(dev->dev, sqpn, &qp->mqp);

-- 
MST
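
For illustration only, the allocation pattern described in this patch can be
sketched in userspace C. The names below (struct wq, struct qp,
qp_alloc_wrids) are hypothetical stand-ins, not the mlx4 definitions, and
malloc/free stand in for kmalloc/kfree. Unlike the patch, the sketch guards
both queues, because userspace malloc(0) may legitimately return NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a work queue; not the real mlx4 structure. */
struct wq {
    unsigned wqe_cnt;   /* number of entries; 0 when the QP uses an SRQ */
    uint64_t *wrid;     /* per-entry work request IDs */
};

struct qp {
    struct wq sq;
    struct wq rq;
};

/* Returns 0 on success or -ENOMEM; on failure nothing stays allocated. */
int qp_alloc_wrids(struct qp *qp)
{
    assert(qp != NULL);

    /* Start from NULL so the error path can free unconditionally. */
    qp->sq.wrid = NULL;
    qp->rq.wrid = NULL;

    /* Only allocate when there is something to allocate. */
    if (qp->sq.wqe_cnt) {
        qp->sq.wrid = malloc(qp->sq.wqe_cnt * sizeof(uint64_t));
        if (!qp->sq.wrid)
            goto err_wrid;
    }
    if (qp->rq.wqe_cnt) {
        qp->rq.wrid = malloc(qp->rq.wqe_cnt * sizeof(uint64_t));
        if (!qp->rq.wrid)
            goto err_wrid;
    }
    return 0;

err_wrid:
    /* free(NULL) is a no-op, so no per-pointer bookkeeping is needed. */
    free(qp->sq.wrid);
    free(qp->rq.wrid);
    qp->sq.wrid = NULL;
    qp->rq.wrid = NULL;
    return -ENOMEM;
}
```

With rq.wqe_cnt == 0 the function succeeds and leaves rq.wrid NULL, which is
the state the single error label relies on.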


From vlad at lists.openfabrics.org  Thu Jul 19 02:45:34 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 19 Jul 2007 02:45:34 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070719-0200 daily build status
Message-ID: <20070719094535.0DBC6E60870@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22-rc7
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From mst at dev.mellanox.co.il  Thu Jul 19 02:50:12 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 12:50:12 +0300
Subject: [ofa-general] [PATCH] IB/mthca: enable MSI-X by default
Message-ID: <20070719095012.GH24018@mellanox.co.il>

Recover from MSI-X errors by automatically falling back to regular interrupts,
instead of asking the user to do this manually.  This makes it possible to
enable MSI-X by default, and will make it possible to get rid of the msi_x
module option in the future.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 76fed75..0c8b954 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 
 #ifdef CONFIG_PCI_MSI
 
-static int msi_x = 0;
+static int msi_x = 1;
 module_param(msi_x, int, 0444);
 MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
 
@@ -837,10 +837,7 @@ static int mthca_setup_hca(struct mthca_dev *dev)
 			  dev->mthca_flags & MTHCA_FLAG_MSI_X ?
 			  dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector :
 			  dev->pdev->irq);
-		if (dev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X))
-			mthca_err(dev, "Try again with MSI/MSI-X disabled.\n");
-		else
-			mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n");
+		mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n");
 
 		goto err_cmd_poll;
 	}
@@ -1115,24 +1112,6 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 		goto err_free_dev;
 	}
 
-	if (msi_x && !mthca_enable_msi_x(mdev))
-		mdev->mthca_flags |= MTHCA_FLAG_MSI_X;
-	else if (msi) {
-		static int warned;
-
-		if (!warned) {
-			printk(KERN_WARNING PFX "WARNING: MSI support will be "
-			       "removed from the ib_mthca driver in January 2008.\n");
-			printk(KERN_WARNING "    If you are using MSI and cannot "
-			       "switch to MSI-X, please tell "
-			       "<general at lists.openfabrics.org>.\n");
-			++warned;
-		}
-
-		if (!pci_enable_msi(pdev))
-			mdev->mthca_flags |= MTHCA_FLAG_MSI;
-	}
-
 	if (mthca_cmd_init(mdev)) {
 		mthca_err(mdev, "Failed to init command interface, aborting.\n");
 		goto err_free_dev;
@@ -1156,7 +1135,36 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 		mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n");
 	}
 
+	if (msi_x && !mthca_enable_msi_x(mdev))
+		mdev->mthca_flags |= MTHCA_FLAG_MSI_X;
+	else if (msi) {
+		static int warned;
+
+		if (!warned) {
+			printk(KERN_WARNING PFX "WARNING: MSI support will be "
+			       "removed from the ib_mthca driver in January 2008.\n");
+			printk(KERN_WARNING "    If you are using MSI and cannot "
+			       "switch to MSI-X, please tell "
+			       "<general at lists.openfabrics.org>.\n");
+			++warned;
+		}
+
+		if (!pci_enable_msi(pdev))
+			mdev->mthca_flags |= MTHCA_FLAG_MSI;
+	}
+
 	err = mthca_setup_hca(mdev);
+	if (err == -EBUSY && (mdev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X))) {
+		mthca_warn(mdev, "Trying again with MSI/MSI-X disabled.\n");
+		if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
+			pci_disable_msix(pdev);
+		if (mdev->mthca_flags & MTHCA_FLAG_MSI)
+			pci_disable_msi(pdev);
+		mdev->mthca_flags &= ~(MTHCA_FLAG_MSI_X | MTHCA_FLAG_MSI);
+
+		err = mthca_setup_hca(mdev);
+	}
+
 	if (err)
 		goto err_close;
 
@@ -1192,17 +1200,17 @@ err_cleanup:
 	mthca_cleanup_uar_table(mdev);
 
 err_close:
+	if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
+		pci_disable_msix(pdev);
+	if (mdev->mthca_flags & MTHCA_FLAG_MSI)
+		pci_disable_msi(pdev);
+
 	mthca_close_hca(mdev);
 
 err_cmd:
 	mthca_cmd_cleanup(mdev);
 
 err_free_dev:
-	if (mdev->mthca_flags & MTHCA_FLAG_MSI_X)
-		pci_disable_msix(pdev);
-	if (mdev->mthca_flags & MTHCA_FLAG_MSI)
-		pci_disable_msi(pdev);
-
 	ib_dealloc_device(&mdev->ib_dev);
 
 err_free_res:

-- 
MST
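
The retry logic in this patch can be sketched as a small userspace
simulation. Everything here is an assumption made for illustration: FLAG_MSI_X,
struct dev, setup_hca and init_one are hypothetical names, and the
msi_x_works field only simulates whether a test interrupt would arrive. The
real driver calls pci_disable_msix() where the sketch merely clears a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define FLAG_MSI_X 0x1u   /* hypothetical flag, in the spirit of MTHCA_FLAG_MSI_X */

struct dev {
    unsigned flags;
    int msi_x_works;      /* simulation knob: do MSI-X interrupts arrive? */
    int setup_calls;      /* how many times setup was attempted */
};

/* Simulated setup: -EBUSY when MSI-X is enabled but interrupts never arrive. */
static int setup_hca(struct dev *d)
{
    d->setup_calls++;
    if ((d->flags & FLAG_MSI_X) && !d->msi_x_works)
        return -EBUSY;
    return 0;
}

/* Try MSI-X first; on -EBUSY fall back to the regular interrupt and retry once. */
int init_one(struct dev *d)
{
    int err;

    assert(d != NULL);

    d->flags |= FLAG_MSI_X;             /* MSI-X is attempted by default */

    err = setup_hca(d);
    if (err == -EBUSY && (d->flags & FLAG_MSI_X)) {
        fprintf(stderr, "Trying again with MSI-X disabled.\n");
        d->flags &= ~FLAG_MSI_X;        /* stands in for pci_disable_msix() */
        err = setup_hca(d);
    }
    return err;
}
```

The point of the design is that the user never has to reload the module with
a different option: a failed first attempt costs one extra setup call.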


From mst at dev.mellanox.co.il  Thu Jul 19 04:21:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 14:21:55 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: enable MSI-X by default
Message-ID: <20070719112155.GJ24018@mellanox.co.il>

Recover from MSI-X errors by automatically falling back to regular interrupts,
instead of asking the user to do this manually.  This makes it possible to
enable MSI-X by default, and will make it possible to get rid of the msi_x
module option in the future.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 4dc9dc1..fee53b2 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 
 #ifdef CONFIG_PCI_MSI
 
-static int msi_x;
+static int msi_x = 1;
 module_param(msi_x, int, 0444);
 MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
 
@@ -602,10 +602,7 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev)
 		mlx4_err(dev, "NOP command failed to generate interrupt "
 			 "(IRQ %d), aborting.\n",
 			 priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
-		if (dev->flags & MLX4_FLAG_MSI_X)
-			mlx4_err(dev, "Try again with MSI-X disabled.\n");
-		else
-			mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n");
+		mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n");
 
 		goto err_cmd_poll;
 	}
@@ -803,17 +800,26 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev,
 		goto err_free_dev;
 	}
 
-	mlx4_enable_msi_x(dev);
-
 	if (mlx4_cmd_init(dev)) {
 		mlx4_err(dev, "Failed to init command interface, aborting.\n");
 		goto err_free_dev;
 	}
 
+	mlx4_enable_msi_x(dev);
+
 	err = mlx4_init_hca(dev);
+	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
+		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
+		dev->flags &= ~MLX4_FLAG_MSI_X;
+		pci_disable_msix(pdev);
+		err = mlx4_init_hca(dev);
+	}
+
 	if (err)
 		goto err_cmd;
 
+	mlx4_enable_msi_x(dev);
+
 	err = mlx4_setup_hca(dev);
 	if (err)
 		goto err_close;
@@ -838,15 +844,15 @@ err_cleanup:
 	mlx4_cleanup_uar_table(dev);
 
 err_close:
+	if (dev->flags & MLX4_FLAG_MSI_X)
+		pci_disable_msix(pdev);
+
 	mlx4_close_hca(dev);
 
 err_cmd:
 	mlx4_cmd_cleanup(dev);
 
 err_free_dev:
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		pci_disable_msix(pdev);
-
 	kfree(priv);
 
 err_release_bar2:
-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 04:28:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 14:28:49 +0300
Subject: [ofa-general] [PATCH] IB/mthca: change command token on timeout
Message-ID: <20070719112849.GK24018@mellanox.co.il>

The command token is currently updated only on a command
event. This means that after a command timeout, the same token
is reused for the next command, which results in a mess
if the timed-out command *is* eventually completed.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

This patch is in OFED 1.2, so I think we want it for 2.6.23 too.

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 7131446..26c42a1 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -355,9 +355,6 @@ void mthca_cmd_event(struct mthca_dev *dev,
 	context->result    = 0;
 	context->status    = status;
 	context->out_param = out_param;
-
-	context->token += dev->cmd.token_mask + 1;
-
 	complete(&context->done);
 }
 
@@ -379,6 +376,7 @@ static int mthca_cmd_wait(struct mthca_dev *dev,
 	spin_lock(&dev->cmd.context_lock);
 	BUG_ON(dev->cmd.free_head < 0);
 	context = &dev->cmd.context[dev->cmd.free_head];
+	context->token += dev->cmd.token_mask + 1;
 	dev->cmd.free_head = context->next;
 	spin_unlock(&dev->cmd.context_lock);
 

-- 
MST
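
A rough userspace sketch of why the token should change when a command is
started rather than when it completes. The names (MAX_CMDS, cmd_start,
cmd_event, contexts) are hypothetical; the sketch assumes, as the patch
suggests, that the low bits of a token select the context slot and the higher
bits act as a generation counter.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CMDS   4u                 /* assumed to be a power of two */
#define TOKEN_MASK (MAX_CMDS - 1u)

struct cmd_context {
    uint16_t token;   /* low bits: slot index; high bits: generation */
    int done;
};

static struct cmd_context contexts[MAX_CMDS];

void cmd_init(void)
{
    for (unsigned i = 0; i < MAX_CMDS; i++) {
        contexts[i].token = (uint16_t)i;
        contexts[i].done = 0;
    }
}

/*
 * Claim a slot for a new command. The generation is bumped here, so a late
 * completion for an earlier, timed-out command no longer matches the slot.
 */
uint16_t cmd_start(unsigned slot)
{
    assert(slot < MAX_CMDS);
    contexts[slot].token = (uint16_t)(contexts[slot].token + TOKEN_MASK + 1u);
    contexts[slot].done = 0;
    return contexts[slot].token;
}

/* Completion event: returns 1 if the token is current, 0 if it is stale. */
int cmd_event(uint16_t token)
{
    struct cmd_context *ctx = &contexts[token & TOKEN_MASK];

    if (token != ctx->token)
        return 0;               /* completion of a timed-out command; ignore */
    ctx->done = 1;
    return 1;
}
```

If the token were bumped only in cmd_event, a command that timed out would
leave its token unchanged, and its late completion would be mistaken for the
completion of whichever command reused the slot.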


From sashak at voltaire.com  Thu Jul 19 05:13:37 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 15:13:37 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
Message-ID: <1184847217.21739.16.camel@localhost>

On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote:
> Ohh, you're right. The Enh0 should get an update.
> I thought I got it right. Do you want me to provide an updated patch?

I can update on my side - I think we could remove VLHighLimit update
from osm_lid_mgr and have one only in osm_link_mgr.

Sasha

> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> > Sent: Wednesday, July 18, 2007 10:22 PM
> > To: Eitan Zahavi
> > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > Subject: Re: [PATCH] opensm: Bug in coding trying to set 
> > vl_arb_high_limit
> > 
> > Hi Eitan,
> > 
> > On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha
> > > 
> > > When QoS setup is done the code was trying to send updates of 
> > > vl_arb_high_limit by req_set of PORT_INFO with the new data.
> > > However, at that stage the SM still did not assign LIDs to the ports.
> > > So the sent PortInfo.base_lid was still zero. The specification does
> > > not allow for such LIDs (they are considered illegal).
> > > 
> > > the patch below fixes this by storing the calculated value and later
> > > using it in link and lid managers.
> > 
> > Good, Thanks (and this also saves one PortInfo update MAD). 
> > One question below:
> > 
> > 
> > > 
> > > Eitan
> > > 
> > > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > > 
> > 
> > [snip...]
> > 
> > > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> > > index bc3f8b3..ed76382 100644
> > > --- a/opensm/opensm/osm_lid_mgr.c
> > > +++ b/opensm/opensm/osm_lid_mgr.c
> > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
> > >             ib_port_info_get_port_state(p_old_pi) )
> > >          send_set = TRUE;
> > >      }
> > > +
> > > +	 /* provide the vl_high_limit from the qos mgr */
> > > +	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
> > > +		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> > > +		 {
> > > +			 send_set = TRUE;
> > > +			 p_pi->vl_high_limit = p_physp->vl_high_limit;
> > > +		 }
> > 
> > This part of code is for port_num != 0, so VLHighLimit setup 
> > will be skipped for switch enhanced port 0. Is it something 
> > expected? If so why?
> > 
> > Sasha
> > 


From eitan at mellanox.co.il  Thu Jul 19 05:24:13 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 19 Jul 2007 15:24:13 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
	<1184847217.21739.16.camel@localhost>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com>

Hi Sasha,

I was not sure if there might be a case where the Link manager will not
touch the port.
So I placed it on both sides. Can't remember now if it is possible or
not.
Thanks for taking care of it.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Thursday, July 19, 2007 3:14 PM
> To: Eitan Zahavi
> Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> Subject: RE: [PATCH] opensm: Bug in coding trying to set 
> vl_arb_high_limit
> 
> On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote:
> > Ohh, you're right. The Enh0 should get an update.
> > I thought I got it right. Do you want me to provide an updated patch?
> 
> I can update on my side - I think we could remove VLHighLimit 
> update from osm_lid_mgr and have one only in osm_link_mgr.
> 
> Sasha
> 
> > 
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > Sent: Wednesday, July 18, 2007 10:22 PM
> > > To: Eitan Zahavi
> > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > > Subject: Re: [PATCH] opensm: Bug in coding trying to set 
> > > vl_arb_high_limit
> > > 
> > > Hi Eitan,
> > > 
> > > On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> > > > Hi Sasha
> > > > 
> > > > When QoS setup is done the code was trying to send updates of 
> > > > vl_arb_high_limit by req_set of PORT_INFO with the new data.
> > > > However, at that stage the SM still did not assign LIDs to the ports.
> > > > So the sent PortInfo.base_lid was still zero. The specification does
> > > > not allow for such LIDs (they are considered illegal).
> > > > 
> > > > the patch below fixes this by storing the calculated value and later
> > > > using it in link and lid managers.
> > > 
> > > Good, Thanks (and this also saves one PortInfo update MAD). 
> > > One question below:
> > > 
> > > 
> > > > 
> > > > Eitan
> > > > 
> > > > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > > > 
> > > 
> > > [snip...]
> > > 
> > > > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> > > > index bc3f8b3..ed76382 100644
> > > > --- a/opensm/opensm/osm_lid_mgr.c
> > > > +++ b/opensm/opensm/osm_lid_mgr.c
> > > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
> > > >             ib_port_info_get_port_state(p_old_pi) )
> > > >          send_set = TRUE;
> > > >      }
> > > > +
> > > > +	 /* provide the vl_high_limit from the qos mgr */
> > > > +	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
> > > > +		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> > > > +		 {
> > > > +			 send_set = TRUE;
> > > > +			 p_pi->vl_high_limit = p_physp->vl_high_limit;
> > > > +		 }
> > > 
> > > This part of code is for port_num != 0, so VLHighLimit setup
> > > will be skipped for switch enhanced port 0. Is it something
> > > expected? If so why?
> > > 
> > > Sasha
> > > 
> 


From erezz at voltaire.com  Thu Jul 19 05:28:53 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 19 Jul 2007 15:28:53 +0300
Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support
In-Reply-To: <20070719083939.GC11657@kernel.dk>
References: <adar6n8z7za.fsf@cisco.com> <469F216D.3060306@voltaire.com>
	<20070719083939.GC11657@kernel.dk>
Message-ID: <469F5905.5020303@voltaire.com>

Jens Axboe wrote:

> On Thu, Jul 19 2007, Erez Zilber wrote:
>   
>> Roland Dreier wrote:
>>
>>     
>>> [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland]
>>>
>>>       
>> I would like to test that on iSER. Where can I download all 33 patches from?
>>     
>
> I can provide a rolled up patch for you; right now the patchset has been
> split in a series of 3 (core -> drivers -> arch bits are separate).
> Here's one for current -git as of this morning:
>
> http://brick.kernel.dk/sglist-chain-all-2.6.22-git-20070719
>
>   

Looks ok with iSER.

Erez


From sashak at voltaire.com  Thu Jul 19 06:00:56 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 16:00:56 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
	<1184847217.21739.16.camel@localhost>
	<6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com>
Message-ID: <1184850056.21739.20.camel@localhost>

Hi Eitan,

On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote:
> Hi Sasha,
> 
> I was not sure if there might be a case where the Link manager will not
> touch the port.

It should, at least with IB_LINK_NO_CHANGE call. So I moved VLHighLimit
setup under this condition too (where most PortInfo fields are handled).
Will push soon. Thanks for the patch.

Sasha

> So I placed it on both sides. Can't remember now if it is possible or
> not.
> Thanks for taking care of it.
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> > Sent: Thursday, July 19, 2007 3:14 PM
> > To: Eitan Zahavi
> > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > Subject: RE: [PATCH] opensm: Bug in coding trying to set 
> > vl_arb_high_limit
> > 
> > On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote:
> > > Ohh, you're right. The Enh0 should get an update.
> > > I thought I got it right. Do you want me to provide an updated patch?
> > 
> > I can update on my side - I think we could remove VLHighLimit 
> > update from osm_lid_mgr and have one only in osm_link_mgr.
> > 
> > Sasha
> > 
> > > 
> > > Eitan Zahavi
> > > Senior Engineering Director, Software Architect
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > 
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > > Sent: Wednesday, July 18, 2007 10:22 PM
> > > > To: Eitan Zahavi
> > > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > > > Subject: Re: [PATCH] opensm: Bug in coding trying to set 
> > > > vl_arb_high_limit
> > > > 
> > > > Hi Eitan,
> > > > 
> > > > On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> > > > > Hi Sasha
> > > > > 
> > > > > When QoS setup is done the code was trying to send updates of 
> > > > > vl_arb_high_limit by req_set of PORT_INFO with the new data.
> > > > > However, at that stage the SM still did not assign LIDs to the ports.
> > > > > So the sent PortInfo.base_lid was still zero. The specification does
> > > > > not allow for such LIDs (they are considered illegal).
> > > > > 
> > > > > the patch below fixes this by storing the calculated value and later
> > > > > using it in link and lid managers.
> > > > 
> > > > Good, Thanks (and this also saves one PortInfo update MAD). 
> > > > One question below:
> > > > 
> > > > 
> > > > > 
> > > > > Eitan
> > > > > 
> > > > > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > > > > 
> > > > 
> > > > [snip...]
> > > > 
> > > > > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> > > > > index bc3f8b3..ed76382 100644
> > > > > --- a/opensm/opensm/osm_lid_mgr.c
> > > > > +++ b/opensm/opensm/osm_lid_mgr.c
> > > > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi(
> > > > >             ib_port_info_get_port_state(p_old_pi) )
> > > > >          send_set = TRUE;
> > > > >      }
> > > > > +
> > > > > +	 /* provide the vl_high_limit from the qos mgr */
> > > > > +	 if (p_mgr->p_subn->opt.no_qos == FALSE) 
> > > > > +		 if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> > > > > +		 {
> > > > > +			 send_set = TRUE;
> > > > > +			 p_pi->vl_high_limit = p_physp->vl_high_limit;
> > > > > +		 }
> > > > 
> > > > This part of code is for port_num != 0, so VLHighLimit setup
> > > > will be skipped for switch enhanced port 0. Is it something
> > > > expected? If so why?
> > > > 
> > > > Sasha
> > > > 
> > 


From sashak at voltaire.com  Thu Jul 19 06:24:07 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 16:24:07 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <1184850056.21739.20.camel@localhost>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
	<1184847217.21739.16.camel@localhost>
	<6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com>
	<1184850056.21739.20.camel@localhost>
Message-ID: <20070719132407.GA16597@sashak.voltaire.com>

On 16:00 Thu 19 Jul     , Sasha Khapyorsky wrote:
> Hi Eitan,
> 
> On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote:
> > Hi Sasha,
> > 
> > I was not sure if there might be a case where the Link manager will not
> > touch the port.
> 
> It should, at least with IB_LINK_NO_CHANGE call. So I moved VLHighLimit
> setup under this condition too (where most PortInfo fields are handled).
> Will push soon. Thanks for the patch.

Actually this is what I meant:


commit 464a00b94e77d5f753a01569f19166e115eb90e5
Author: Sasha Khapyorsky <sashak at voltaire.com>
Date:   Thu Jul 19 16:03:55 2007 +0300

    opensm: VLHighLimit update during initial (in sweep) link_mgr call
    
    Update PortInfo:VLHighLimit during initial (in sweep) link_mgr call
    (which is with IB_LINK_NO_CHANGE).
    
    Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index b2b43ed..196942c 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -334,6 +334,14 @@ __osm_link_mgr_set_physp_pi(
          ib_port_info_get_op_vls(p_old_pi) )
       send_set = TRUE;
 
+    /* provide the vl_high_limit from the qos mgr */
+    if (p_mgr->p_subn->opt.no_qos == FALSE &&
+        p_physp->vl_high_limit != p_old_pi->vl_high_limit)
+    {
+      send_set = TRUE;
+      p_pi->vl_high_limit = p_physp->vl_high_limit;
+    }
+
     /* also the context can flag the need to check for errors. */
     context.pi_context.ignore_errors = FALSE;
   }
@@ -360,15 +368,6 @@ __osm_link_mgr_set_physp_pi(
       context.pi_context.active_transition = FALSE;
   }
 
-  /* provide the vl_high_limit from the qos mgr */
-  if (p_mgr->p_subn->opt.no_qos == FALSE)
-	  if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
-	  {
-		  send_set = TRUE;
-		  p_pi->vl_high_limit = p_physp->vl_high_limit;
-	  }
-
-
   context.pi_context.node_guid = osm_node_get_node_guid( p_node );
   context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
   context.pi_context.set_method = TRUE;


Sasha


From eitan at mellanox.co.il  Thu Jul 19 06:18:00 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 19 Jul 2007 16:18:00 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
	<20070718192217.GE27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com>
	<1184847217.21739.16.camel@localhost>
	<6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com>
	<1184850056.21739.20.camel@localhost>
	<20070719132407.GA16597@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED57A2@mtlexch01.mtl.com>

Looks good.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Thursday, July 19, 2007 4:24 PM
> To: Eitan Zahavi
> Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> Subject: Re: [PATCH] opensm: Bug in coding trying to set 
> vl_arb_high_limit
> 
> On 16:00 Thu 19 Jul     , Sasha Khapyorsky wrote:
> > Hi Eitan,
> > 
> > On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote:
> > > Hi Sasha,
> > > 
> > > I was not sure if there might be a case where the Link manager will
> > > not touch the port.
> > 
> > It should, at least with IB_LINK_NO_CHANGE call. So I moved VLHighLimit
> > setup under this condition too (where most PortInfo fields are handled).
> > Will push soon. Thanks for the patch.
> 
> Actually this is what I meant:
> 
> 
> commit 464a00b94e77d5f753a01569f19166e115eb90e5
> Author: Sasha Khapyorsky <sashak at voltaire.com>
> Date:   Thu Jul 19 16:03:55 2007 +0300
> 
>     opensm: VLHighLimit update during initial (in sweep) link_mgr call
>     
>     Update PortInfo:VLHighLimit during initial (in sweep) link_mgr call
>     (which is with IB_LINK_NO_CHANGE).
>     
>     Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
> index b2b43ed..196942c 100644
> --- a/opensm/opensm/osm_link_mgr.c
> +++ b/opensm/opensm/osm_link_mgr.c
> @@ -334,6 +334,14 @@ __osm_link_mgr_set_physp_pi(
>           ib_port_info_get_op_vls(p_old_pi) )
>        send_set = TRUE;
>  
> +    /* provide the vl_high_limit from the qos mgr */
> +    if (p_mgr->p_subn->opt.no_qos == FALSE &&
> +        p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> +    {
> +      send_set = TRUE;
> +      p_pi->vl_high_limit = p_physp->vl_high_limit;
> +    }
> +
>      /* also the context can flag the need to check for errors. */
>      context.pi_context.ignore_errors = FALSE;
>    }
> @@ -360,15 +368,6 @@ __osm_link_mgr_set_physp_pi(
>        context.pi_context.active_transition = FALSE;
>    }
>  
> -  /* provide the vl_high_limit from the qos mgr */
> -  if (p_mgr->p_subn->opt.no_qos == FALSE)
> -	  if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
> -	  {
> -		  send_set = TRUE;
> -		  p_pi->vl_high_limit = p_physp->vl_high_limit;
> -	  }
> -
> -
>    context.pi_context.node_guid = osm_node_get_node_guid( p_node );
>    context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
>    context.pi_context.set_method = TRUE;
> 
> 
> Sasha
> 


From sashak at voltaire.com  Thu Jul 19 06:45:33 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 19 Jul 2007 16:45:33 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set
	vl_arb_high_limit
In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com>
References: <86hco13c7e.fsf@sw053.lab.mtl.com>
Message-ID: <20070719134533.GD16597@sashak.voltaire.com>

On 19:31 Wed 18 Jul     , Eitan Zahavi wrote:
> Hi Sasha
> 
> When QoS setup is done the code was trying to send updates of
> vl_arb_high_limit by req_set of PORT_INFO with the new data.
> However, at that stage the SM still did not assign LIDs to the ports.
> So the sent PortInfo.base_lid was still zero. The specification does not
> allow for such LIDs (they are considered illegal).
> 
> the patch below fixes this by storing the calculated value and later 
> using it in link and lid managers.
> 
> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

Applied (with changes discussed in this thread). Thanks.

Sasha


From mst at dev.mellanox.co.il  Thu Jul 19 07:31:29 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 17:31:29 +0300
Subject: [ofa-general] Re: The low level driver of mlx4 kmalloc 0 bytes in QP
	creation
In-Reply-To: <adaodj1rz9k.fsf@cisco.com>
References: <46821FDA.5030900@dev.mellanox.co.il> <adaodj1rz9k.fsf@cisco.com>
Message-ID: <20070719143129.GB28640@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: The low level driver of mlx4 kmalloc 0 bytes in QP creation
> 
>  > If one creates a QP with 0 WR in the RQ in the kernel level, the low
>  > level driver of the mlx4
>  > will kmalloc 0 bytes (for the WR IDs of the RQ).
>  > (for example, the IPoIB CM creates such a QP)
>  > 
>  > Is this an error?
> 
> The consensus seems to be that kmalloc(0) is OK, although various
> 2.6.22-rc kernels printed big tracebacks when it happens.  I think
> getting rid of the kmalloc(0) in mlx4 would make the code more
> complicated for no real gain.

Hmm, seems to crash with recent git kernels.

-- 
MST


From rdreier at cisco.com  Thu Jul 19 07:36:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 19 Jul 2007 07:36:46 -0700
Subject: [ofa-general] Re: oops on mlx4 modprobe
References: <20070719084751.GC24018@mellanox.co.il>
Message-ID: <ada3azkh341.fsf@cisco.com>

Is this with CONFIG_SLAB or CONFIG_SLUB?


From rdreier at cisco.com  Thu Jul 19 07:46:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 19 Jul 2007 07:46:47 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq
	case
References: <20070719094039.GF24018@mellanox.co.il>
Message-ID: <adar6n4fo2w.fsf@cisco.com>

kmalloc(0) is fine to do.  This must be a bug introduced recently into
one of the allocators -- which one are you using?


From rdreier at cisco.com  Thu Jul 19 07:46:46 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 19 Jul 2007 07:46:46 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default
References: <20070719112155.GJ24018@mellanox.co.il>
Message-ID: <adawswwfo2x.fsf@cisco.com>

 > +	mlx4_enable_msi_x(dev);
 > +
 >  	err = mlx4_init_hca(dev);
 > +	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
 > +		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
 > +		dev->flags &= ~MLX4_FLAG_MSI_X;
 > +		pci_disable_msix(pdev);
 > +		err = mlx4_init_hca(dev);
 > +	}
 > +
 >  	if (err)
 >  		goto err_cmd;
 >  
 > +	mlx4_enable_msi_x(dev);
 > +

Am I misreading the code or is that last mlx4_enable_msi_x() wrong?

 - R.


From HNGUYEN at de.ibm.com  Thu Jul 19 08:03:26 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Thu, 19 Jul 2007 17:03:26 +0200
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event
	queues
In-Reply-To: <adalkdesyrs.fsf@cisco.com>
Message-ID: <OF00D945EC.C16B2992-ONC125731D.0052335C-C125731D.0052B490@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 17.07.2007 19:52:55:
> At a higher level, I'm left wondering why nobody talked about multiple
> EQs during the last months of the 2.6.22 process and now all of a
> sudden it becomes urgent in the last few days of the 2.6.23 merge
> window.  That's not really how I like to merge features....
OK, let's keep multiple EQs for the next release, with a more stable
verbs definition.
For the other patch to support MR with large pages we'll resend
it (without deps on multiple eqs patch) to you soon.
Regards
Nam


From mst at dev.mellanox.co.il  Thu Jul 19 09:12:01 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 19:12:01 +0300
Subject: [ofa-general] Re: oops on mlx4 modprobe
In-Reply-To: <ada3azkh341.fsf@cisco.com>
References: <20070719084751.GC24018@mellanox.co.il> <ada3azkh341.fsf@cisco.com>
Message-ID: <20070719161201.GA31246@mellanox.co.il>

CONFIG_SLAB

Quoting Roland Dreier <rdreier at cisco.com>:
Subject: Re: oops on mlx4 modprobe

Is this with CONFIG_SLAB or CONFIG_SLUB?

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 09:15:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 19:15:44 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq
	case
In-Reply-To: <adar6n4fo2w.fsf@cisco.com>
References: <20070719094039.GF24018@mellanox.co.il> <adar6n4fo2w.fsf@cisco.com>
Message-ID: <20070719161543.GC31246@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq case
> 
> kmalloc(0) is fine to do.  This must be a bug introduced recently into
> one of the allocators -- which one are you using?

the bug in error handling is real though, isn't it?

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 09:18:20 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 19:18:20 +0300
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To: <ada3azln3wo.fsf@cisco.com>
References: <1183643723.25031.262.camel@mtls03> <aday7hfrkpu.fsf@cisco.com>
	<20070718074632.GF1115@mellanox.co.il> <ada3azln3wo.fsf@cisco.com>
Message-ID: <20070719161820.GD31246@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: socket buffer accounting with UDP/ipoib
> 
>  > > +		ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
>  > > +					   DMA_FROM_DEVICE);
>  > > +		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
>  > > +						 wc->byte_len - IB_GRH_BYTES);
>  > > +		ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
>  > > +					      DMA_FROM_DEVICE);
>  > 
>  > BTW, why is ib_dma_sync_single_for_device necessary here?
> 
> Not sure what you're asking exactly.  The sync for device is needed to
> match the previous sync for the cpu obviously.

That's what I'm missing: must each sync_for_cpu be paired
with sync_for_device? Is there documentation for this somewhere?

> We need both syncs for
> the same reason we need the unmap when we don't copy -- we're copying
> data out of the skb we gave to the device earlier, so we need to make
> sure the cpu sees the right data.

Right, but device never reads the buffer, and CPU never modifies it.

-- 
MST


From mst at dev.mellanox.co.il  Thu Jul 19 09:18:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 19 Jul 2007 19:18:52 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default
In-Reply-To: <adawswwfo2x.fsf@cisco.com>
References: <20070719112155.GJ24018@mellanox.co.il> <adawswwfo2x.fsf@cisco.com>
Message-ID: <20070719161852.GE31246@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/mlx4: enable MSI-X by default
> 
>  > +	mlx4_enable_msi_x(dev);
>  > +
>  >  	err = mlx4_init_hca(dev);
>  > +	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
>  > +		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
>  > +		dev->flags &= ~MLX4_FLAG_MSI_X;
>  > +		pci_disable_msix(pdev);
>  > +		err = mlx4_init_hca(dev);
>  > +	}
>  > +
>  >  	if (err)
>  >  		goto err_cmd;
>  >  
>  > +	mlx4_enable_msi_x(dev);
>  > +
> 
> Am I misreading the code or is that last mlx4_enable_msi_x() wrong?

Hmm, looks like it is ..

-- 
MST
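
One way to see why the trailing call looks wrong is a small userspace
model of the quoted hunk's control flow. This is only a sketch of one
possible reading: struct fake_dev, fake_enable_msi_x(), fake_init_hca()
and fake_probe() are hypothetical stand-ins, and the assumptions that the
enable call simply sets the flag and that init fails with -EBUSY whenever
MSI-X is on but unusable are made up for illustration, not taken from the
driver. Under those assumptions, the second enable call turns MSI-X back
on right after the fallback path disabled it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_FLAG_MSI_X 0x1u    /* stand-in for MLX4_FLAG_MSI_X */

/* Hypothetical model of the device state; not the real struct mlx4_dev. */
struct fake_dev {
    unsigned flags;
    bool msix_broken;           /* assumption: platform cannot deliver MSI-X */
};

/* Stand-in for mlx4_enable_msi_x(): assumed to just set the flag. */
static void fake_enable_msi_x(struct fake_dev *dev)
{
    dev->flags |= FAKE_FLAG_MSI_X;
}

/* Stand-in for mlx4_init_hca(): assumed to fail with -EBUSY
 * when MSI-X is enabled but unusable. */
static int fake_init_hca(const struct fake_dev *dev)
{
    if ((dev->flags & FAKE_FLAG_MSI_X) && dev->msix_broken)
        return -EBUSY;
    return 0;
}

/* Mirrors the order of calls in the quoted hunk. enable_again selects
 * whether the trailing enable call under discussion is kept. */
static int fake_probe(struct fake_dev *dev, bool enable_again)
{
    int err;

    assert(dev != NULL);

    fake_enable_msi_x(dev);
    err = fake_init_hca(dev);
    if (err == -EBUSY && (dev->flags & FAKE_FLAG_MSI_X)) {
        dev->flags &= ~FAKE_FLAG_MSI_X;     /* fall back to no MSI-X */
        err = fake_init_hca(dev);
    }
    if (err)
        return err;

    if (enable_again)
        fake_enable_msi_x(dev);             /* undoes the fallback above */
    return 0;
}
```

In this model, a device with msix_broken set finishes fake_probe() with
the MSI-X flag set again when enable_again is true, and with the flag
clear when it is false.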


From Jonathan.Robertson at 3leafnetworks.com  Thu Jul 19 09:31:17 2007
From: Jonathan.Robertson at 3leafnetworks.com (Jonathan Robertson)
Date: Thu, 19 Jul 2007 09:31:17 -0700
Subject: FW: [ofa-general] libsdp in OFED 1.1
Message-ID: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp>

Hi Jim,

We are actually using OFED 1.1. Hopefully we'll move to 1.2 in a few weeks. The systems using it are SLES 9 SP3.

Uname -a:
Linux oracle 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux

I have added alias net-pf-27 ib_sdp to modprobe.conf.local
I have modified /usr/local/ofed/etc/libsdp.conf to have the following lines:
log min-level 9 destination syslog
use both server "/usr/local/bin/netserver" *:*
use both client "/usr/local/bin/netperf" *:*
And I created /etc/ld.so.preload and have:
/usr/local/ofed/lib64/libsdp.so

Is there a close function in ofed 1.1? Perhaps I should try to add that to port.c for 1.1?

My reply to your email bounced...
5.1.0 - Unknown address error 550-'5.1.1 unknown or illegal alias: <removed>@austin.rr.com'

Thanks!
Jonathan

-----Original Message-----
From: Jim Mott [mailto:jimmmott at austin.rr.com] 
Sent: Wednesday, July 18, 2007 3:42 PM
To: Jonathan Robertson
Subject: RE: [ofa-general] libsdp in OFED 1.1

Hi,
  I have just taken over support for libsdp and am feeling my way here.
Probably I should have replied to the list, but this works too.

  I assume you are using OFED 1.2 version of the code.  There is a close()
function in that code (port.c), so there is something fishy here.  Could you
send a little more info please.  Stuff like distro, 32/64, and perhaps the
script/commands you use to automate the preload process.

Something like:

  # uname -a
  Linux sw106.lab.mtl.com 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT \
  2006 x86_64 x86_64 x86_64 GNU/Linux

  # export LD_LIBRARY_PATH=/usr/local/ofed/lib64:/usr/local/ofed/lib
  # export LD_PRELOAD=libsdp.so
  # export LIBSDP_CONFIG_FILE=/etc/infiniband/libsdp.conf 

  # ls 
  config_parser.c   config_scanner.c   libsdp.la  Makefile     match.c   port.lo
  config_parser.h   config_scanner.lo  log.c      Makefile.am  match.lo  sdp_inet.h
  config_parser.lo  libsdp.h           log.lo     Makefile.in  port.c    socket.c


Thanks,
Jim


=========================

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan
Robertson
Sent: Wednesday, July 18, 2007 3:19 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] libsdp in OFED 1.1

Hello,

I have been using libsdp, and preloading it with the application. I would
like to have it automatically preloaded, but am concerned about some error
messages that seem harmless. So I don't want to have our client use the
ld.so.preload if there are going to be messages.

I see the following when I run a simple 'ls'

# ls
Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found
 .
..
#

Any suggestions?

I have the following in libsdp.conf
Log min-level 9 destination syslog
Use both server netserver *:*
Use both client netperf *:*

Our client is interested in having weblogic communicate with the oracle DB
using SDP, and the interface to oracle and weblogic being accessible via
tcp/ip over Ethernet as well.

Thanks!
Jonathan


From afriedle at open-mpi.org  Thu Jul 19 10:13:15 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 19 Jul 2007 10:13:15 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <468426B6.3060602@ichips.intel.com>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>
	<468426B6.3060602@ichips.intel.com>
Message-ID: <469F9BAB.4080504@open-mpi.org>

Finally was able to have the SM switched over from Cisco on the switch 
to OpenSM on a node.  Responses inline below..

Sean Hefty wrote:
>> Now the more interesting part.  I'm now able to run on a 128 node 
>> machine using open SM running on a node (before, I was running on an 8 
>> node machine which I'm told is running the Cisco SM on a Topspin 
>> switch).  On this machine, if I run my benchmark with two processes 
>> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able 
>> to join  > 750 groups simultaneously from one QP on each process.  To 
>> make this stranger, I can join only 4 groups running the same thing on 
>> the 8-node machine.
> 
> Are the switches and HCAs in the two setups the same?  If you run the 
> same SM on both clusters, do you see the same results?

The switches are different.  The 8 node machine uses a Topspin switch, 
the 128 node machine uses a Mellanox switch.  Looking at `ibstat` the 
HCAs appear to be the same (MT23108), though HCAs on the 128 node 
machine have firmware 3.2.0, where 3.5.0 is on the 8 node machine.  Does 
this matter?

Running OpenSM now, I still do not see the same results.  Behavior is 
now the same as the 128 node machine, except when running two processes 
per node (in which case I can join as many groups as I like on the 128 
node machine).  On the 8 node machine I am still limited to 4 groups in 
this case.  This makes me think the switch is involved, is this correct?

> 
>> While doing so I noticed that the time from calling 
>> rdma_join_multicast() to the event arrival stayed fairly constant (in 
>> the .001sec range), while the time from the join call to actually 
>> receiving messages on the group steadily increased from around .1 secs 
>> to around 2.7 secs with 750+ groups.  Furthermore, this time does not 
>> drop back to .1 secs if I stop the benchmark and run it (or any of my 
>> other multicast code) again.  This is understandable within a single 
>> program run, but the fact that behavior persists across runs concerns 
>> me -- feels like a bug, but I don't have much concrete here.
> 
> Even after all nodes leave all multicast groups, I don't believe that 
> there's a requirement for the SA to reprogram the switches immediately. 
>  So if the switches or the configuration of the switches are part of the 
> problem, I can imagine seeing issues between runs.
> 
> When rdma_join_multicast() reports the join event, it means either: the 
> SA has been notified of the join request, or, if the port has already 
> joined the group, that a reference count on the group has been 
> incremented.  The SA may still require time to program the switch 
> forwarding tables.

OK this makes sense, but I still don't see where all the time is going. 
  Should the fact that the switches haven't been reprogrammed since 
leaving the groups really affect how long it takes to do a subsequent 
join?  I'm not convinced.

Is this time being consumed by the switches when they are asked to 
reprogram their tables (I assume some sort of routing table is used 
internally)?  What could they be doing that takes so long to do that? 
Is it something that a firmware change on the switch could alleviate?

Andrew


From hal.rosenstock at gmail.com  Thu Jul 19 10:32:03 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 19 Jul 2007 10:32:03 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <469F9BAB.4080504@open-mpi.org>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>
	<468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org>
Message-ID: <f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>

Andrew,

On 7/19/07, Andrew Friedley <afriedle at open-mpi.org> wrote:
>
> Finally was able to have the SM switched over from Cisco on the switch
> to OpenSM on a node.  Responses inline below..
>
> Sean Hefty wrote:
> >> Now the more interesting part.  I'm now able to run on a 128 node
> >> machine using open SM running on a node (before, I was running on an 8
> >> node machine which I'm told is running the Cisco SM on a Topspin
> >> switch).  On this machine, if I run my benchmark with two processes
> >> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able
> >> to join  > 750 groups simultaneously from one QP on each process.  To
> >> make this stranger, I can join only 4 groups running the same thing on
> >> the 8-node machine.
> >
> > Are the switches and HCAs in the two setups the same?  If you run the
> > same SM on both clusters, do you see the same results?
>
> The switches are different.  The 8 node machine uses a Topspin switch,
> the 128 node machine uses a Mellanox switch.  Looking at `ibstat` the
> HCAs appear to be the same (MT23108), though HCAs on the 128 node
> machine have firmware 3.2.0, where 3.5.0 is on the 8 node machine.  Does
> this matter?
>
> Running OpenSM now, I still do not see the same results.  Behavior is
> now the same as the 128 node machine, except when running two processes
> per node (in which case I can join as many groups as I like on the 128
> node machine).  On the 8 node machine I am still limited to 4 groups in
> this case.


I'm not quite parsing what is the same with what is different in the results
(and I presume the only variable is SM).

> This makes me think the switch is involved, is this correct?


I doubt it. It is either end station, SM, or a combination of the two.

>
> >> While doing so I noticed that the time from calling
> >> rdma_join_multicast() to the event arrival stayed fairly constant (in
> >> the .001sec range), while the time from the join call to actually
> >> receiving messages on the group steadily increased from around .1 secs
> >> to around 2.7 secs with 750+ groups.  Furthermore, this time does not
> >> drop back to .1 secs if I stop the benchmark and run it (or any of my
> >> other multicast code) again.  This is understandable within a single
> >> program run, but the fact that behavior persists across runs concerns
> >> me -- feels like a bug, but I don't have much concrete here.
> >
> > Even after all nodes leave all multicast groups, I don't believe that
> > there's a requirement for the SA to reprogram the switches immediately.
> >  So if the switches or the configuration of the switches are part of the
> > problem, I can imagine seeing issues between runs.
> >
> > When rdma_join_multicast() reports the join event, it means either: the
> > SA has been notified of the join request, or, if the port has already
> > joined the group, that a reference count on the group has been
> > incremented.  The SA may still require time to program the switch
> > forwarding tables.
>
> OK this makes sense, but I still don't see where all the time is going.
>   Should the fact that the switches haven't been reprogrammed since
> leaving the groups really affect how long it takes to do a subsequent
> join?  I'm not convinced.


It takes time for the SM to recalculate the multicast tree. While leaves can
be lazy, I forget whether joins are synchronous or not.

> Is this time being consumed by the switches when they are asked to
> reprogram their tables (I assume some sort of routing table is used
> internally)?


This is relatively quick compared to the policy for the SM rerouting of
multicast based on joins/leaves/group creation/deletion.

-- Hal

>   What could they be doing that takes so long to do that?
> Is it something that a firmware change on the switch could alleviate?
>
> Andrew
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From afriedle at open-mpi.org  Thu Jul 19 10:58:42 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 19 Jul 2007 10:58:42 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>	
	<468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org>
	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>
Message-ID: <469FA652.4060909@open-mpi.org>

Hal Rosenstock wrote:
> I'm not quite parsing what is the same with what is different in the 
> results
> (and I presume the only variable is SM).

Yes; this is confusing, I'll try to summarize the various behaviors I'm 
getting.

First, there are two machines.  One has 8 nodes and runs a Topspin 
switch with the Cisco SM on it.  The other is 128 nodes and runs a 
Mellanox switch with Open SM on a compute node.  OFED v1.2 is used on 
both.  Below is how many groups I can join using my test program 
(described elsewhere in the thread)

On the 8 node machine:
8 procs (one per node) -- 14 groups.
16 procs (two per node) -- 4 groups.

On the 128 node machine:
8 procs (one per node, 8 nodes used) -- 14 groups.
16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.

Some peculiarities complicate this.  On either machine, I've noticed 
that if I haven't been doing anything using IB multicast in say a day 
(haven't tried to figure out exactly how long), in any run scenario 
listed above, I can join 4 groups.  I do a couple runs where I hit 
errors after 4 groups, and then I consistently get the group counts 
above for the rest of the work day.

Second, in the cases in which I am able to join 14 groups, if I run my 
test program twice simultaneously on the same nodes, I am able to join a 
maximum of 14 groups total between the two running tests (as opposed to 
14 per test run).  Running the test twice simultaneously using a 
disjoint set of nodes is not an issue.

>> This makes me think the switch is involved, is this correct?
> 
> 
> I doubt it. It is either end station, SM, or a combination of the two.

OK.

>> OK this makes sense, but I still don't see where all the time is going.
>>   Should the fact that the switches haven't been reprogrammed since
>> leaving the groups really affect how long it takes to do a subsequent
>> join?  I'm not convinced.
> 
> 
> It takes time for the SM to recalculate the multicast tree. While leaves 
> can
> be lazy, I forget whether joins are synchronous or not.

Is the algorithm for recalculating the tree documented at all?  Or, 
where is the code for it (assuming I have access)?  I feel like I'm 
missing something here that explains why it's so costly.

Andrew

> 
> Is this time being consumed by the switches when they are asked to
>> reprogram their tables (I assume some sort of routing table is used
>> internally)?
> 
> 
> This is relatively quick compared to the policy for the SM rerouting of
> multicast based on joins/leaves/group creation/deletion.

OK.  Thanks for the insight.

Andrew


From afriedle at open-mpi.org  Thu Jul 19 11:14:00 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 19 Jul 2007 11:14:00 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <469FA652.4060909@open-mpi.org>
References: <46699A6D.4070300@open-mpi.org>
	<4683D7D6.50402@open-mpi.org>		<468426B6.3060602@ichips.intel.com>
	<469F9BAB.4080504@open-mpi.org>	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>
	<469FA652.4060909@open-mpi.org>
Message-ID: <469FA9E8.90609@open-mpi.org>


Andrew Friedley wrote:
> Hal Rosenstock wrote:
>> I'm not quite parsing what is the same with what is different in the 
>> results
>> (and I presume the only variable is SM).
> 
> Yes; this is confusing, I'll try to summarize the various behaviors I'm 
> getting.
> 
> First, there are two machines.  One has 8 nodes and runs a Topspin 
> switch with the Cisco SM on it.  The other is 128 nodes and runs a 
> Mellanox switch with Open SM on a compute node.  OFED v1.2 is used on 
> both.  Below is how many groups I can join using my test program 
> (described elsewhere in the thread)
> 
> On the 8 node machine:
> 8 procs (one per node) -- 14 groups.
> 16 procs (two per node) -- 4 groups.
> 
> On the 128 node machine:
> 8 procs (one per node, 8 nodes used) -- 14 groups.
> 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.
> 
> Some peculiarities complicate this.  On either machine, I've noticed 
> that if I haven't been doing anything using IB multicast in say a day 
> (haven't tried to figure out exactly how long), in any run scenario 
> listed above, I can join 4 groups.  I do a couple runs where I hit 
> errors after 4 groups, and then I consistently get the group counts 
> above for the rest of the work day.
> 
> Second, in the cases in which I am able to join 14 groups, if I run my 
> test program twice simultaneously on the same nodes, I am able to join a 
> maximum of 14 groups total between the two running tests (as opposed to 
> 14 per test run).  Running the test twice simultaneously using a 
> disjoint set of nodes is not an issue.

So I sent that last email before I meant to :)  Need to eat..  I've 
managed to confuse myself a little here too -- it looks like changing 
from the Cisco SM to the OpenSM did not change behavior on the 8 node 
machine.  At least, I'm still getting the same results above now that 
it's back on the Cisco SM.

Also some newer results.  I had a long run going on the 128 node machine 
to see how many groups I really could join, and it just errored out 
after joining 892 groups successfully.  Specifically, I got an 
RDMA_CM_EVENT_MULTICAST_ERROR event containing status -22 ('Unknown 
error' according to strerror).  errno is still cleared to 'Success'.  I 
don't have time to go look at the code to see where this came from right 
now, but does anyone know what it means?

Andrew
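
A possible explanation for the 'Unknown error' text, offered as a sketch
rather than a confirmed diagnosis: kernel-style interfaces commonly report
failures as negated errno values, while strerror() expects the positive
code. On Linux, 22 is EINVAL. Whether the status in this event really is a
negated errno is an assumption here, and the helper names
status_to_errno() and status_to_string() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map a negated-errno status (e.g. -22) to the
 * positive code that strerror() expects. Non-negative values pass
 * through unchanged. (INT_MIN is not handled; it is not a valid errno.) */
static int status_to_errno(int status)
{
    return status < 0 ? -status : status;
}

/* Describe a status value, assuming it follows the negated-errno
 * convention described above. */
static const char *status_to_string(int status)
{
    const char *msg = strerror(status_to_errno(status));

    assert(msg != NULL);
    return msg;
}
```

With this, status_to_string(-22) returns the same text as
strerror(EINVAL) instead of an 'Unknown error' message.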

> 
>>> This makes me think the switch is involved, is this correct?
>>
>>
>> I doubt it. It is either end station, SM, or a combination of the two.
> 
> OK.
> 
>>> OK this makes sense, but I still don't see where all the time is going.
>>>   Should the fact that the switches haven't been reprogrammed since
>>> leaving the groups really affect how long it takes to do a subsequent
>>> join?  I'm not convinced.
>>
>>
>> It takes time for the SM to recalculate the multicast tree. While 
>> leaves can
>> be lazy, I forget whether joins are synchronous or not.
> 
> Is the algorithm for recalculating the tree documented at all?  Or, 
> where is the code for it (assuming I have access)?  I feel like I'm 
> missing something here that explains why it's so costly.
> 
> Andrew
> 
>>
>> Is this time being consumed by the switches when they are asked to
>>> reprogram their tables (I assume some sort of routing table is used
>>> internally)?
>>
>>
>> This is relatively quick compared to the policy for the SM rerouting of
>> multicast based on joins/leaves/group creation/deletion.
> 
> OK.  Thanks for the insight.
> 
> Andrew


From hal.rosenstock at gmail.com  Thu Jul 19 11:14:12 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 19 Jul 2007 11:14:12 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <469FA652.4060909@open-mpi.org>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>
	<468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org>
	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>
	<469FA652.4060909@open-mpi.org>
Message-ID: <f0e08f230707191114i447224eci96761091373480d1@mail.gmail.com>

Andrew,

On 7/19/07, Andrew Friedley <afriedle at open-mpi.org> wrote:
>
> Hal Rosenstock wrote:
> > I'm not quite parsing what is the same with what is different in the
> > results
> > (and I presume the only variable is SM).
>
> Yes; this is confusing, I'll try to summarize the various behaviors I'm
> getting.
>
> First, there are two machines.  One has 8 nodes and runs a Topspin
> switch with the Cisco SM on it.  The other is 128 nodes and runs a
> Mellanox switch with Open SM on a compute node.  OFED v1.2 is used on
> both.  Below is how many groups I can join using my test program
> (described elsewhere in the thread)
>
> On the 8 node machine:
> 8 procs (one per node) -- 14 groups.
> 16 procs (two per node) -- 4 groups.
>
> On the 128 node machine:
> 8 procs (one per node, 8 nodes used) -- 14 groups.
> 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.
>
> Some peculiarities complicate this.  On either machine, I've noticed
> that if I haven't been doing anything using IB multicast in say a day
> (haven't tried to figure out exactly how long), in any run scenario
> listed above, I can join 4 groups.  I do a couple runs where I hit
> errors after 4 groups, and then I consistently get the group counts
> above for the rest of the work day.
>
> Second, in the cases in which I am able to join 14 groups, if I run my
> test program twice simultaneously on the same nodes, I am able to join a
> maximum of 14 groups total between the two running tests (as opposed to
> 14 per test run).  Running the test twice simultaneously using a
> disjoint set of nodes is not an issue.


Thanks. I can only comment on the OpenSM configuration and in general on SMs
so I'm still not sure what limits you are hitting; it may be multiple but
not sure. Some seemed to be end node (HCA) related based on a previous
email.

> >> This makes me think the switch is involved, is this correct?
> >
> >
> > I doubt it. It is either end station, SM, or a combination of the two.
>
> OK.
>
> >> OK this makes sense, but I still don't see where all the time is going.
> >>   Should the fact that the switches haven't been reprogrammed since
> >> leaving the groups really affect how long it takes to do a subsequent
> >> join?  I'm not convinced.
> >
> >
> > It takes time for the SM to recalculate the multicast tree. While leaves
> > can
> > be lazy, I forget whether joins are synchronous or not.
>
> Is the algorithm for recalculating the tree documented at all?  Or,
> where is the code for it (assuming I have access)?  I feel like I'm
> missing something here that explains why it's so costly.


I'm afraid it is just the code AFAIK :-(

-- Hal

> Andrew
>
> >
> > Is this time being consumed by the switches when they are asked to
> >> reprogram their tables (I assume some sort of routing table is used
> >> internally)?
> >
> >
> > This is relatively quick compared to the policy for the SM rerouting of
> > multicast based on joins/leaves/group creation/deletion.
>
> OK.  Thanks for the insight.
>
> Andrew
>

From afriedle at open-mpi.org  Thu Jul 19 11:18:00 2007
From: afriedle at open-mpi.org (Andrew Friedley)
Date: Thu, 19 Jul 2007 11:18:00 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <f0e08f230707191114i447224eci96761091373480d1@mail.gmail.com>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>	
	<468426B6.3060602@ichips.intel.com>
	<469F9BAB.4080504@open-mpi.org>	
	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>	
	<469FA652.4060909@open-mpi.org>
	<f0e08f230707191114i447224eci96761091373480d1@mail.gmail.com>
Message-ID: <469FAAD8.8050505@open-mpi.org>


Hal Rosenstock wrote:
> Thanks. I can only comment on the OpenSM configuration and in general on 
> SMs
> so I'm still not sure what limits you are hitting; it may be multiple but
> not sure. Some seemed to be end node (HCA) related based on a previous
> email.

Thanks for your help.  Yes I'm thinking the same thing, though what I'm 
seeing seemingly contradicts the limits that I'm told are in place (and 
have now been changed post-v1.2).

>> Is the algorithm for recalculating the tree documented at all?  Or,
>> where is the code for it (assuming I have access)?  I feel like I'm
>> missing something here that explains why it's so costly.
> 
> 
> I'm afraid it is just the code AFAIK :-(

OK, do you know where it is in the OpenSM code base?

Andrew


From arthur.jones at qlogic.com  Thu Jul 19 11:32:49 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Thu, 19 Jul 2007 11:32:49 -0700
Subject: [ofa-general] is ipath_layer.c dead code?
In-Reply-To: <ada1wf8w8gp.fsf@cisco.com>
References: <ada1wf8w8gp.fsf@cisco.com>
Message-ID: <20070719183249.GA20240@bauxite.pathscale.com>

hi roland, your patch was the right idea, but
i think the attached patch is more complete...

btw: this patch is avail via git pull from:

git://git.qlogic.com/ipath-linux-2.6 for-roland

arthur

On Mon, Jul 16, 2007 at 10:43:02AM -0700, Roland Dreier wrote:
> My kernel seems to build and link fine with the patch below.  Is
> ipath_layer.c being used for anything, or can we just kill it?
> 
>  - R.
-------------- next part --------------
IB/ipath - remove ipath_layer, the former network/verbs layer

From: Arthur Jones <arthur.jones at qlogic.com>

The ipath_layer.[ch] code was an attempt to
provide a single interface for the ipath verbs and
ipath_ether code to use.  As verbs functionality
increased, the layer's functionality became insufficient
and the verbs code broke away to interface directly
to the driver.  The failed attempt to get ipath_ether
upstream was the final nail in the coffin and now it sits
quietly in a dark kernel.org corner waiting for someone to
notice the smell and send it along to its final resting
place.  Roland Dreier was that someone -- this patch expands
on his work...

Signed-off-by: Arthur Jones <arthur.jones at qlogic.com>
---

 drivers/infiniband/hw/ipath/Makefile      |    1 
 drivers/infiniband/hw/ipath/ipath_layer.c |  365 -----------------------------
 drivers/infiniband/hw/ipath/ipath_layer.h |   71 ------
 drivers/infiniband/hw/ipath/ipath_verbs.h |    2 
 4 files changed, 0 insertions(+), 439 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile
index ec2e603..fe67388 100644
--- a/drivers/infiniband/hw/ipath/Makefile
+++ b/drivers/infiniband/hw/ipath/Makefile
@@ -14,7 +14,6 @@ ib_ipath-y := \
 	ipath_init_chip.o \
 	ipath_intr.o \
 	ipath_keys.o \
-	ipath_layer.o \
 	ipath_mad.o \
 	ipath_mmap.o \
 	ipath_mr.o \
diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c
deleted file mode 100644
index 82616b7..0000000
--- a/drivers/infiniband/hw/ipath/ipath_layer.c
+++ /dev/null
@@ -1,365 +0,0 @@
-/*
- * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
- * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-/*
- * These are the routines used by layered drivers, currently just the
- * layered ethernet driver and verbs layer.
- */
-
-#include <linux/io.h>
-#include <asm/byteorder.h>
-
-#include "ipath_kernel.h"
-#include "ipath_layer.h"
-#include "ipath_verbs.h"
-#include "ipath_common.h"
-
-/* Acquire before ipath_devs_lock. */
-static DEFINE_MUTEX(ipath_layer_mutex);
-
-u16 ipath_layer_rcv_opcode;
-
-static int (*layer_intr)(void *, u32);
-static int (*layer_rcv)(void *, void *, struct sk_buff *);
-static int (*layer_rcv_lid)(void *, void *);
-
-static void *(*layer_add_one)(int, struct ipath_devdata *);
-static void (*layer_remove_one)(void *);
-
-int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_intr)
-		ret = layer_intr(dd->ipath_layer.l_arg, arg);
-
-	return ret;
-}
-
-int ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
-{
-	int ret;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	ret = __ipath_layer_intr(dd, arg);
-
-	mutex_unlock(&ipath_layer_mutex);
-
-	return ret;
-}
-
-int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr,
-		      struct sk_buff *skb)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_rcv)
-		ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb);
-
-	return ret;
-}
-
-int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr)
-{
-	int ret = -ENODEV;
-
-	if (dd->ipath_layer.l_arg && layer_rcv_lid)
-		ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr);
-
-	return ret;
-}
-
-void ipath_layer_lid_changed(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (dd->ipath_layer.l_arg && layer_intr)
-		layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID);
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-void ipath_layer_add(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (layer_add_one)
-		dd->ipath_layer.l_arg =
-			layer_add_one(dd->ipath_unit, dd);
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-void ipath_layer_remove(struct ipath_devdata *dd)
-{
-	mutex_lock(&ipath_layer_mutex);
-
-	if (dd->ipath_layer.l_arg && layer_remove_one) {
-		layer_remove_one(dd->ipath_layer.l_arg);
-		dd->ipath_layer.l_arg = NULL;
-	}
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *),
-			 void (*l_remove)(void *),
-			 int (*l_intr)(void *, u32),
-			 int (*l_rcv)(void *, void *, struct sk_buff *),
-			 u16 l_rcv_opcode,
-			 int (*l_rcv_lid)(void *, void *))
-{
-	struct ipath_devdata *dd, *tmp;
-	unsigned long flags;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	layer_add_one = l_add;
-	layer_remove_one = l_remove;
-	layer_intr = l_intr;
-	layer_rcv = l_rcv;
-	layer_rcv_lid = l_rcv_lid;
-	ipath_layer_rcv_opcode = l_rcv_opcode;
-
-	spin_lock_irqsave(&ipath_devs_lock, flags);
-
-	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
-		if (!(dd->ipath_flags & IPATH_INITTED))
-			continue;
-
-		if (dd->ipath_layer.l_arg)
-			continue;
-
-		spin_unlock_irqrestore(&ipath_devs_lock, flags);
-		dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd);
-		spin_lock_irqsave(&ipath_devs_lock, flags);
-	}
-
-	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-	mutex_unlock(&ipath_layer_mutex);
-
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_register);
-
-void ipath_layer_unregister(void)
-{
-	struct ipath_devdata *dd, *tmp;
-	unsigned long flags;
-
-	mutex_lock(&ipath_layer_mutex);
-	spin_lock_irqsave(&ipath_devs_lock, flags);
-
-	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
-		if (dd->ipath_layer.l_arg && layer_remove_one) {
-			spin_unlock_irqrestore(&ipath_devs_lock, flags);
-			layer_remove_one(dd->ipath_layer.l_arg);
-			spin_lock_irqsave(&ipath_devs_lock, flags);
-			dd->ipath_layer.l_arg = NULL;
-		}
-	}
-
-	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
-	layer_add_one = NULL;
-	layer_remove_one = NULL;
-	layer_intr = NULL;
-	layer_rcv = NULL;
-	layer_rcv_lid = NULL;
-
-	mutex_unlock(&ipath_layer_mutex);
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_unregister);
-
-int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax)
-{
-	int ret;
-	u32 intval = 0;
-
-	mutex_lock(&ipath_layer_mutex);
-
-	if (!dd->ipath_layer.l_arg) {
-		ret = -EINVAL;
-		goto bail;
-	}
-
-	ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS);
-
-	if (ret < 0)
-		goto bail;
-
-	*pktmax = dd->ipath_ibmaxlen;
-
-	if (*dd->ipath_statusp & IPATH_STATUS_IB_READY)
-		intval |= IPATH_LAYER_INT_IF_UP;
-	if (dd->ipath_lid)
-		intval |= IPATH_LAYER_INT_LID;
-	if (dd->ipath_mlid)
-		intval |= IPATH_LAYER_INT_BCAST;
-	/*
-	 * do this on open, in case low level is already up and
-	 * just layered driver was reloaded, etc.
-	 */
-	if (intval)
-		layer_intr(dd->ipath_layer.l_arg, intval);
-
-	ret = 0;
-bail:
-	mutex_unlock(&ipath_layer_mutex);
-
-	return ret;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_open);
-
-u16 ipath_layer_get_lid(struct ipath_devdata *dd)
-{
-	return dd->ipath_lid;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_lid);
-
-/**
- * ipath_layer_get_mac - get the MAC address
- * @dd: the infinipath device
- * @mac: the MAC is put here
- *
- * This is the EUID-64 OUI octets (top 3), then
- * skip the next 2 (which should both be zero or 0xff).
- * The returned MAC is in network order
- * mac points to at least 6 bytes of buffer
- * We assume that by the time the LID is set, that the GUID is as valid
- * as it's ever going to be, rather than adding yet another status bit.
- */
-
-int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac)
-{
-	u8 *guid;
-
-	guid = (u8 *) &dd->ipath_guid;
-
-	mac[0] = guid[0];
-	mac[1] = guid[1];
-	mac[2] = guid[2];
-	mac[3] = guid[5];
-	mac[4] = guid[6];
-	mac[5] = guid[7];
-	if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff))
-		ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: "
-			  "%x %x\n", guid[3], guid[4]);
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_mac);
-
-u16 ipath_layer_get_bcast(struct ipath_devdata *dd)
-{
-	return dd->ipath_mlid;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_get_bcast);
-
-int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr)
-{
-	int ret = 0;
-	u32 __iomem *piobuf;
-	u32 plen, *uhdr;
-	size_t count;
-	__be16 vlsllnh;
-
-	if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) {
-		ipath_dbg("send while not open\n");
-		ret = -EINVAL;
-	} else
-		if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) ||
-		    dd->ipath_lid == 0) {
-			/*
-			 * lid check is for when sma hasn't yet configured
-			 */
-			ret = -ENETDOWN;
-			ipath_cdbg(VERBOSE, "send while not ready, "
-				   "mylid=%u, flags=0x%x\n",
-				   dd->ipath_lid, dd->ipath_flags);
-		}
-
-	vlsllnh = *((__be16 *) hdr);
-	if (vlsllnh != htons(IPATH_LRH_BTH)) {
-		ipath_dbg("Warning: lrh[0] wrong (%x, not %x); "
-			  "not sending\n", be16_to_cpu(vlsllnh),
-			  IPATH_LRH_BTH);
-		ret = -EINVAL;
-	}
-	if (ret)
-		goto done;
-
-	/* Get a PIO buffer to use. */
-	piobuf = ipath_getpiobuf(dd, NULL);
-	if (piobuf == NULL) {
-		ret = -EBUSY;
-		goto done;
-	}
-
-	plen = (sizeof(*hdr) >> 2); /* actual length */
-	ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf);
-
-	writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */
-	ipath_flush_wc();
-	piobuf += 2;
-	uhdr = (u32 *)hdr;
-	count = plen-1; /* amount we can copy before trigger word */
-	__iowrite32_copy(piobuf, uhdr, count);
-	ipath_flush_wc();
-	__raw_writel(uhdr[count], piobuf + count);
-	ipath_flush_wc(); /* ensure it's sent, now */
-
-	ipath_stats.sps_ether_spkts++;	/* ether packet sent */
-
-done:
-	return ret;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_send_hdr);
-
-int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd)
-{
-	set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl);
-
-	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-			 dd->ipath_sendctrl);
-	return 0;
-}
-
-EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int);
diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h
deleted file mode 100644
index 415709c..0000000
--- a/drivers/infiniband/hw/ipath/ipath_layer.h
+++ /dev/null
@@ -1,71 +0,0 @@
-/*
- * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
- * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#ifndef _IPATH_LAYER_H
-#define _IPATH_LAYER_H
-
-/*
- * This header file is for symbols shared between the infinipath driver
- * and drivers layered upon it (such as ipath).
- */
-
-struct sk_buff;
-struct ipath_devdata;
-struct ether_header;
-
-int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *),
-			 void (*l_remove)(void *),
-			 int (*l_intr)(void *, u32),
-			 int (*l_rcv)(void *, void *,
-				      struct sk_buff *),
-			 u16 rcv_opcode,
-			 int (*l_rcv_lid)(void *, void *));
-void ipath_layer_unregister(void);
-int ipath_layer_open(struct ipath_devdata *, u32 * pktmax);
-u16 ipath_layer_get_lid(struct ipath_devdata *dd);
-int ipath_layer_get_mac(struct ipath_devdata *dd, u8 *);
-u16 ipath_layer_get_bcast(struct ipath_devdata *dd);
-int ipath_layer_send_hdr(struct ipath_devdata *dd,
-			 struct ether_header *hdr);
-int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd);
-
-/* ipath_ether interrupt values */
-#define IPATH_LAYER_INT_IF_UP 0x2
-#define IPATH_LAYER_INT_IF_DOWN 0x4
-#define IPATH_LAYER_INT_LID 0x8
-#define IPATH_LAYER_INT_SEND_CONTINUE 0x10
-#define IPATH_LAYER_INT_BCAST 0x40
-
-extern unsigned ipath_debug; /* debugging bit mask */
-
-#endif				/* _IPATH_LAYER_H */
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index f3d1f2c..0a233f5 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -42,8 +42,6 @@
 #include <rdma/ib_pack.h>
 #include <rdma/ib_user_verbs.h>
 
-#include "ipath_layer.h"
-
 #define IPATH_MAX_RDMA_ATOMIC	4
 
 #define QPN_MAX                 (1 << 24)

From pradeeps at linux.vnet.ibm.com  Thu Jul 19 11:55:39 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Thu, 19 Jul 2007 11:55:39 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V8] patch 
Message-ID: <469FB3AB.6080304@linux.vnet.ibm.com>

Addressed Roland's comments and more (hope this passes muster :)). 
The event_handler issue pointed out will be addressed in another patch.


Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---
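Reader's sketch, not part of the patch: the NOSRQ path in the diff that
follows packs two values into the 64-bit id handed to post_receive_nosrq()
and ipoib_cm_alloc_rx_skb(). The slot in the per-connection rx_ring goes
in the upper 32 bits and the rx_index_table index in the low bits. The
standalone C below only mirrors that arithmetic. nosrq_pack(),
nosrq_index() and nosrq_slot() are hypothetical helper names that do not
appear in the patch; the constants are copied from the ipoib.h hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in the ipoib.h hunk of this patch. */
#define NOSRQ_INDEX_TABLE_SIZE 128
#define NOSRQ_INDEX_MASK       (NOSRQ_INDEX_TABLE_SIZE - 1)
#define IPOIB_CM_OP_RECV       (1ull << 30)

/* Hypothetical helper: ring slot in bits 63..32, table index in the low bits. */
static uint64_t nosrq_pack(uint32_t slot, uint32_t index)
{
	assert(index < NOSRQ_INDEX_TABLE_SIZE);
	return ((uint64_t)slot << 32) | index;
}

/* Hypothetical helper: recover the rx_index_table index from an id. */
static uint32_t nosrq_index(uint64_t id)
{
	return (uint32_t)(id & NOSRQ_INDEX_MASK);
}

/* Hypothetical helper: recover the rx_ring slot from an id. */
static uint32_t nosrq_slot(uint64_t id)
{
	return (uint32_t)(id >> 32);
}
```

Since IPOIB_CM_OP_RECV sits at bit 30, ORing it into the id touches
neither field, so both values can still be recovered from a completion's
wr_id by a shift and a mask.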

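Reader's sketch, not part of the patch: allocate_and_post_rbuf_nosrq()
in the diff that follows rejects a new connection once the receive
buffers of all NOSRQ RC QPs would reach max_receive_buffer megabytes.
The check can be restated as the small function below.
nosrq_over_budget() is a hypothetical name, and the packet size is
passed in as a parameter because CM_PACKET_SIZE depends on IPOIB_CM_MTU
and PAGE_SIZE, neither of which is shown in this mail.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the limit test in the patch:
 * bytes in use = ring entries per QP * active QPs * bytes per entry,
 * compared against the module parameter expressed in MB.
 */
static bool nosrq_over_budget(uint32_t recvq_size, uint32_t active_qps,
			      uint64_t packet_size, uint32_t max_recv_buf_mb)
{
	uint64_t used = (uint64_t)recvq_size * active_qps * packet_size;

	/* Same ">=" comparison as the patch: reaching the limit is a reject. */
	return used >= ((uint64_t)max_recv_buf_mb << 20);
}
```

As an illustration only, assuming a 128-entry ring and 64 KB per entry,
each connection pins 8 MB, so a 1024 MB budget is reached by the 128th
connection.
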
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-19 11:17:39.000000000 -0400
@@ -95,11 +95,15 @@ enum {
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
 };
 
+#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE))
 #define	IPOIB_OP_RECV   (1ul << 31)
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
-#define	IPOIB_CM_OP_SRQ (1ul << 30)
+#define	IPOIB_CM_OP_RECV (1ul << 30)
+
+#define NOSRQ_INDEX_TABLE_SIZE 128
+#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_TABLE_SIZE -1)
 #else
-#define	IPOIB_CM_OP_SRQ (0)
+#define	IPOIB_CM_OP_RECV (0)
 #endif
 
 /* structs */
@@ -166,11 +170,14 @@ enum ipoib_cm_state {
 };
 
 struct ipoib_cm_rx {
-	struct ib_cm_id     *id;
-	struct ib_qp        *qp;
-	struct list_head     list;
-	struct net_device   *dev;
-	unsigned long        jiffies;
+	struct ib_cm_id     	*id;
+	struct ib_qp        	*qp;
+	struct ipoib_cm_rx_buf  *rx_ring; /* Used by NOSRQ only */
+	struct list_head     	 list;
+	struct net_device   	*dev;
+	unsigned long        	 jiffies;
+	u32                      index; /* wr_ids are distinguished by index
+					 * to identify the QP -NOSRQ only */
 	enum ipoib_cm_state  state;
 };
 
@@ -215,6 +222,8 @@ struct ipoib_cm_dev_priv {
 	struct ib_wc            ibwc[IPOIB_NUM_WC];
 	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
 	struct ib_recv_wr       rx_wr;
+	struct ipoib_cm_rx	**rx_index_table; /* See ipoib_cm_dev_init()
+						   *for usage of this element */
 };
 
 /*
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-10 17:02:33.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-19 13:55:59.000000000 -0400
@@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level,
 
 #include "ipoib.h"
 
+static int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
+static int max_recv_buf = 1024; /* Default is 1024 MB */
+
+module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
+MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
+
+module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
+MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
+
+static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for NOSRQ */
+
 #define IPOIB_CM_IETF_ID 0x1000000000000000ULL
 
 #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ)
@@ -81,20 +92,21 @@ static void ipoib_cm_dma_unmap_rx(struct
 		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int post_receive_srq(struct net_device *dev, u64 id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
-	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ;
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
 		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 
 	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
 	if (unlikely(ret)) {
-		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
+		ipoib_warn(priv, "post srq failed for buf %lld (%d)\n",
+			   (unsigned long long)id, ret);
 		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
 				      priv->cm.srq_ring[id].mapping);
 		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
@@ -104,12 +116,47 @@ static int ipoib_cm_post_receive(struct 
 	return ret;
 }
 
-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int post_receive_nosrq(struct net_device *dev, u64 id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+	u32 index;
+	u32 wr_id;
+	struct ipoib_cm_rx *rx_ptr;
+
+	index = id  & NOSRQ_INDEX_MASK ;
+	wr_id = id >> 32;
+
+	rx_ptr = priv->cm.rx_index_table[index];
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i];
+
+	ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post recv failed for buf %d (%d)\n",
+			   wr_id, ret);
+		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+				      rx_ptr->rx_ring[wr_id].mapping);
+		dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb);
+		rx_ptr->rx_ring[wr_id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id,
+					     int frags,
 					     u64 mapping[IPOIB_CM_RX_SG])
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	int i;
+	struct ipoib_cm_rx *rx_ptr;
+	u32 index, wr_id;
 
 	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
 	if (unlikely(!skb))
@@ -141,7 +188,14 @@ static struct sk_buff *ipoib_cm_alloc_rx
 			goto partial_error;
 	}
 
-	priv->cm.srq_ring[id].skb = skb;
+	if (priv->cm.srq)
+		priv->cm.srq_ring[id].skb = skb;
+	else {
+		index = id  & NOSRQ_INDEX_MASK ;
+		wr_id = id >> 32;
+		rx_ptr = priv->cm.rx_index_table[index];
+		rx_ptr->rx_ring[wr_id].skb = skb;
+	}
 	return skb;
 
 partial_error:
@@ -198,16 +252,21 @@ static struct ib_qp *ipoib_cm_create_rx_
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.event_handler = ipoib_cm_rx_event_handler,
 		.send_cq = priv->cq, /* For drain WR */
 		.recv_cq = priv->cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = 1, /* For drain WR */
+		.cap.max_recv_wr = ipoib_recvq_size + 1,
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type = IB_QPT_RC,
 		.qp_context = p,
 	};
+	if (!priv->cm.srq) {
+		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
+		attr.event_handler = NULL;
+	} else
+		attr.event_handler = ipoib_cm_rx_event_handler;
 	return ib_create_qp(priv->pd, &attr);
 }
 
@@ -282,12 +341,129 @@ static int ipoib_cm_send_rep(struct net_
 	rep.flow_control = 0;
 	rep.rnr_retry_count = req->rnr_retry_count;
 	rep.target_ack_delay = 20; /* FIXME */
-	rep.srq = 1;
 	rep.qp_num = qp->qp_num;
 	rep.starting_psn = psn;
+	rep.srq	= !!priv->cm.srq;
 	return ib_send_cm_rep(cm_id, &rep);
 }
 
+static void init_context_and_add_list(struct ib_cm_id *cm_id,
+				    struct ipoib_cm_rx *p,
+				    struct ipoib_dev_priv *priv)
+{
+	cm_id->context = p;
+	p->jiffies = jiffies;
+	spin_lock_irq(&priv->lock);
+	if (list_empty(&priv->cm.passive_ids))
+		queue_delayed_work(ipoib_workqueue,
+				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
+	if (priv->cm.srq) {
+		/* Add this entry to passive ids list head, but do not re-add
+		 * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush
+		 * list.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+	}
+	spin_unlock_irq(&priv->lock);
+}
+
+static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id,
+					struct ipoib_cm_rx *p, unsigned psn)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+	u32 qp_num, index;
+	u64 i, recv_mem_used;
+
+	qp_num = p->qp->qp_num;
+
+	/* In the SRQ case there is a common rx buffer called the srq_ring.
+	 * However, for the NOSRQ we create an rx_ring for every
+	 * struct ipoib_cm_rx.
+	 */
+	p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL);
+	if (!p->rx_ring) {
+		printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n",
+		       qp_num);
+		return -ENOMEM;
+	}
+
+	spin_lock_irq(&priv->lock);
+	list_add(&p->list, &priv->cm.passive_ids);
+	spin_unlock_irq(&priv->lock);
+
+	init_context_and_add_list(cm_id, p, priv);
+	spin_lock_irq(&priv->lock);
+
+	for (index = 0; index < max_rc_qp; index++)
+		if (priv->cm.rx_index_table[index] == NULL)
+			break;
+
+	recv_mem_used = (u64)ipoib_recvq_size *
+			(u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE;
+	if ((index == max_rc_qp) ||
+	    (recv_mem_used >= max_recv_buf * (1ul << 20))) {
+		spin_unlock_irq(&priv->lock);
+		ipoib_warn(priv, "NOSRQ has reached the configurable limit "
+			   "of either %d RC QPs or, max recv buf size of "
+			   "0x%x MB\n", max_rc_qp, max_recv_buf);
+
+		/* We send a REJ to the remote side indicating that we
+		 * have no more free RC QPs and leave it to the remote side
+		 * to take appropriate action. This should leave the
+		 * current set of QPs unaffected and any subsequent REQs
+		 * will be able to use RC QPs if they are available.
+		 */
+		ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
+		ret = -EINVAL;
+		goto err_alloc_and_post;
+	}
+
+	priv->cm.rx_index_table[index] = p;
+	spin_unlock_irq(&priv->lock);
+
+	/* We will subsequently use this stored pointer while freeing
+	 * resources in stale task
+	 */
+	p->index = index;
+
+	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret);
+		ipoib_cm_dev_cleanup(dev);
+		goto err_alloc_and_post;
+	}
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
+					   IPOIB_CM_RX_SG - 1,
+					   p->rx_ring[i].mapping)) {
+			ipoib_warn(priv, "failed to allocate receive "
+				   "buffer %d\n", (int)i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -ENOMEM;
+			goto err_alloc_and_post;
+		}
+
+		if (post_receive_nosrq(dev, i << 32 | index)) {
+			ipoib_warn(priv, "post_receive_nosrq "
+				   "failed for  buf %lld\n", (unsigned long long)i);
+			ipoib_cm_dev_cleanup(dev);
+			ret = -EIO;
+			goto err_alloc_and_post;
+		}
+	}
+
+	return 0;
+
+err_alloc_and_post:
+	atomic_dec(&current_rc_qp);
+	kfree(p->rx_ring);
+	return ret;
+}
+
 static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 {
 	struct net_device *dev = cm_id->context;
@@ -302,9 +478,6 @@ static int ipoib_cm_req_handler(struct i
 		return -ENOMEM;
 	p->dev = dev;
 	p->id = cm_id;
-	cm_id->context = p;
-	p->state = IPOIB_CM_RX_LIVE;
-	p->jiffies = jiffies;
 	INIT_LIST_HEAD(&p->list);
 
 	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -314,19 +487,21 @@ static int ipoib_cm_req_handler(struct i
 	}
 
 	psn = random32() & 0xffffff;
-	ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
-	if (ret)
-		goto err_modify;
+	if (!priv->cm.srq) {
+		ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn);
+		if (ret)
+			goto err_post_nosrq;
+	} else {
+		p->rx_ring = NULL;
+		ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn);
+		if (ret)
+			goto err_modify;
+	}
 
-	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
-			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
-	/* Add this entry to passive ids list head, but do not re-add it
-	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
-	p->jiffies = jiffies;
-	if (p->state == IPOIB_CM_RX_LIVE)
-		list_move(&p->list, &priv->cm.passive_ids);
-	spin_unlock_irq(&priv->lock);
+	if (priv->cm.srq) {
+		p->state = IPOIB_CM_RX_LIVE;
+		init_context_and_add_list(cm_id, p, priv);
+	}
 
 	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn);
 	if (ret) {
@@ -336,6 +511,8 @@ static int ipoib_cm_req_handler(struct i
 	}
 	return 0;
 
+err_post_nosrq:
+	list_del_init(&p->list);
 err_modify:
 	ib_destroy_qp(p->qp);
 err_qp:
@@ -399,29 +576,60 @@ static void skb_put_frags(struct sk_buff
 	}
 }
 
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed.
+		 */
+		if (p->state == IPOIB_CM_RX_LIVE)
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	unsigned long flags;
+
+	if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		p->jiffies = jiffies;
+		/* Move this entry to list head, but do
+		 * not re-add it if it has been removed. */
+		if (!list_empty(&p->list))
+			list_move(&p->list, &priv->cm.passive_ids);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
+}
+
+void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;
+	u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV;
 	struct sk_buff *skb, *newskb;
 	struct ipoib_cm_rx *p;
 	unsigned long flags;
 	u64 mapping[IPOIB_CM_RX_SG];
-	int frags;
+	int frags, ret;
 
-	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
-		       wr_id, wc->status);
+	ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n",
+		       (unsigned long long)wr_id, wc->status);
 
 	if (unlikely(wr_id >= ipoib_recvq_size)) {
-		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) {
+		if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) {
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
 			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
-			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
-				   wr_id, ipoib_recvq_size);
+			ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
 		return;
 	}
 
@@ -429,23 +637,15 @@ void ipoib_cm_handle_rx_wc(struct net_de
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		ipoib_dbg(priv, "cm recv error "
-			   "(status=%d, wrid=%d vend_err %x)\n",
-			   wc->status, wr_id, wc->vendor_err);
+			   "(status=%d, wrid=%lld vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
 		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_srq;
 	}
 
 	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
 		p = wc->qp->qp_context;
-		if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) {
-			spin_lock_irqsave(&priv->lock, flags);
-			p->jiffies = jiffies;
-			/* Move this entry to list head, but do not re-add it
-			 * if it has been moved out of list. */
-			if (p->state == IPOIB_CM_RX_LIVE)
-				list_move(&p->list, &priv->cm.passive_ids);
-			spin_unlock_irqrestore(&priv->lock, flags);
-		}
+		timer_check_srq(priv, p);
 	}
 
 	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
@@ -457,13 +657,113 @@ void ipoib_cm_handle_rx_wc(struct net_de
 		 * If we can't allocate a new RX buffer, dump
 		 * this packet and reuse the old buffer.
 		 */
-		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+		ipoib_dbg(priv, "failed to allocate receive buffer %lld\n",
+			  (unsigned long long)wr_id);
+		++priv->stats.rx_dropped;
+		goto repost_srq;
+	}
+
+	ipoib_cm_dma_unmap_rx(priv, frags,
+			      priv->cm.srq_ring[wr_id].mapping);
+	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
+
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb_reset_mac_header(skb);
+	skb_pull(skb, IPOIB_ENCAP_LEN);
+
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
+
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_receive_skb(skb);
+
+repost_srq:
+	ret = post_receive_srq(dev, wr_id);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_srq failed for buf %lld\n",
+			   (unsigned long long)wr_id);
+
+}
+
+static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb, *newskb;
+	u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32;
+	u32 index;
+	struct ipoib_cm_rx *rx_ptr;
+	int frags, ret;
+
+
+	ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n",
+		       (unsigned long long)wr_id, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n",
+				   (unsigned long long)wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ;
+
+	/* This is the only place where rx_ptr could be a NULL - could
+	 * have just received a packet from a connection that has become
+	 * stale and so is going away. We will simply drop the packet and
+	 * let the hardware (it's IB_QPT_RC) handle the dropped packet.
+	 * In the timer_check() function below, p->jiffies is updated and
+	 * hence the connection will not be stale after that.
+	 */
+	rx_ptr = priv->cm.rx_index_table[index];
+	if (unlikely(!rx_ptr)) {
+		ipoib_warn(priv, "Received packet from a connection "
+			   "that is going away. Hardware will handle it.\n");
+		return;
+	}
+
+	skb = rx_ptr->rx_ring[wr_id].skb;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ipoib_dbg(priv, "cm recv error "
+			   "(status=%d, wrid=%lld vend_err %x)\n",
+			   wc->status, (unsigned long long)wr_id, wc->vendor_err);
+		++priv->stats.rx_dropped;
+		goto repost_nosrq;
+	}
+
+	if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+		/* There are no guarantees that wc->qp is not NULL for HCAs
+		 * that do not support SRQ. */
+		timer_check_nosrq(priv, rx_ptr);
+	}
+
+	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
+					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
+
+	newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags,
+				       mapping);
+	if (unlikely(!newskb)) {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		ipoib_dbg(priv, "failed to allocate receive buffer %lld\n",
+			  (unsigned long long)wr_id);
 		++priv->stats.rx_dropped;
-		goto repost;
+		goto repost_nosrq;
 	}
 
-	ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping);
-	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping);
+	ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping);
+	memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping,
+	       (frags + 1) * sizeof *mapping);
 
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
@@ -483,10 +783,22 @@ void ipoib_cm_handle_rx_wc(struct net_de
 	skb->pkt_type = PACKET_HOST;
 	netif_receive_skb(skb);
 
-repost:
-	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
-		ipoib_warn(priv, "ipoib_cm_post_receive failed "
-			   "for buf %d\n", wr_id);
+repost_nosrq:
+	ret = post_receive_nosrq(dev, wr_id << 32 | index);
+
+	if (unlikely(ret))
+		ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n",
+			   (unsigned long long)wr_id);
+}
+
+void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->cm.srq)
+		handle_rx_wc_srq(dev, wc);
+	else
+		handle_rx_wc_nosrq(dev, wc);
 }
 
 static inline int post_send(struct ipoib_dev_priv *priv,
@@ -678,6 +990,42 @@ err_cm:
 	return ret;
 }
 
+static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p)
+{
+	int i;
+
+	for (i = 0; i < ipoib_recvq_size; ++i)
+		if (p->rx_ring[i].skb) {
+			ipoib_cm_dma_unmap_rx(priv,
+					 IPOIB_CM_RX_SG - 1,
+					 p->rx_ring[i].mapping);
+			dev_kfree_skb_any(p->rx_ring[i].skb);
+			p->rx_ring[i].skb = NULL;
+		}
+	kfree(p->rx_ring);
+}
+
+void dev_stop_nosrq(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_cm_rx *p;
+
+	spin_lock_irq(&priv->lock);
+	while (!list_empty(&priv->cm.passive_ids)) {
+		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		free_resources_nosrq(priv, p);
+		list_del(&p->list);
+		spin_unlock_irq(&priv->lock);
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		atomic_dec(&current_rc_qp);
+		kfree(p);
+		spin_lock_irq(&priv->lock);
+	}
+	spin_unlock_irq(&priv->lock);
+
+	cancel_delayed_work(&priv->cm.stale_task);
+}
+
 void ipoib_cm_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -692,6 +1040,11 @@ void ipoib_cm_dev_stop(struct net_device
 	ib_destroy_cm_id(priv->cm.id);
 	priv->cm.id = NULL;
 
+	if (!priv->cm.srq) {
+		dev_stop_nosrq(priv);
+		return;
+	}
+
 	spin_lock_irq(&priv->lock);
 	while (!list_empty(&priv->cm.passive_ids)) {
 		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
@@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_
 	attr.recv_cq = priv->cq;
 	attr.srq = priv->cm.srq;
 	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_recv_wr = 1;
 	attr.cap.max_send_sge = 1;
+	attr.cap.max_recv_sge = 1;
 	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
 	attr.qp_type = IB_QPT_RC;
 	attr.send_cq = cq;
@@ -855,7 +1210,7 @@ static int ipoib_cm_send_req(struct net_
 	req.retry_count 	      = 0; /* RFC draft warns against retries */
 	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
 	req.max_cm_retries 	      = 15;
-	req.srq 	              = 1;
+	req.srq			      = !!priv->cm.srq;
 	return ib_send_cm_req(id, &req);
 }
 
@@ -1200,6 +1555,8 @@ static void ipoib_cm_rx_reap(struct work
 	list_for_each_entry_safe(p, n, &list, list) {
 		ib_destroy_cm_id(p->id);
 		ib_destroy_qp(p->qp);
+		if (!priv->cm.srq)
+			atomic_dec(&current_rc_qp);
 		kfree(p);
 	}
 }
@@ -1218,12 +1575,19 @@ static void ipoib_cm_stale_task(struct w
 		p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list);
 		if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT))
 			break;
-		list_move(&p->list, &priv->cm.rx_error_list);
-		p->state = IPOIB_CM_RX_ERROR;
-		spin_unlock_irq(&priv->lock);
-		ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
-		if (ret)
-			ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		if (!priv->cm.srq) {
+			free_resources_nosrq(priv, p);
+			list_del_init(&p->list);
+			priv->cm.rx_index_table[p->index] = NULL;
+			spin_unlock_irq(&priv->lock);
+		} else {
+			list_move(&p->list, &priv->cm.rx_error_list);
+			p->state = IPOIB_CM_RX_ERROR;
+			spin_unlock_irq(&priv->lock);
+			ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE);
+			if (ret)
+				ipoib_warn(priv, "unable to move qp to error state: %d\n", ret);
+		}
 		spin_lock_irq(&priv->lock);
 	}
 
@@ -1277,16 +1641,40 @@ int ipoib_cm_add_mode_attr(struct net_de
 	return device_create_file(&dev->dev, &dev_attr_mode);
 }
 
+static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv)
+{
+	struct ib_srq_init_attr srq_init_attr;
+	int ret;
+
+	srq_init_attr.attr.max_wr = ipoib_recvq_size;
+	srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG;
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size *
+				    sizeof *priv->cm.srq_ring,
+				    GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring "
+		       "(%d entries)\n",
+			priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 int ipoib_cm_dev_init(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_srq_init_attr srq_init_attr = {
-		.attr = {
-			.max_wr  = ipoib_recvq_size,
-			.max_sge = IPOIB_CM_RX_SG
-		}
-	};
 	int ret, i;
+	struct ib_device_attr attr;
 
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
 	INIT_LIST_HEAD(&priv->cm.reap_list);
@@ -1303,20 +1691,32 @@ int ipoib_cm_dev_init(struct net_device 
 
 	skb_queue_head_init(&priv->cm.skb_queue);
 
-	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
-	if (IS_ERR(priv->cm.srq)) {
-		ret = PTR_ERR(priv->cm.srq);
-		priv->cm.srq = NULL;
+	ret = ib_query_device(priv->ca, &attr);
+	if (ret)
 		return ret;
-	}
 
-	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
-				    GFP_KERNEL);
-	if (!priv->cm.srq_ring) {
-		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
-		       priv->ca->name, ipoib_recvq_size);
-		ipoib_cm_dev_cleanup(dev);
-		return -ENOMEM;
+	if (attr.max_srq) {
+		/* This device supports SRQ */
+		ret = create_srq(dev, priv);
+		if (ret)
+			return ret;
+		priv->cm.rx_index_table = NULL;
+	} else {
+		priv->cm.srq = NULL;
+		priv->cm.srq_ring = NULL;
+
+		/* Every new REQ that arrives creates a struct ipoib_cm_rx.
+		 * These structures form a linked list starting with the
+		 * passive_ids. For quick and easy access we maintain a table
+		 * of pointers to struct ipoib_cm_rx called the rx_index_table
+		 */
+		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
+					 sizeof *priv->cm.rx_index_table,
+					 GFP_KERNEL);
+		if (!priv->cm.rx_index_table) {
+			printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n");
+			return -ENOMEM;
+		}
 	}
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -1329,17 +1729,24 @@ int ipoib_cm_dev_init(struct net_device 
 	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
 	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;
 
-	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
+	/* One can post receive buffers even before the RX QP is created
+	 * only in the SRQ case. Therefore for NOSRQ we skip the rest of init
+	 * and do that in ipoib_cm_req_handler()
+	 */
+
+	if (priv->cm.srq) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1,
 					   priv->cm.srq_ring[i].mapping)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -ENOMEM;
-		}
-		if (ipoib_cm_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
-			ipoib_cm_dev_cleanup(dev);
-			return -EIO;
+				ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -ENOMEM;
+			}
+			if (post_receive_srq(dev, i)) {
+				ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i);
+				ipoib_cm_dev_cleanup(dev);
+				return -EIO;
+			}
 		}
 	}
 
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-10 18:30:10.000000000 -0400
@@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i
 		for (i = 0; i < n; ++i) {
 			struct ib_wc *wc = priv->ibwc + i;
 
-			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
+			if (wc->wr_id & IPOIB_CM_OP_RECV) {
 				++done;
 				--max;
 				ipoib_cm_handle_rx_wc(dev, wc);
@@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d
 	do {
 		n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
-			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ)
+			if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV)
 				ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
 			else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV)
 				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-05-30 14:56:25.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-19 02:55:24.000000000 -0400
@@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_
 	if (!ret)
 		size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */;
 
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+
+	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
+	 * overflow. Every new REQ creates a new RX QP and each QP has an
+	 * RX ring associated with it. Therefore we could have
+	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
+	 * in a CQ.
+	 */
+	if (!priv->cm.srq)
+		size += (NOSRQ_INDEX_TABLE_SIZE - 1) * ipoib_recvq_size;
+#endif
+
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->cq)) {
 		printk(KERN_WARNING "%s: failed to create CQ\n", ca->name);
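
The no-SRQ receive path in the patch above packs two values into the 64-bit
wr_id (the ring slot in the upper 32 bits, the connection index in the low
bits) and grows the CQ by one receive ring per extra connection. A minimal
userspace sketch of that arithmetic follows. The SKETCH_* constants and all
function names are assumptions made up for illustration; the real
IPOIB_CM_OP_RECV, NOSRQ_INDEX_MASK and NOSRQ_INDEX_TABLE_SIZE definitions are
not shown in this excerpt and may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. The patch's own definitions
 * (IPOIB_CM_OP_RECV, NOSRQ_INDEX_TABLE_SIZE, NOSRQ_INDEX_MASK) are not
 * visible here. */
#define SKETCH_OP_RECV          (1ULL << 30)
#define SKETCH_INDEX_TABLE_SIZE 128U
#define SKETCH_INDEX_MASK       (SKETCH_INDEX_TABLE_SIZE - 1U)

/* Pack a ring slot and a connection index into one 64-bit work request id:
 * slot in the upper 32 bits, index in the low bits, receive flag OR-ed in. */
static uint64_t pack_wr_id(uint32_t slot, uint32_t index)
{
	assert(index <= SKETCH_INDEX_MASK);
	return ((uint64_t)slot << 32) | index | SKETCH_OP_RECV;
}

/* Recover the ring slot, as "wc->wr_id >> 32" does in the handler. */
static uint32_t wr_id_slot(uint64_t wr_id)
{
	return (uint32_t)(wr_id >> 32);
}

/* Recover the connection index: clear the receive flag, then mask. */
static uint32_t wr_id_index(uint64_t wr_id)
{
	return (uint32_t)((wr_id & ~SKETCH_OP_RECV) & SKETCH_INDEX_MASK);
}

/* Extra CQ entries for the no-SRQ case, on top of a base size that
 * already accounts for one receive ring. */
static unsigned int nosrq_extra_cqes(unsigned int recvq_size)
{
	return (SKETCH_INDEX_TABLE_SIZE - 1U) * recvq_size;
}
```

With the assumed table size of 128, nosrq_extra_cqes(128) comes to
127 * 128 = 16256 additional entries, which gives a feel for how quickly the
CQ grows when every connection owns its own receive ring.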


From hal.rosenstock at gmail.com  Thu Jul 19 12:32:26 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 19 Jul 2007 12:32:26 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <469FAAD8.8050505@open-mpi.org>
References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org>
	<468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org>
	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>
	<469FA652.4060909@open-mpi.org>
	<f0e08f230707191114i447224eci96761091373480d1@mail.gmail.com>
	<469FAAD8.8050505@open-mpi.org>
Message-ID: <f0e08f230707191232x22ee7605xe02ea0e1c0796a24@mail.gmail.com>

On 7/19/07, Andrew Friedley <afriedle at open-mpi.org> wrote:
>
>
>
> Hal Rosenstock wrote:
> > Thanks. I can only comment on the OpenSM configuration and in general on
> > SMs so I'm still not sure what limits you are hitting; it may be multiple
> > but not sure. Some seemed to be end node (HCA) related based on a previous
> > email.
>
> Thanks for your help.  Yes I'm thinking the same thing, though what I'm
> seeing seemingly contradicts the limits that I'm told are in place (and
> have now been changed post-v1.2).
>
> >> Is the algorithm for recalculating the tree documented at all?  Or,
> >> where is the code for it (assuming I have access)?  I feel like I'm
> >> missing something here that explains why it's so costly.
> >
> >
> > I'm afraid it is just the code AFAIK :-(
>
> OK, do you know where it is in the OpenSM code base?


Start with osm_sa_mcmember_record.c and work towards:
 osm_mcast_mgr.c  osm_mcm_info.c  osm_mtree.c
 osm_mcast_tbl.c  osm_mcm_port.c  osm_multicast.c

-- Hal

> Andrew

From mshefty at ichips.intel.com  Thu Jul 19 12:38:04 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 19 Jul 2007 12:38:04 -0700
Subject: [ofa-general] Limited number of multicasts groups that can be
	joined?
In-Reply-To: <469FA9E8.90609@open-mpi.org>
References: <46699A6D.4070300@open-mpi.org>	<4683D7D6.50402@open-mpi.org>		<468426B6.3060602@ichips.intel.com>	<469F9BAB.4080504@open-mpi.org>	<f0e08f230707191032h5134a643id01771438d7b9fb6@mail.gmail.com>	<469FA652.4060909@open-mpi.org>
	<469FA9E8.90609@open-mpi.org>
Message-ID: <469FBD9C.3020104@ichips.intel.com>

> Also some newer results.  I had a long run going on the 128 node machine 
> to see how many groups I really could join, and it just errored out 
> after joining 892 groups successfully.  Specifically, I got an 
> RDMA_CM_EVENT_MULTICAST_ERROR event containing status -22 ('Unknown 
> error' according to strerror).  errno is still cleared to 'Success'.  I 
> don't have time to go look at the code to see where this came from right 
> now, but does anyone know what it means?

This is EINVAL and is coming from the librdmacm.  That doesn't really 
help narrow down what the actual cause is unfortunately.  And I don't 
understand the behavior that you're seeing at all.

- Sean
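
The 'Unknown error' text most likely comes from handing the negative status
straight to strerror(), which expects a positive errno value. A minimal
sketch of the conversion in plain C follows. Both helper names are made up
for illustration and are not part of librdmacm.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn a kernel-style negative status (e.g. -22)
 * into the matching errno description. Non-negative values are passed
 * to strerror() unchanged. */
static const char *status_to_text(int status)
{
	int err = status < 0 ? -status : status;

	return strerror(err);
}

/* Hypothetical helper: true if the status encodes EINVAL. */
static int status_is_einval(int status)
{
	return status == -EINVAL;
}
```

On Linux EINVAL is 22, so status_is_einval(-22) holds there, and
status_to_text(-22) returns the same string as strerror(EINVAL).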


From jim at mellanox.com  Thu Jul 19 14:26:38 2007
From: jim at mellanox.com (Jim Mott)
Date: Thu, 19 Jul 2007 14:26:38 -0700
Subject: [ofa-general] libsdp in OFED 1.1
In-Reply-To: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp>
References: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp>
Message-ID: <F57121538EA0C94F86018DDD40ADA1D16A6348@mtiexch01.mti.com>

With the setup you describe, I have no problems under OFED 1.2.  SDP does get used automatically, and ls does not complain.


I do not have any experience with SDP under OFED 1.1.  I will try to look at it soon.

The OFED 1.1 library code in port.c includes a close() function, so the easy answer is not going to do it.

All my testing has been with the same libsdp.conf setup you are using (the 1.2 default), so I do not expect your setup to cause any problems.

The entry in modprobe.conf.local is the normal thing.  I would not expect it to be causing you any problems.

All my testing has been with the local environment (LD_LIBRARY_PATH, LD_PRELOAD) overrides instead of /etc/ld.so.preload.  Note that with your putting the fully qualified path for the 64 bit library in /etc/ld.so.preload, there will be issues with 32 bit executables.  Not sure if that is getting you with ls, but I have seen strange problems with other things.

Could you remove your entry from /etc/ld.so.preload, set the environment variables as described in my original note by hand, and retry the ls command?

Jim

-----Original Message-----
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan Robertson
Sent: Thursday, July 19, 2007 11:31 AM
To: general at lists.openfabrics.org
Subject: FW: [ofa-general] libsdp in OFED 1.1

Hi Jim,

We are actually using OFED 1.1. Hopefully we'll move to 1.2 in a few weeks. The systems using it are SLES 9 SP3.

Uname -a:
Linux oracle 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux

I have added alias net-pf-27 ib_sdp to modprobe.conf.local
I have modified /usr/local/ofed/etc/libsdp.conf to have the following lines:
log min-level 9 destination syslog
use both server "/usr/local/bin/netserver" *:*
use both client "/usr/local/bin/netperf" *:*
And I created /etc/ld.so.preload and have:
/usr/local/ofed/lib64/libsdp.so

Is there a close function in ofed 1.1? Perhaps I should try to add that to port.c for 1.1?

My reply to your email bounced...
5.1.0 - Unknown address error 550-'5.1.1 unknown or illegal alias: <removed>@austin.rr.com'

Thanks!
Jonathan

-----Original Message-----
From: Jim Mott [mailto:jimmmott at austin.rr.com] 
Sent: Wednesday, July 18, 2007 3:42 PM
To: Jonathan Robertson
Subject: RE: [ofa-general] libsdp in OFED 1.1

Hi,
  I have just taken over support for libsdp and am feeling my way here.
Probably I should have replied to the list, but this works too.

  I assume you are using OFED 1.2 version of the code.  There is a close()
function in that code (port.c), so there is something fishy here.  Could you
send a little more info please.  Stuff like distro, 32/64, and perhaps the
script/commands you use to automate the preload process.

Something like:

  # uname -a
  Linux sw106.lab.mtl.com 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT \
  2006 x86_64 x86_64 x86_64 GNU/Linux

  # export LD_LIBRARY_PATH=/usr/local/ofed/lib64:/usr/local/ofed/lib
  # export LD_PRELOAD=libsdp.so
  # export LIBSDP_CONFIG_FILE=/etc/infiniband/libsdp.conf 

  # ls 
  config_parser.c   config_scanner.c   libsdp.la  Makefile     match.c   port.lo
  config_parser.h   config_scanner.lo  log.c      Makefile.am  match.lo  sdp_inet.h
  config_parser.lo  libsdp.h           log.lo     Makefile.in  port.c    socket.c


Thanks,
Jim


=========================

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan
Robertson
Sent: Wednesday, July 18, 2007 3:19 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] libsdp in OFED 1.1

Hello,

I have been using libsdp, and preloading it with the application. I would
like to have it automatically preloaded, but am concerned about some error
messages that seem harmless. So I don't want to have our client use the
ld.so.preload if there are going to be messages.

I see the following when I run a simple 'ls'

# ls
Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found
 .
..
#

Any suggestions?

I have the following in libsdp.conf
Log min-level 9 destination syslog
Use both server netserver *:*
Use both client netperf *:*

Our client is interested in having weblogic communicate with the oracle DB
using SDP, and the interface to oracle and weblogic being accessible via
tcp/ip over Ethernet as well.

Thanks!
Jonathan




From sean.hefty at intel.com  Thu Jul 19 14:44:36 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 19 Jul 2007 14:44:36 -0700
Subject: [ofa-general] latest libipathverbs.git tree
Message-ID: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com>

Is the git tree on openfabrics:
 
git://git.openfabrics.org/~bos/libipathverbs.git

the most recent version of user space verbs available for the ipath cards?

- Sean


From arthur.jones at qlogic.com  Thu Jul 19 14:47:45 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Thu, 19 Jul 2007 14:47:45 -0700
Subject: [ofa-general] latest libipathverbs.git tree
In-Reply-To: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com>
References: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com>
Message-ID: <20070719214745.GB20240@bauxite.pathscale.com>

hi sean, ...

On Thu, Jul 19, 2007 at 02:44:36PM -0700, Sean Hefty wrote:
> Is the git tree on openfabrics:
>  
> git://git.openfabrics.org/~bos/libipathverbs.git
> 
> the most recent version of user space verbs available for the ipath cards?

no, the canonical libipathverbs is now:

git://git.openfabrics.org/~ralphc/libipathverbs

arthur


From sean.hefty at intel.com  Thu Jul 19 14:55:30 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 19 Jul 2007 14:55:30 -0700
Subject: [ofa-general] latest libipathverbs.git tree
In-Reply-To: <20070719214745.GB20240@bauxite.pathscale.com>
Message-ID: <000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com>

>no, the canonical libipathverbs is now:
>
>git://git.openfabrics.org/~ralphc/libipathverbs

Thanks.

I believe if you create /home/ralphc/public_html directory, and place symbolic
links in it to the git tree, then it will be visible on
http://www.openfabrics.org/git.  I don't remember if additional setup on the
server is required.

- Sean


From arthur.jones at qlogic.com  Thu Jul 19 15:09:05 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Thu, 19 Jul 2007 15:09:05 -0700
Subject: [ofa-general] latest libipathverbs.git tree
In-Reply-To: <000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com>
References: <20070719214745.GB20240@bauxite.pathscale.com>
	<000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com>
Message-ID: <20070719220905.GN12489@bauxite.pathscale.com>

hi sean, ...

On Thu, Jul 19, 2007 at 02:55:30PM -0700, Sean Hefty wrote:
> >no, the canonical libipathverbs is now:
> >
> >git://git.openfabrics.org/~ralphc/libipathverbs
> 
> Thanks.
> 
> I believe if you create /home/ralphc/public_html directory, and place symbolic
> links in it to the git tree, then it will be visible on
> http://www.openfabrics.org/git.  I don't remember if additional setup on the
> server is required.

thanks, i tried it, but it doesn't seem to be sufficient...

arthur


From sean.hefty at intel.com  Thu Jul 19 15:13:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 19 Jul 2007 15:13:35 -0700
Subject: [ofa-general] latest libipathverbs.git tree
In-Reply-To: <20070719220905.GN12489@bauxite.pathscale.com>
Message-ID: <000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com>

Jeff/Vlad,

Do either of you know the missing step to adding Ralph's git tree to the http
view?  (See below.)

- Sean

>> I believe if you create /home/ralphc/public_html directory, and place
>> symbolic links in it to the git tree, then it will be visible on
>> http://www.openfabrics.org/git.  I don't remember if additional setup on the
>> server is required.
>
>thanks, i tried it, but it doesn't seem to be sufficient...
>
>arthur


From weikuan.yu at gmail.com  Thu Jul 19 16:01:10 2007
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Thu, 19 Jul 2007 19:01:10 -0400
Subject: [ofa-general] IEEE Hot Interconnect 2007: Registration Now Open
Message-ID: <469FED36.9080000@gmail.com>

**** Conference Dates: August 22-24, 2007,  *********

CALL FOR PARTICIPATION:  HOT Interconnect 2007 -- Registration Now Open

15th Annual IEEE Symposium on High-Performance Interconnects
August 22nd-24th, 2007, Stanford University, Palo Alto, California
William R. Hewlett Teaching Center

http://www.hoti.org/

We cordially invite you to attend the 15th Annual IEEE Symposium on
High-Performance Interconnects. IEEE Hot Interconnects brings together
architects and designers of high performance chips, software, and
systems at the University and global business levels. Presentations
focus on up-to-the-minute developments demonstrating leading-edge
designs by engineers and researchers throughout the world.

Two days of technical sessions, led by our 2007 General Co-Chairs John
Lockwood and Fabrizio Petrini, are followed by one day of tutorials to
keep you on top of the latest developments from industry and academic
laboratories. Our objective is to address the Networking and
SuperComputing families. This year we are proud to have Ron Brightwell
with Sandia National Laboratories and Dhabaleswar Panda from Ohio State
University as our 2007 IEEE Hot Interconnects Program Co-Chairs. They
are putting together a combined 'HOT' program that includes
interconnects in Supercomputing.

Highlights include:
-------------------
   * Keynote talks:
     o Alex Dickinson, Co-Founder, President & CEO, Luxtera
       "CMOS Photonics - Bringing Moore's Law to Optical Interconnect"

     o Dr. Tryggve Fossum, Intel Fellow and Director of Microarchitecture
       Development
       "On-Die Interconnect and Other Challenges for Chip-Level
       Multi-Processing"

   * Panel: Multi-Multicore Interconnect: Scale-Up or Melt-Down?
     Panelists:
       -- Charlie Janac, President and CEO, Arteris
       -- Arun Sharma, Performance Engineering, Google
       -- Manu Thapar, Vice President, Platform Engineering, Yahoo
       -- Drew Wingard, CTO, Sonics
     Moderator: Dan Pitt, Director, Vquence Pty. Ltd.

   * Tutorials
     o NetFPGA (Full day)
       Nick McKeown and John Lockwood, Stanford University

     o Introduction to Programming High Performance Applications on the CELL
       Broadband Engine (Half-day)
       Dr. Jakub Kurzak and Dr. Alfredo Buttari
       Innovative Computing Laboratory, University of Tennessee at Knoxville

     o Design of Interconnection Networks (Half-day)
       John Kim, Stanford University and Dennis Abts, Cray

   * Technical Program
     o A strong, single-track program featuring 16 research papers on
       cutting-edge interconnect technologies

Full details are on the conference Web site:
http://www.hoti.org/hoti15/program/

Important dates:
----------------
  * Registration NOW open, please take advantage of advanced registration.
    (http://www.hoti.org/hoti15/2007reg/)

  * Advanced Registration Deadline:  Midnight Aug 15th, 2007

  * Attendees make their own choices of Hotel. For further info:
    (http://www.hoti.org/hoti15/attendee/)

  * Main Symposium: August 22-23, 2007

  * Panel: 7pm-8pm, August 22nd , 2007

  * Tutorials: August 24th, 2007


From kliteyn at mellanox.co.il  Thu Jul 19 21:46:57 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 20 Jul 2007 07:46:57 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-20:normal completion
Message-ID: <MTLEXCH01QIWDO3hRDT00001de6@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From krkumar2 at in.ibm.com  Thu Jul 19 23:32:01 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:02:01 +0530
Subject: [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063201.26341.79273.sendpatchset@localhost.localdomain>

Add HOWTO documentation on what batching is, how to implement drivers to use
it, and how users can enable/disable batching.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 Batching_skb_API.txt |   91 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 91 insertions(+)
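
The batching idea described in the HOWTO below can be modelled in userspace
to show where the savings come from: the batched handler takes the lock once,
drains every queued skb, and asks for a single completion for the last one.
This is only a toy sketch under stated assumptions. The toy_* types and
functions are hypothetical stand-ins for sk_buff, net_device and
dev->skb_blist, not code from this patch series.

```c
#include <stddef.h>

/* Toy stand-ins for sk_buff and net_device; every name is hypothetical. */
struct toy_skb {
	struct toy_skb *next;
	int len;
};

struct toy_dev {
	struct toy_skb *blist_head;     /* models dev->skb_blist */
	struct toy_skb *blist_tail;
	unsigned int lock_acquisitions; /* times the "tx lock" was taken */
	unsigned int completions;       /* completion requests issued */
	unsigned int sent;              /* skbs handed to the "hardware" */
};

/* Stack side: append an skb to the device's batch list. */
static void toy_enqueue(struct toy_dev *dev, struct toy_skb *skb)
{
	skb->next = NULL;
	if (dev->blist_tail)
		dev->blist_tail->next = skb;
	else
		dev->blist_head = skb;
	dev->blist_tail = skb;
}

/* Classic path: one skb per call, so one lock round trip and one
 * completion for every packet. */
static void toy_xmit_one(struct toy_dev *dev, struct toy_skb *skb)
{
	(void)skb;
	dev->lock_acquisitions++;
	dev->sent++;
	dev->completions++;
}

/* Batched path: drain the whole list under a single lock acquisition
 * and request a completion only for the last skb. Returns the number
 * of skbs sent. */
static unsigned int toy_xmit_batch(struct toy_dev *dev)
{
	unsigned int n = 0;
	struct toy_skb *skb;

	if (!dev->blist_head)
		return 0;

	dev->lock_acquisitions++;
	while ((skb = dev->blist_head) != NULL) {
		dev->blist_head = skb->next;
		skb->next = NULL;
		dev->sent++;
		n++;
	}
	dev->blist_tail = NULL;
	dev->completions++;
	return n;
}
```

Queueing three skbs and calling toy_xmit_batch() once costs one lock
acquisition and one completion, where three toy_xmit_one() calls would cost
three of each.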

diff -ruNp org/Documentation/networking/Batching_skb_API.txt new/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt	1970-01-01 05:30:00.000000000 +0530
+++ new/Documentation/networking/Batching_skb_API.txt	2007-07-20 08:30:22.000000000 +0530
@@ -0,0 +1,91 @@
+		 HOWTO for batching skb API support
+		 -----------------------------------
+
+Section 1: What is batching skb API ?
+Section 2: How batching API works vs the original API ?
+Section 3: How drivers can support this API ?
+Section 4: How users can work with this API ?
+
+
+Introduction: Kernel support for batching skb
+-----------------------------------------------
+
+An extended API is supported in the netdevice layer, which is very similar
+to the existing hard_start_xmit() API. Drivers which wish to take advantage
+of this new API should implement this routine similar to how the
+hard_start_xmit handler is written. The difference between these APIs is
+that while the existing hard_start_xmit processes one skb, the new API can
+process multiple skbs (or even one) in a single call. It is also possible
+for the driver writer to re-use most of the code from the existing API in
+the new API without having code duplication.
+
+
+Section 1: What is batching skb API ?
+-------------------------------------
+
+	This is a new API that is optionally exported by a driver. The pre-
+	requisite for a driver to use this API is that it should have a
+	reasonably sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching API works vs the original API ?
+-------------------------------------------------------
+
+	The networking stack normally gets called from upper layer protocols
+	with a single skb to xmit. This skb is first enqueue'd and an
+	attempt is next made to transmit it immediately (via qdisc_run).
+	However, events like driver lock contention, queue stopped, etc, can
+	result in the skb not getting sent out, and it remains in the queue.
+	When a new xmit is called or when the queue is re-enabled, qdisc_run
+	could potentially find multiple packets in the queue, and have to
+	send them all out one by one iteratively.
+
+	The batching skb API case was added to exploit this situation where
+	if there are multiple skbs, all of them can be sent to the device in
+	one shot. This reduces driver processing, locking at the driver (or
+	in the stack for non-LLTX drivers) gets amortized over multiple skbs, and
+	in case of specific drivers where every xmit results in a completion
+	processing (like IPoIB), optimizations could be made in the driver
+	to get a completion for only the last skb that was sent which will
+	result in saving interrupts for every (but the last) skb that was
+	sent in the same batch.
+
+	This batching can result in significant performance gains for
+	systems that have multiple data stream paths over the same network
+	interface card.
+
+
+Section 3: How drivers can support this API ?
+---------------------------------------------
+
+	The new API - dev->hard_start_xmit_batch(struct net_device *dev),
+	simplistically, can be written almost identically to the regular
+	xmit API (hard_start_xmit), except that all skbs on dev->skb_blist
+	should be processed by the driver instead of just one skb. The new
+	API does not take an skb argument to process; instead, it picks up
+	all the skbs from dev->skb_blist, where they were added by the stack,
+	and tries to send them out.
+
+	Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+	dev->features, and dev->hard_start_xmit_batch should point to the
+	new API implemented for that driver.
+
+
+Section 4: How users can work with this API ?
+---------------------------------------------
+
+	Batching could be disabled for a particular device, e.g. on desktop
+	systems if only one stream of network activity for that device is
+	taking place, since performance could be slightly affected due to
+	extra processing that batching adds. Batching can be enabled if
+	more than one stream of network activity per device is being done,
+	e.g. on servers, or even desktop usage with multiple browser, chat,
+	file transfer sessions, etc.
+
+	Per device batching can be enabled/disabled using:
+
+	echo 1 > /sys/class/net/<device-name>/tx_batch_skbs (enable)
+	echo 0 > /sys/class/net/<device-name>/tx_batch_skbs (disable)
+
+	E.g. to enable batching on eth0, run:
+		echo 1 > /sys/class/net/eth0/tx_batch_skbs
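
Note (illustrative only, not part of the patch): Section 3 of the HOWTO
describes a handler that takes no skb argument and drains dev->skb_blist
itself. The following is a minimal user-space sketch of that control flow.
Every name in it (model_skb, model_dev, model_blist_add, model_xmit_batch,
MODEL_TX_OK, MODEL_TX_BUSY) is hypothetical and only stands in for the
kernel structures; returning "busy" when skbs are left over mirrors what
the IPoIB handler in this series does, but is an assumption for other
drivers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel structures. */
struct model_skb {
	struct model_skb *next;
	int len;
};

struct model_dev {
	struct model_skb *blist_head;	/* stands in for dev->skb_blist */
	struct model_skb *blist_tail;
	unsigned long xmit_slots;	/* free hardware slots */
	unsigned long sent;		/* skbs handed to the "hardware" */
};

enum { MODEL_TX_OK = 0, MODEL_TX_BUSY = 1 };

/* Append one skb to the batch list, as the stack would before the call. */
static void model_blist_add(struct model_dev *dev, struct model_skb *skb)
{
	assert(dev && skb);
	skb->next = NULL;
	if (dev->blist_tail)
		dev->blist_tail->next = skb;
	else
		dev->blist_head = skb;
	dev->blist_tail = skb;
}

/* Batch handler: no skb argument; drains the list while slots remain. */
static int model_xmit_batch(struct model_dev *dev)
{
	assert(dev);
	while (dev->xmit_slots > 0 && dev->blist_head) {
		struct model_skb *skb = dev->blist_head;

		/* Unlink the head of the batch list. */
		dev->blist_head = skb->next;
		if (!dev->blist_head)
			dev->blist_tail = NULL;
		skb->next = NULL;

		/* "Transmit": consume one hardware slot per skb. */
		dev->xmit_slots--;
		dev->sent++;
	}

	/* Anything left on the list is reported as busy. */
	return dev->blist_head ? MODEL_TX_BUSY : MODEL_TX_OK;
}
```

With two free slots and three queued skbs, one call sends two skbs and
reports busy; a later call with free slots sends the remaining one.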


From krkumar2 at in.ibm.com  Thu Jul 19 23:31:49 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:01:49 +0530
Subject: [ofa-general] [PATCH 00/10] Implement batching skb API
Message-ID: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>

Hi Dave, Roland, everyone,

In May, I proposed an API for sending 'n' skbs to a driver to reduce lock
overhead and DMA operations and, for drivers that have completion
notification like IPoIB, to reduce completion handling ("[RFC] New driver
API to speed up small packets xmits" @
http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I also sent
initial test results for E1000, which showed minor improvements (but also
some degradations) @ http://marc.info/?l=linux-netdev&m=117887698405795&w=2.

After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
and now get good gains. Summary for TCP & No Delay: 1 process improves for
all cases from 1.4% to 49.5%; 4 process has almost identical improvements
from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
was tested with 1 process netperf with small increase in BW but big
improvement in Service Demand. Netperf latency tests show small drop in
transaction rate (results in separate attachment).

To verify that performance does not degrade with batching turned off (as is
the case for all existing drivers), I ran tests with tx_batch_skbs=0 vs the
original code and saw no real degradation. I also enabled all kernel debug
options to catch panics, warnings, memory free use bugs, etc, and simulated
driver errors to get coverage on core & IPoIB error paths. Testing was on
2-CPU X-series systems and 8-CPU PPC64 Power5 systems using IPoIB over mthca,
and E1000 (using the driver that Jamal had converted, which showed no
improvement).
On i386, the size of the kernel (drivers are modules) increased by:
	text: 0.007% data: 0.007% bss: 0% total: 0.03%.

There is a parallel WIP by Jamal but the two implementations are completely
different since the code bases from the start were separate. Key changes:
	- Use a single qdisc interface to avoid code duplication and reduce
	  maintenance effort (sch_generic.c size reduces by ~9%).
	- Has per device configurable parameter to turn on/off batching.
	- qdisc_restart gets slightly modified while staying simple, without
	  any checks for batching vs regular code (in fact only two lines have
	  changed: 1. instead of dev_dequeue_skb, a new batch-aware function
	  is called; and 2. an extra call to hard_start_xmit_batch).
	- Batching algo/processing is different (e.g. if qdisc_restart() finds
	  one skb in the batch list, it will try to batch more (up to a limit)
	  instead of sending that out and batching the rest in the next call).
	- No change in __qdisc_run other than a new argument (from DM's idea).
	- Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.
	- Jamal's code has a separate hw prep handler called from the stack,
	  and results are accessed in driver during xmit later.
	- Jamal's code has dev->xmit_win which is cached by the driver. Mine
	  has dev->xmit_slots but this is used only by the driver while the
	  core has a different mechanism to find how many skbs to batch.
	- Completely different structure/design & coding styles.
(This patch will work with drivers updated by Jamal, Matt & Michael Chan with
minor modifications - rename xmit_win to xmit_slots & rename batch handler)

Patches are described as:
	Mail 0/10  : This mail.
	Mail 1/10  : HOWTO documentation.
	Mail 2/10  : Networking include file changes.
	Mail 3/10  : dev.c changes.
	Mail 4/10  : net-sysfs.c changes.
	Mail 5/10  : sch_generic.c changes.
	Mail 6/10  : IPoIB include file changes.
	Mail 7/10  : IPoIB verbs changes
	Mail 8/10  : IPoIB multicast, CM changes
	Mail 9/10  : IPoIB xmit API addition
	Mail 10/10 : IPoIB xmit internals changes (ipoib_ib.c)

I am also separately sending an attachment with results (across a 10-run
cycle), test scripts, and a script to analyze results.

Thanks to Sridhar & Shirley Ma for code reviews; Evgeniy, Jamal & Sridhar for
suggesting to put driver skb list on netdev instead of on skb to avoid
requeue; and David Miller for explanation on using batching only when the
queue is woken up.

Please review and provide feedback/ideas; and consider for inclusion.

Thanks,

- KK


From krkumar2 at in.ibm.com  Thu Jul 19 23:33:01 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:01 +0530
Subject: [ofa-general] [PATCH 06/10] IPoIB header file changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063301.26341.70540.sendpatchset@localhost.localdomain>

IPoIB header file changes.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 ipoib.h |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 08:30:22.000000000 +0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int snum, int tx_index,
+		      struct ipoib_ah *address, u32 qpn);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


From krkumar2 at in.ibm.com  Thu Jul 19 23:32:16 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:02:16 +0530
Subject: [ofa-general] [PATCH 02/10] Networking include file changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063216.26341.80316.sendpatchset@localhost.localdomain>

Networking include file changes for batching.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 linux/netdevice.h |   10 ++++++++++
 net/pkt_sched.h   |    6 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
+++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
@@ -264,6 +264,8 @@ enum netdev_state_t
 	__LINK_STATE_QDISC_RUNNING,
 };
 
+/* Minimum length of device hardware queue for batching to work */
+#define MIN_QUEUE_LEN_BATCH	16
 
 /*
  * This structure holds at boot time configured netdevice settings. They
@@ -340,6 +342,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BATCH_SKBS	8192	/* Driver supports batch skbs API */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
@@ -452,6 +455,8 @@ struct net_device
 	struct Qdisc		*qdisc_sleeping;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+	unsigned long		xmit_slots;	/* Device free slots */
+	struct sk_buff_head	*skb_blist;	/* List of batch skbs */
 
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
@@ -472,6 +477,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -832,6 +840,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);
 
 extern void		dev_init(void);
 
diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-07-20 07:49:28.000000000 +0530
+++ new/include/net/pkt_sched.h	2007-07-20 08:30:22.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,


From krkumar2 at in.ibm.com  Thu Jul 19 23:32:49 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:02:49 +0530
Subject: [ofa-general] [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063249.26341.125.sendpatchset@localhost.localdomain>

net/sched/sch_generic.c changes to support batching. Adds a batch
aware function (get_skb) to get skbs to send.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 sch_generic.c |   94 +++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 71 insertions(+), 23 deletions(-)
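
Note (illustrative only, not part of the patch): the choice that get_skb()
makes between the regular and the batching path can be modelled in user
space with plain counters. This is a sketch under simplifying assumptions:
every dequeued skb adds exactly one entry to the batch list (no GSO
segmentation), and all names (model_state, model_path, model_get_skb) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for the qdisc and the batch list. */
struct model_state {
	bool batching;		/* caller passed a batch list (blist != NULL) */
	int blist_len;		/* skbs already on the batch list */
	int qdisc_len;		/* skbs waiting in the qdisc */
	int tx_queue_len;	/* upper bound on the batch list length */
};

enum model_path { MODEL_NONE, MODEL_REGULAR, MODEL_BATCH };

/* Decide which xmit path the caller should take, updating the counters. */
static enum model_path model_get_skb(struct model_state *s)
{
	int max;

	assert(s);

	/* No batching, or nothing worth batching: dequeue a single skb. */
	if (!s->batching || (s->blist_len == 0 && s->qdisc_len <= 1)) {
		if (s->qdisc_len == 0)
			return MODEL_NONE;
		s->qdisc_len--;
		return MODEL_REGULAR;
	}

	/* Batching: move skbs from the qdisc to the batch list, up to a limit. */
	max = s->tx_queue_len - s->blist_len;
	while (max > 0 && s->qdisc_len > 0) {
		s->qdisc_len--;
		s->blist_len++;
		max--;
	}
	return MODEL_BATCH;
}
```

In this model a lone skb on a batching device still takes the regular
path, so the single-stream case pays only for the extra length checks.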

diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/sched/sch_generic.c	2007-07-20 08:30:22.000000000 +0530
@@ -9,6 +9,11 @@
  * Authors:	Alexey Kuznetsov, <kuznet at ms2.inr.ac.ru>
  *              Jamal Hadi Salim, <hadi at cyberus.ca> 990601
  *              - Ingress support
+ *
+ * New functionality:
+ *		Krishna Kumar, <krkumar2 at in.ibm.com>, July 2007
+ *		- Support for sending multiple skbs to devices that support
+ *		  new api - dev->hard_start_xmit_batch()
  */
 
 #include <linux/bitops.h>
@@ -59,10 +64,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
@@ -91,18 +98,23 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop skb (or all
+		 * skbs in batching case) when deadloop is detected. Return
+		 * OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
 		ret = qdisc_qlen(q);
 	} else {
 		/*
-		 * Another cpu is holding lock, requeue & delay xmits for
-		 * some time.
+		 * Another cpu is holding lock. Requeue skb and delay xmits
+		 * for some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
 		ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +124,39 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * Algorithm to get skb(s) is:
+ *	- Non batching drivers, or if the batch list is empty and there is 1
+ *	  skb in the queue - dequeue skb and put it in *skbp to tell the
+ *	  caller to use the regular API.
+ *	- Batching drivers where the batch list already contains at least one
+ *	  skb or if there are multiple skbs in the queue: keep dequeuing
+ *	  skbs up to a limit and set *skbp to NULL to tell the caller to use
+ *	  the new API.
+ *
+ * Returns:
+ *	1 - at least one skb is to be sent out, *skbp contains skb or NULL
+ *	    (in case >1 skbs present in blist for batching)
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist,
+			  struct sk_buff **skbp)
+{
+	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}
+
+/*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,27 +175,28 @@ static inline int handle_dev_cpu_collisi
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
+	unsigned getlock;		/* whether we need to get lock or not */
 	int ret;
 
 	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+	if (unlikely(!get_skb(dev, q, blist, &skb)))
 		return 0;
 
 	/*
 	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
+	 * start_xmit. These checks are worth it because even uncontested
 	 * locks can be quite expensive. The driver can do a trylock, as
 	 * is being done here; in case of lock contention it should return
 	 * NETDEV_TX_LOCKED and the packet will be requeued.
 	 */
-	lockless = (dev->features & NETIF_F_LLTX);
+	getlock = !(dev->features & NETIF_F_LLTX);
 
-	if (!lockless && !netif_tx_trylock(dev)) {
+	if (getlock && !netif_tx_trylock(dev)) {
 		/* Another CPU grabbed the driver tx lock */
 		return handle_dev_cpu_collision(skb, dev, q);
 	}
@@ -158,9 +204,12 @@ static inline int qdisc_restart(struct n
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
-	ret = dev_hard_start_xmit(skb, dev);
+	if (likely(skb))
+		ret = dev_hard_start_xmit(skb, dev);
+	else
+		ret = dev->hard_start_xmit_batch(dev);
 
-	if (!lockless)
+	if (getlock)
 		netif_tx_unlock(dev);
 
 	spin_lock(&dev->queue_lock);
@@ -168,7 +217,7 @@ static inline int qdisc_restart(struct n
 
 	switch (ret) {
 	case NETDEV_TX_OK:
-		/* Driver sent out skb successfully */
+		/* Driver sent out skb (or entire skb_blist) successfully */
 		ret = qdisc_qlen(q);
 		break;
 
@@ -179,10 +228,9 @@ static inline int qdisc_restart(struct n
 
 	default:
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
-		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
-			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
+		if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit())
+			printk(KERN_WARNING " %s: BUG. code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
-
 		ret = dev_requeue_skb(skb, dev, q);
 		break;
 	}
@@ -190,10 +238,10 @@ static inline int qdisc_restart(struct n
 	return ret;
 }
 
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, blist))
 			break;
 	} while (!netif_queue_stopped(dev));
 

From krkumar2 at in.ibm.com  Thu Jul 19 23:32:27 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:02:27 +0530
Subject: [ofa-general] [PATCH 03/10] dev.c changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063227.26341.91868.sendpatchset@localhost.localdomain>

Changes in dev.c to support batching: add dev_add_skb_to_blist,
register_netdev recognizes batch aware drivers, and net_tx_action is
the sole user of batching.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 dev.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 74 insertions(+), 3 deletions(-)
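
Note (illustrative only, not part of the patch): the registration-time
check added to register_netdevice() can be summarized by the user-space
model below. It assumes the batch list allocation succeeds, which the
kernel code cannot assume; the struct and function names (model_netdev,
model_register) and the MODEL_ prefix are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MIN_QUEUE_LEN_BATCH 16	/* mirrors MIN_QUEUE_LEN_BATCH */

/* Hypothetical stand-in for the fields the check looks at. */
struct model_netdev {
	bool wants_batching;	/* driver set NETIF_F_BATCH_SKBS */
	bool has_batch_handler;	/* hard_start_xmit_batch is non-NULL */
	unsigned long tx_queue_len;
	bool batching_active;	/* batch list was set up */
};

/* Apply the registration rules: drop the feature or enable batching. */
static void model_register(struct model_netdev *dev)
{
	assert(dev);
	dev->batching_active = false;

	if (!dev->wants_batching)
		return;

	if (!dev->has_batch_handler ||
	    dev->tx_queue_len < MODEL_MIN_QUEUE_LEN_BATCH) {
		/* No handler or queue too short: silently fall back. */
		dev->wants_batching = false;
		return;
	}

	/* Batching is usable: the patch halves tx_queue_len here. */
	dev->tx_queue_len >>= 1;
	dev->batching_active = true;
}
```

A device registering with tx_queue_len 1000 and a batch handler ends up
with tx_queue_len 500; one with tx_queue_len 8 loses the feature bit and
keeps its queue length.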

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/core/dev.c	2007-07-20 08:31:35.000000000 +0530
@@ -1414,6 +1414,45 @@ static int dev_gso_segment(struct sk_buf
 	return 0;
 }
 
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -1566,7 +1605,7 @@ gso:
 			/* reset queue_mapping to zero */
 			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
-			qdisc_run(dev);
+			qdisc_run(dev, NULL);
 			spin_unlock(&dev->queue_lock);
 
 			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq
 			clear_bit(__LINK_STATE_SCHED, &dev->state);
 
 			if (spin_trylock(&dev->queue_lock)) {
-				qdisc_run(dev);
+				/*
+				 * Try to send out all skbs if batching is
+				 * enabled.
+				 */
+				qdisc_run(dev, dev->skb_blist);
 				spin_unlock(&dev->queue_lock);
 			} else {
 				netif_schedule(dev);
@@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
 		}
 	}
 
+	if (dev->features & NETIF_F_BATCH_SKBS) {
+		if (!dev->hard_start_xmit_batch ||
+		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
+			/*
+			 * Batch TX requires API support in the driver plus
+			 * a minimum sized queue.
+			 */
+			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
+					"since no API support or queue len "
+					"is smaller than %d.\n",
+					dev->name, MIN_QUEUE_LEN_BATCH);
+			dev->features &= ~NETIF_F_BATCH_SKBS;
+		} else {
+			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
+						 GFP_KERNEL);
+			if (dev->skb_blist) {
+				skb_queue_head_init(dev->skb_blist);
+				dev->tx_queue_len >>= 1;
+			}
+		}
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
@@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
 
 	synchronize_net();
 
+	/* Deallocate batching structure */
+	if (dev->skb_blist) {
+		skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);
 
-
 	/* Notify protocols, that we are about to destroy
 	   this device. They should clean all the things.
 	*/


From krkumar2 at in.ibm.com  Thu Jul 19 23:33:13 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:13 +0530
Subject: [ofa-general] [PATCH 07/10] IPoIB verb changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063313.26341.75017.sendpatchset@localhost.localdomain>

IPoIB verb changes to support batching.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 ipoib_verbs.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)
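
Note (illustrative only, not part of the patch): the init loop below walks
the work request array backwards so that each entry's next pointer can be
set in a single pass, and the CM post_send() path (patch 08/10) temporarily
cuts the chain after entry 0 to post a single request. A user-space sketch
of both steps, with hypothetical names (model_wr, model_link_wrs,
model_chain_len, model_post_single) and a counter in place of
ib_post_send():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_send_wr: only the chain pointer. */
struct model_wr {
	struct model_wr *next;
	unsigned long wr_id;
};

/* Link wr[0] -> wr[1] -> ... -> wr[n-1] -> NULL, iterating backwards. */
static void model_link_wrs(struct model_wr *wr, int n)
{
	struct model_wr *next_wr = NULL;
	int i;

	assert(wr && n > 0);
	for (i = n - 1; i >= 0; i--) {
		wr[i].next = next_wr;
		next_wr = &wr[i];
	}
}

/* Count how many requests a provider would see when handed this chain. */
static int model_chain_len(const struct model_wr *wr)
{
	int n = 0;

	for (; wr; wr = wr->next)
		n++;
	return n;
}

/* Post only wr[0]: cut the chain, "post", then restore the link. */
static int model_post_single(struct model_wr *wr, int n)
{
	int posted;

	assert(wr && n > 1);	/* restoring &wr[1] needs at least two entries */
	wr[0].next = NULL;
	posted = model_chain_len(wr);
	wr[0].next = &wr[1];
	return posted;
}
```

The restore step matters: without it, a later batched send starting at
entry 0 would see a chain of length one.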

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 08:30:22.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
 

From krkumar2 at in.ibm.com  Thu Jul 19 23:33:26 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:26 +0530
Subject: [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063326.26341.24459.sendpatchset@localhost.localdomain>

IPoIB Multicast and CM changes for batching support.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 ipoib_cm.c        |   13 +++++++++----
 ipoib_multicast.c |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 08:30:22.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 08:30:22.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:


From krkumar2 at in.ibm.com  Thu Jul 19 23:33:36 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:36 +0530
Subject: [ofa-general] [PATCH 09/10] IPoIB batching xmit handler support.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063336.26341.2955.sendpatchset@localhost.localdomain>

Add an IPoIB batching xmit handler.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 ipoib_main.c |  215 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 210 insertions(+), 5 deletions(-)
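
Note (illustrative only, not part of the patch): the batching handler
computes tx ring positions by masking with (ipoib_sendq_size - 1), which
is only a valid modulo when the ring size is a power of two. The sketch
below shows that arithmetic in isolation; the function names
(model_ring_slot, model_next_slot) are hypothetical and the power-of-two
requirement is stated as an assertion rather than taken from the driver.

```c
#include <assert.h>

/* Map a free-running head counter to a ring slot. */
static unsigned model_ring_slot(unsigned head, unsigned ring_size)
{
	/* The mask trick requires a non-zero power-of-two ring size. */
	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
	return head & (ring_size - 1);
}

/* Advance a slot index by one, wrapping at the end of the ring. */
static unsigned model_next_slot(unsigned slot, unsigned ring_size)
{
	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
	return (slot + 1) & (ring_size - 1);
}
```

For a 64-entry ring, a head counter of 130 maps to slot 2, and the slot
after 63 is 0.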

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 08:30:22.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, neigh->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 			goto out;
 		}
 
@@ -753,6 +755,177 @@ out:
 	return NETDEV_TX_OK;
 }
 
+#define	XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (num_skbs) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   num_skbs);				\
+			num_skbs = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct sk_buff_head *blist;
+	int max_skbs, num_skbs = 0, tx_ring_index = -1;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+		return NETDEV_TX_LOCKED;
+
+	blist = dev->skb_blist;
+
+	/*
+	 * Send at most xmit_slots skbs. This also prevents the device getting
+	 * full as ipoib_send modifies the xmit_slots and we use the same
+	 * value to figure how many skbs to send.
+	 */
+	max_skbs = dev->xmit_slots;
+
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+		/*
+		 * From here on, ipoib_send() cannot stop the queue as it
+		 * uses the same initialization as 'max_skbs'. So we can
+		 * optimize to not check for queue stopped for every skb.
+		 */
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_QUEUED_SKBS();
+				ipoib_path_lookup(skb, dev);
+				continue;
+			}
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_QUEUED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+					 */
+					ipoib_put_ah(neigh->ah);
+					list_del(&neigh->list);
+					ipoib_neigh_free(dev, neigh);
+					spin_unlock(&priv->lock);
+					XMIT_QUEUED_SKBS();
+					ipoib_path_lookup(skb, dev);
+					continue;
+				}
+
+				qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				if (neigh != old_neigh || qpn != old_qpn) {
+					/*
+					 * Sending to a different destination
+					 * from earlier skb's - send all
+					 * existing skbs (if any).
+					 */
+					if (tx_ring_index == -1) {
+						/*
+						 * First time, find where to
+						 * store skb.
+						 */
+						tx_ring_index = priv->tx_head &
+							(ipoib_sendq_size - 1);
+					} else {
+						/* Some skbs to send */
+						XMIT_QUEUED_SKBS();
+					}
+					old_neigh = neigh;
+					old_qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				}
+
+				if (ipoib_process_skb(dev, skb, priv, num_skbs,
+						      tx_ring_index, neigh->ah,
+						      qpn))
+					continue;
+
+				num_skbs++;
+
+				/* Queue'd one skb, get index for next skb */
+				if (max_skbs)
+					tx_ring_index = (tx_ring_index + 1) &
+							(ipoib_sendq_size - 1);
+				continue;
+			}
+
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				spin_lock(&priv->lock);
+				__skb_queue_tail(&neigh->queue, skb);
+				spin_unlock(&priv->lock);
+			} else {
+				dev_kfree_skb_any(skb);
+				++priv->stats.tx_dropped;
+				++max_skbs;
+			}
+		} else {
+			struct ipoib_pseudoheader *phdr =
+				(struct ipoib_pseudoheader *) skb->data;
+			skb_pull(skb, sizeof *phdr);
+
+			if (phdr->hwaddr[4] == 0xff) {
+				/* Add in the P_Key for multicast*/
+				phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+				phdr->hwaddr[9] = priv->pkey & 0xff;
+
+				XMIT_QUEUED_SKBS();
+				ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+			} else {
+				/* unicast GID -- should be ARP or RARP reply */
+
+				if ((be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_ARP) &&
+				    (be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_RARP)) {
+					ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+						IPOIB_GID_FMT "\n",
+						skb->dst ? "neigh" : "dst",
+						be16_to_cpup((__be16 *)
+						skb->data),
+						IPOIB_QPN(phdr->hwaddr),
+						IPOIB_GID_RAW_ARG(phdr->hwaddr
+								  + 4));
+					dev_kfree_skb_any(skb);
+					++priv->stats.tx_dropped;
+					++max_skbs;
+					continue;
+				}
+				XMIT_QUEUED_SKBS();
+				unicast_arp_send(skb, dev, phdr);
+			}
+		}
+	}
+
+	/* Send out last packets (if any) */
+	XMIT_QUEUED_SKBS();
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
 static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -898,11 +1071,35 @@ int ipoib_dev_init(struct net_device *de
 
 	/* priv->tx_head & tx_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	/* Allocate tx_sge */
+	priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+			       GFP_KERNEL);
+	if (!priv->tx_sge) {
+		printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
 		goto out_tx_ring_cleanup;
+	}
+
+	/* Allocate tx_wr */
+	priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+			      GFP_KERNEL);
+	if (!priv->tx_wr) {
+		printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
+		goto out_tx_sge_cleanup;
+	}
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_tx_wr_cleanup;
 
 	return 0;
 
+out_tx_wr_cleanup:
+	kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+	kfree(priv->tx_sge);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -930,9 +1127,13 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
+	kfree(priv->tx_sge);
+	kfree(priv->tx_wr);
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+	priv->tx_sge = NULL;
+	priv->tx_wr = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1144,7 @@ static void ipoib_setup(struct net_devic
 	dev->stop 		 = ipoib_stop;
 	dev->change_mtu 	 = ipoib_change_mtu;
 	dev->hard_start_xmit 	 = ipoib_start_xmit;
+	dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
 	dev->get_stats 		 = ipoib_get_stats;
 	dev->tx_timeout 	 = ipoib_timeout;
 	dev->hard_header 	 = ipoib_hard_header;
@@ -963,7 +1165,10 @@ static void ipoib_setup(struct net_devic
 	dev->addr_len 		 = INFINIBAND_ALEN;
 	dev->type 		 = ARPHRD_INFINIBAND;
 	dev->tx_queue_len 	 = ipoib_sendq_size * 2;
-	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
+	dev->features            = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX |
+					NETIF_F_BATCH_SKBS;
+
+	dev->xmit_slots		= ipoib_sendq_size;
 
 	/* MTU will be reset when mcast join happens */
 	dev->mtu 		 = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN;


From krkumar2 at in.ibm.com  Thu Jul 19 23:32:38 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:02:38 +0530
Subject: [ofa-general] [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063238.26341.41474.sendpatchset@localhost.localdomain>

Support to turn on/off batching from /sys.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 net-sysfs.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 70 insertions(+)
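
The toggle semantics of change_tx_batch_skbs() in the diff below can be
modelled in userspace. This is only a sketch: struct dev_model and
model_change_tx_batch() are hypothetical stand-ins for the net_device fields
and the sysfs handler, locking and the skb list itself are left out, and
-EOPNOTSUPP replaces the kernel-internal -ENOTSUPP, which userspace errno.h
does not define.

```c
#include <errno.h>
#include <stdbool.h>

#define MIN_QUEUE_LEN_BATCH 16	/* value defined in patch 02/10 */

/* Hypothetical stand-in for the few net_device fields involved. */
struct dev_model {
	bool supports_batching;		/* NETIF_F_BATCH_SKBS is set */
	bool batching;			/* skb_blist != NULL */
	unsigned long tx_queue_len;
};

/* Mirrors the checks and the queue-length adjustment of the handler. */
static int model_change_tx_batch(struct dev_model *d, unsigned long val)
{
	bool enable = !!val;	/* any non-zero value means "on" */

	if (!d->supports_batching ||
	    (enable && d->tx_queue_len < MIN_QUEUE_LEN_BATCH))
		return -EOPNOTSUPP;

	if (d->batching == enable)
		return 0;	/* already in the requested state */

	d->batching = enable;
	if (enable)
		d->tx_queue_len >>= 1;	/* halved when batching is turned on */
	else
		d->tx_queue_len <<= 1;	/* doubled when it is turned off */
	return 0;
}
```

One consequence of the shift-based adjustment: an odd tx_queue_len does not
survive an on/off round trip (1001 becomes 500, then 1000).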

diff -ruNp org/net/core/net-sysfs.c new/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c	2007-07-20 07:49:28.000000000 +0530
+++ new/net/core/net-sysfs.c	2007-07-20 08:34:45.000000000 +0530
@@ -230,6 +230,74 @@ static ssize_t store_weight(struct devic
 	return netdev_store(dev, attr, buf, len, change_weight);
 }
 
+static ssize_t show_tx_batch_skbs(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+
+	return sprintf(buf, fmt_dec, netdev->skb_blist ? 1 : 0);
+}
+
+static int change_tx_batch_skbs(struct net_device *net,
+				unsigned long new_tx_batch_skbs)
+{
+	int ret = 0;
+	struct sk_buff_head *blist;
+
+	if (!(net->features & NETIF_F_BATCH_SKBS) ||
+	    (new_tx_batch_skbs && net->tx_queue_len < MIN_QUEUE_LEN_BATCH)) {
+		/*
+		 * Driver doesn't support batching SKBS, or the queue len
+		 * is insufficient. TODO: Add similar check to disable
+		 * batching in change_tx_queue_len() if queue_len becomes
+		 * smaller than MIN_QUEUE_LEN_BATCH.
+		 */
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	/*
+	 * No range check is needed here: new_tx_batch_skbs is unsigned,
+	 * so it can never be negative. Any non-zero value enables
+	 * batching and zero disables it.
+	 */
+
+	/* Check if new value is same as the current */
+	new_tx_batch_skbs = !!new_tx_batch_skbs;
+	if (!!net->skb_blist == new_tx_batch_skbs)
+		goto out;
+
+	if (new_tx_batch_skbs &&
+	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock(&net->queue_lock);
+	if (new_tx_batch_skbs) {
+		skb_queue_head_init(blist);
+		net->skb_blist = blist;
+		net->tx_queue_len >>= 1;
+	} else {
+		if (!skb_queue_empty(net->skb_blist))
+			skb_queue_purge(net->skb_blist);
+		kfree(net->skb_blist);
+		net->skb_blist = NULL;
+		net->tx_queue_len <<= 1;
+	}
+	spin_unlock(&net->queue_lock);
+
+out:
+	return ret;
+}
+
+static ssize_t store_tx_batch_skbs(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_tx_batch_skbs);
+}
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +314,8 @@ static struct device_attribute net_class
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+	__ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skbs,
+	       store_tx_batch_skbs),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
 	{}
 };


From shemminger at linux-foundation.org  Fri Jul 20 00:18:48 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Fri, 20 Jul 2007 08:18:48 +0100
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720081848.7cc652fb@oldman>

On Fri, 20 Jul 2007 12:01:49 +0530
Krishna Kumar <krkumar2 at in.ibm.com> wrote:

> Hi Dave, Roland, everyone,
> 
> In May, I had proposed creating an API for sending 'n' skbs to a driver to
> reduce lock overhead, DMA operations, and specific to drivers that have
> completion notification like IPoIB - reduce completion handling ("[RFC] New
> driver API to speed up small packets xmits" @
> http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
> initial test results for E1000 which showed minor improvements (but also
> got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
> 
> After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
> and now get good gains. Summary for TCP & No Delay: 1 process improves for
> all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> was tested with 1 process netperf with small increase in BW but big
> improvement in Service Demand. Netperf latency tests show small drop in
> transaction rate (results in separate attachment).
> 

You may see worse performance with batching in the real world when
running over WANs.  Like TSO, batching will generate back-to-back packet
trains that are subject to multi-packet synchronized loss. The problem is that
intermediate router queues are often close to full, and when a long string
of packets arrives back to back, only the first ones will get in; the rest
get dropped.  Normal sends have at least minimal pacing, so they are less
likely to get synchronized drops.
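
The effect can be illustrated with a toy model. The sketch below is not from
the patch set; the function name and all numbers are made up, and cross
traffic that would refill the queue is not modelled. It only counts how many
packets of a train are dropped by a nearly full queue that drains one packet
per tick.

```c
/*
 * Toy model of a bottleneck router queue: 'capacity' slots, 'occupied'
 * of them already in use, one packet drained per tick. Returns how many
 * of 'n' packets are dropped when 'per_tick' of them arrive every tick.
 */
static int dropped(int capacity, int occupied, int n, int per_tick)
{
	int drops = 0;

	while (n > 0) {
		int arriving = n < per_tick ? n : per_tick;
		int i;

		for (i = 0; i < arriving; i++) {
			if (occupied < capacity)
				occupied++;	/* packet is queued */
			else
				drops++;	/* queue full: tail drop */
		}
		n -= arriving;

		/* The router forwards one packet per tick. */
		if (occupied > 0)
			occupied--;
	}
	return drops;
}
```

With a 64-slot queue that already holds 60 packets, dropped(64, 60, 16, 16)
(a 16-packet burst) loses 12 packets, while dropped(64, 60, 16, 1) (one
packet per tick) loses none.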


From krkumar2 at in.ibm.com  Fri Jul 20 00:20:09 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 12:50:09 +0530
Subject: [ofa-general] Results & Scripts for : "[PATCH 00/10] Implement
	batching skb API"
Message-ID: <OFC83B564A.1F170542-ON6525731E.0026886A-6525731E.00284C5F@in.ibm.com>


Attached file contains scripts for running tests and parsing results :

(See attached file: scripts.tar)

The results for TCP iperf (average of 10 runs) and UDP netperf (1 run) are
given below.

Thanks,

- KK

-----------------------------------------------------------------------------
Test configuration : MTHCA cards (MT23108) connected by a single cross-over
cable between two PPC64 systems; each system has 8 P5 CPUs at 1.5 GHz and
8GB of memory.

A. TCP results for a 10 run average are as follows (using iperf, could not
   run netperf in parallel as it is not synchronized):

        First number  : Orig BW in KB/s.
        Second number : New BW in KB/s.
        Third number  : Percentage change.

   IPoIB was configured with a sendq size of 512. The default configuration
   (128) gave positive results for most test cases, but more negative
   results for the 512 and 4K buffer sizes.

            Buffer Size 32
TCP Threads:1            :    3126     3169     1.4
TCP Threads:4            :    9739    10889    11.8
TCP Threads:16           :   35383    47218    33.4
TCP Threads:64           :   85147    84196    -1.1
            Average : 9.05%

TCP No Delay: Threads:1  :    1990     2976    49.5
TCP No Delay: Threads:4  :    8137     8770     7.7
TCP No Delay: Threads:16 :   31714    37308   17.63
TCP No Delay: Threads:64 :   72830    81892   12.44
            Average : 14.19%

            Buffer Size 128
TCP Threads:1            :   12674    13339     5.2
TCP Threads:4            :   37889    40816     7.7
TCP Threads:16           :  141342   165935    17.3
TCP Threads:64           :  199813   196283    -1.7
            Average : 6.29%

TCP No Delay: Threads:1  :    7732    11272    45.7
TCP No Delay: Threads:4  :   33348    35222     5.6
TCP No Delay: Threads:16 :  120507   143960    19.5
TCP No Delay: Threads:64 :  195459   193875    -0.8
            Average : 7.64%

            Buffer Size 512
TCP Threads:1            :   42256    55735    31.9
TCP Threads:4            :  161237   161777     0.3
TCP Threads:16           :  227911   231781     1.7
TCP Threads:64           :  229779   223152    -2.9
            Average : 1.70%

TCP No Delay: Threads:1  :   30065    42500    41.3
TCP No Delay: Threads:4  :   79076   125848    59.1
TCP No Delay: Threads:16 :  225725   224155    -0.7
TCP No Delay: Threads:64 :  231220   223664   -3.26
            Average : 8.84%

            Buffer Size 4096
TCP Threads:1            :  119364   135445    13.5
TCP Threads:4            :  261301   256754    -1.7
TCP Threads:16           :  246889   247065    0.07
TCP Threads:64           :  237613   234185    -1.4
            Average : 0.95%

TCP No Delay: Threads:1  :  102187   104087     1.9
TCP No Delay: Threads:4  :  204139   243169    19.1
TCP No Delay: Threads:16 :  245529   242519    -1.2
TCP No Delay: Threads:64 :  236826   233382    -1.4
            Average : 4.37%
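
The mail does not say how the "Average" rows are computed. They appear to be
the percentage change of the summed bandwidth of the four thread counts, not
the mean of the four per-row percentages: for plain TCP with buffer size 32
the mean of the row percentages would be about 11.4, while the change of the
sums is 9.05. A minimal sketch of that calculation, with hypothetical helper
names:

```c
#include <stddef.h>

/* Percentage change from 'orig' to 'new_bw'. */
static double pct_change(double orig, double new_bw)
{
	return (new_bw - orig) * 100.0 / orig;
}

/*
 * Percentage change of the summed bandwidth over 'n' test cases,
 * i.e. sum(new) compared against sum(orig).
 */
static double aggregate_pct_change(const long *orig, const long *new_bw,
				   size_t n)
{
	long sum_orig = 0, sum_new = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		sum_orig += orig[i];
		sum_new += new_bw[i];
	}
	return pct_change((double)sum_orig, (double)sum_new);
}
```

Feeding it the buffer size 32 TCP rows (3126, 9739, 35383, 85147 against
3169, 10889, 47218, 84196) gives roughly 9.05, matching the reported average.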

-----------------------------------------------------------------------------
B. Using netperf to run 1 process UDP (1 run, measured with 128 sendq
   size, will be re-doing with 512 sendq and for 10 runs average) :

----------------------------------------------------------
       Orig                  New             % change
BW        Service     BW        Service    BW      Service
----------------------------------------------------------
6.40      1277.64     6.50      1272.41    1.56    -0.40
24.80     663.01      25.80     318.13     4.03    -52.01
101.80    81.02       101.90    80.63      0.09    -0.48
395.70    20.77       395.90    20.74      0.05    -0.14
1172.90   7.00        1156.80   7.10       -1.37   1.42

---------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scripts.tar
Type: application/octet-stream
Size: 10240 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070720/3f06d470/attachment.obj>

From krkumar2 at in.ibm.com  Fri Jul 20 00:30:25 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 13:00:25 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
Message-ID: <OF37602E5B.BA36F154-ON6525731E.0028A4E9-6525731E.00293CC3@in.ibm.com>

Stephen Hemminger <shemminger at linux-foundation.org> wrote on 07/20/2007
12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WANs.  Like TSO, batching will generate back-to-back packet
> trains that are subject to multi-packet synchronized loss. The problem is
> that intermediate router queues are often close to full, and when a long
> string of packets arrives back to back, only the first ones will get in;
> the rest get dropped.  Normal sends have at least minimal pacing, so they
> are less likely to get synchronized drops.

Hi Stephen,

OK. The difference I can see is that in the existing code, the "minimal
pacing" could also lead to loss (possibly slightly less), since sends are
quick iterations at the IP layer, while with batching the sends are
iterated at the driver layer.

Is it an issue? Any suggestions?

Thanks,

- KK


From shemminger at linux-foundation.org  Fri Jul 20 00:57:37 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Fri, 20 Jul 2007 08:57:37 +0100
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <OF37602E5B.BA36F154-ON6525731E.0028A4E9-6525731E.00293CC3@in.ibm.com>
References: <20070720081848.7cc652fb@oldman>
	<OF37602E5B.BA36F154-ON6525731E.0028A4E9-6525731E.00293CC3@in.ibm.com>
Message-ID: <20070720085737.5319d3d4@oldman>

On Fri, 20 Jul 2007 13:00:25 +0530
Krishna Kumar2 <krkumar2 at in.ibm.com> wrote:

> Stephen Hemminger <shemminger at linux-foundation.org> wrote on 07/20/2007
> 12:48:48 PM:
> 
> > You may see worse performance with batching in the real world when
> > running over WANs.  Like TSO, batching will generate back-to-back packet
> > trains that are subject to multi-packet synchronized loss. The problem is
> > that intermediate router queues are often close to full, and when a long
> > string of packets arrives back to back, only the first ones will get in;
> > the rest get dropped.  Normal sends have at least minimal pacing, so they
> > are less likely to get synchronized drops.
> 
> Hi Stephen,
> 
> OK. The difference I can see is that in the existing code, the "minimal
> pacing" could also lead to loss (possibly slightly less), since sends are
> quick iterations at the IP layer, while with batching the sends are
> iterated at the driver layer.
> 
> Is it an issue? Any suggestions?

Not an immediate issue, but it is the kind of thing that could cause performance
regression reports if it was used on every interface by default.


From krkumar2 at in.ibm.com  Fri Jul 20 00:47:40 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 13:17:40 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
Message-ID: <OF106EAF42.CD1BE35D-ON6525731E.002A5041-6525731E.002AD13B@in.ibm.com>

Stephen Hemminger <shemminger at linux-foundation.org> wrote on 07/20/2007
12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WANs.  Like TSO, batching will generate back-to-back packet
> trains that are subject to multi-packet synchronized loss. The problem is
> that intermediate router queues are often close to full, and when a long
> string of packets arrives back to back, only the first ones will get in;
> the rest get dropped.  Normal sends have at least minimal pacing, so they
> are less likely to get synchronized drops.

Also, I forgot to mention in the previous mail: if performance is seen to
dip, batching can be disabled on WANs by:

echo 0 > /sys/class/net/<dev>/tx_batch_skbs

and use batching on local/site networks in that case.
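
The same toggle can be driven from a program. The sketch below assumes a
kernel with patch 04/10 applied, since tx_batch_skbs is the attribute that
patch adds; the helper names are hypothetical and the interface name is
supplied by the caller.

```c
#include <stdio.h>

/*
 * Build the sysfs path of the tx_batch_skbs attribute for interface
 * 'ifname'. Returns 0 on success, -1 if the path does not fit in 'buf'.
 */
static int batch_attr_path(char *buf, size_t len, const char *ifname)
{
	int n = snprintf(buf, len, "/sys/class/net/%s/tx_batch_skbs", ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "0" or "1" to the attribute. Returns 0 on success, -1 on any
 * error (attribute missing, no permission, or the write was rejected).
 */
static int set_tx_batching(const char *ifname, int enable)
{
	char path[128];
	FILE *f;
	int ret = 0;

	if (batch_attr_path(path, sizeof(path), ifname))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(enable ? "1" : "0", f) == EOF)
		ret = -1;
	/* The write reaches sysfs on flush, so check fclose() as well. */
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

Calling set_tx_batching("ib0", 0) would then correspond to the echo command
above for a device named ib0.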


From vlad at lists.openfabrics.org  Fri Jul 20 01:38:42 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 20 Jul 2007 01:38:42 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070720-0100 daily build status
Message-ID: <20070720083842.479C7E608CA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From vlad at lists.openfabrics.org  Fri Jul 20 02:43:16 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 20 Jul 2007 02:43:16 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070720-0200 daily build status
Message-ID: <20070720094316.E11CAE608C8@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.22
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From kaber at trash.net  Fri Jul 20 02:59:35 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 11:59:35 +0200
Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <20070720063216.26341.80316.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063216.26341.80316.sendpatchset@localhost.localdomain>
Message-ID: <46A08787.8040501@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> --- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
> @@ -264,6 +264,8 @@ enum netdev_state_t
>  	__LINK_STATE_QDISC_RUNNING,
>  };
>  
> +/* Minimum length of device hardware queue for batching to work */
> +#define MIN_QUEUE_LEN_BATCH	16


Is there any downside in using batching with smaller queue sizes?


From kaber at trash.net  Fri Jul 20 03:04:30 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 12:04:30 +0200
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <20070720063227.26341.91868.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063227.26341.91868.sendpatchset@localhost.localdomain>
Message-ID: <46A088AE.1090702@trash.net>

Krishna Kumar wrote:
> @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
>  		}
>  	}
>  
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		if (!dev->hard_start_xmit_batch ||
> +		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
> +			/*
> +			 * Batch TX requires API support in driver plus have
> +			 * a minimum sized queue.
> +			 */
> +			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
> +					"since no API support or queue len "
> +					"is smaller than %d.\n",
> +					dev->name, MIN_QUEUE_LEN_BATCH);
> +			dev->features &= ~NETIF_F_BATCH_SKBS;


The queue length can be changed through multiple interfaces; if that
really is important, you need to catch these cases too.

> +		} else {
> +			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> +						 GFP_KERNEL);


Why not simply put the head in struct net_device? It seems to me that
this could also be used for gso_skb.

> +			if (dev->skb_blist) {
> +				skb_queue_head_init(dev->skb_blist);
> +				dev->tx_queue_len >>= 1;
> +			}
> +		}
> +	}
> +
>  	/*
>  	 *	nil rebuild_header routine,
>  	 *	that should be never called and used as just bug trap.
> @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
>  
>  	synchronize_net();
>  
> +	/* Deallocate batching structure */
> +	if (dev->skb_blist) {
> +		skb_queue_purge(dev->skb_blist);
> +		kfree(dev->skb_blist);
> +		dev->skb_blist = NULL;
> +	}
> +

Queue purging should be done in dev_deactivate.


From kaber at trash.net  Fri Jul 20 03:07:20 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 12:07:20 +0200
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <20070720063238.26341.41474.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063238.26341.41474.sendpatchset@localhost.localdomain>
Message-ID: <46A08958.3090509@trash.net>

Krishna Kumar wrote:
> Support to turn on/off batching from /sys.


rtnetlink support seems more important than sysfs to me.


From kaber at trash.net  Fri Jul 20 03:11:01 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 12:11:01 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <20070720063249.26341.125.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063249.26341.125.sendpatchset@localhost.localdomain>
Message-ID: <46A08A35.5090104@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> --- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
> +++ new/net/sched/sch_generic.c	2007-07-20 08:30:22.000000000 +0530
> @@ -9,6 +9,11 @@
>   * Authors:	Alexey Kuznetsov, <kuznet at ms2.inr.ac.ru>
>   *              Jamal Hadi Salim, <hadi at cyberus.ca> 990601
>   *              - Ingress support
> + *
> + * New functionality:
> + *		Krishna Kumar, <krkumar2 at in.ibm.com>, July 2007
> + *		- Support for sending multiple skbs to devices that support
> + *		  new api - dev->hard_start_xmit_batch()


No new changelogs in source code, please; git keeps track of that.

> -static inline int qdisc_restart(struct net_device *dev)
> +static inline int qdisc_restart(struct net_device *dev,
> +				struct sk_buff_head *blist)
>  {
>  	struct Qdisc *q = dev->qdisc;
>  	struct sk_buff *skb;
> -	unsigned lockless;
> +	unsigned getlock;		/* whether we need to get lock or not */


Unrelated rename, please get rid of this to reduce the noise.


From krkumar2 at in.ibm.com  Thu Jul 19 23:33:48 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:48 +0530
Subject: [ofa-general] [PATCH 10/10] IPoIB batching in internal xmit/handler
	routines.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063348.26341.73753.sendpatchset@localhost.localdomain>

Add batching support to IPoIB post_send and TX completion handler.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
 ipoib_ib.c |  233 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 187 insertions(+), 46 deletions(-)
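
The completion handler in the diff below reaps every ring slot from tx_tail
up to the wr_id of the signaled WR. The slot count, including the case where
wr_id has wrapped around the end of the ring, can be sketched in plain C.
The function names here are hypothetical, and ring_size is assumed to be a
power of two, as the mask arithmetic in the patch implies for
ipoib_sendq_size.

```c
/*
 * Number of ring slots to reap when a completion arrives for slot
 * 'wr_id' and the oldest outstanding slot is 'tail_index'. Both are
 * indexes into a ring of 'ring_size' entries.
 */
static int num_completions(int wr_id, int tail_index, int ring_size)
{
	int n = wr_id - tail_index + 1;

	if (n <= 0)
		n += ring_size;	/* wr_id wrapped past the end of the ring */
	return n;
}

/* Advance a ring index by one slot, wrapping with a mask. */
static int next_index(int index, int ring_size)
{
	return (index + 1) & (ring_size - 1);
}
```

For a 128-entry ring with the tail at slot 126, a completion for wr_id 2
covers five slots: 126, 127, 0, 1 and 2.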

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 08:30:22.000000000 +0530
@@ -242,8 +242,9 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i = 0, num_completions;
+	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 	unsigned int wr_id = wc->wr_id;
-	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,60 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	num_completions = wr_id - tx_ring_index + 1;
+	if (num_completions <= 0)
+		num_completions += ipoib_sendq_size;
+
+	/*
+	 * Handle skbs completion from tx_tail to wr_id. It is possible to
+	 * handle WC's from earlier post_sends (possible multiple) in this
+	 * iteration as we move from tx_tail to wr_id, since if the last
+	 * WR (which is the one which had a completion request) failed to be
+	 * sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WR's of those earlier
+	 * request(s).
+	 */
+	while (1) {
+		/*
+		 * Could use while (i < num_completions), but it is costly
+		 * since in most cases there is 1 completion, and we end up
+		 * doing an extra "index = (index+1) & (ipoib_sendq_size-1)"
+		 */
+		struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (++i == num_completions)
+			break;
 
-	dev_kfree_skb_any(tx_req->skb);
+		/* More WC's to handle */
+		tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		netif_wake_queue(dev);
 	}
+
+	/* Make more slots available for posts */
+	dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -340,78 +378,181 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned error and we are trying to send the untried
+ * WR's, in which case start_index will point to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WR's to process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int start_index, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id 	      = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn  = qpn;
-	priv->tx_wr.wr.ud.ah 	      = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	/* Restore send_flags & WR chain */
+	last_wr->send_flags = 0;
+	last_wr->next = next_wr;
+
+	return ret;
 }
 
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr
+ * to pass to the driver.
+ *
+ * Returns :
+ *	- 0 on successful processing of the skb
+ *	- 1 if the skb was freed.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int wr_num,
+		      int tx_ring_index, struct ipoib_ah *address, u32 qpn)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	u64 addr;
+	struct ipoib_tx_buf *tx_req;
 
 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
-		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+		ipoib_warn(priv, "packet len %d (> %d) too long to "
+			   "send, dropping\n",
 			   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
 		++priv->stats.tx_dropped;
 		++priv->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
-		return;
+		return 1;
 	}
 
-	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
+	ipoib_dbg_data(priv, "sending packet, length=%d address=%p "
+		       "qpn=0x%06x\n",
 		       skb->len, address, qpn);
 
 	/*
 	 * We put the skb into the tx_ring _before_ we call post_send()
 	 * because it's entirely possible that the completion handler will
-	 * run before we execute anything after the post_send().  That
+	 * run before we execute anything after the post_send(). That
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
-	tx_req->skb = skb;
-	addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
-				 DMA_TO_DEVICE);
+	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++priv->stats.tx_errors;
 		dev_kfree_skb_any(skb);
-		return;
+		return 1;
 	}
+
+	tx_req = &priv->tx_ring[tx_ring_index];
+	tx_req->skb = skb;
 	tx_req->mapping = addr;
+	priv->tx_sge[wr_num].addr = addr;
+	priv->tx_sge[wr_num].length = skb->len;
+	priv->tx_wr[wr_num].wr_id = tx_ring_index;
+	priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+	priv->tx_wr[wr_num].wr.ud.ah = address->ah;
 
-	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
-			       address->ah, qpn, addr, skb->len))) {
-		ipoib_warn(priv, "post_send failed\n");
-		++priv->stats.tx_errors;
-		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
-		dev_kfree_skb_any(skb);
-	} else {
-		dev->trans_start = jiffies;
+	return 0;
+}
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+/*
+ * If an skb is passed to this function, it is the single, unprocessed skb
+ * send case. Otherwise if skb is NULL, it means that all skbs are already
+ * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+		struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int start_index = 0;
+
+	if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head &
+				     (ipoib_sendq_size - 1), address, qpn))
+		return;
+
+	/* Send out all the skb's in one post */
+	while (num_skbs) {
+		struct ib_send_wr *bad_wr;
+
+		if (unlikely((post_send(priv, qpn, start_index, num_skbs,
+					&bad_wr)))) {
+			int done;
+
+			/*
+			 * Better error handling can be done here, like free
+			 * all untried skbs if err == -ENOMEM. However at this
+			 * time, we re-try all the skbs, all of which will
+			 * likely fail anyway (unless device finished sending
+			 * some out in the meantime). This is not a regression
+			 * since the earlier code is not doing this either.
+			 */
+			ipoib_warn(priv, "post_send failed\n");
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+			/* Get #WR's that finished successfully */
+			done = bad_wr - &priv->tx_wr[start_index];
+
+			/* Handle 1 error */
+			priv->stats.tx_errors++;
+			ib_dma_unmap_single(priv->ca,
+				priv->tx_sge[start_index + done].addr,
+				priv->tx_sge[start_index + done].length,
+				DMA_TO_DEVICE);
+
+			/* Handle 'n' successes */
+			if (done) {
+				dev->trans_start = jiffies;
+				address->last_send = priv->tx_head;
+			}
+
+			/* Free failed WR & reset for WC handler to recognize */
+			dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+			priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+			/* Move head to first untried WR */
+			priv->tx_head += (done + 1);
+				/* + 1 for WR that was tried & failed */
+
+			/* Get count of skbs that were not tried */
+			num_skbs -= (done + 1);
+
+			/* Get start index for next iteration */
+			start_index += (done + 1);
+		} else {
+			dev->trans_start = jiffies;
+
+			address->last_send = priv->tx_head;
+			priv->tx_head += num_skbs;
+			num_skbs = 0;
 		}
 	}
+
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+		/*
+		 * Not accurate as some intermediate slots could have been
+		 * freed on error, but no harm - only queue stopped earlier.
+		 */
+		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+		netif_stop_queue(dev);
+		set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	}
+
+	/* Reduce the number of slots for sends */
+	dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
 }
 
 static void __ipoib_reap_ah(struct net_device *dev)

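For readers following the completion-handling hunk above, here is a minimal
standalone sketch of the wrap-around arithmetic used to count how many ring
slots a single completion covers. It assumes the ring size is a power of two
(the mask in the patch relies on that) and that the starting index is the
slot of the oldest outstanding send. SENDQ_SIZE and both function names are
illustrative only and are not part of the patch.

```c
#include <assert.h>

/* Illustrative ring size; the driver uses ipoib_sendq_size (a power of two). */
#define SENDQ_SIZE 64

/*
 * Number of ring slots covered by a completion for 'wr_id' when the oldest
 * outstanding slot is 'tail_index'. Both are ring indices in [0, SENDQ_SIZE).
 */
static int num_completions(int wr_id, int tail_index)
{
	int n;

	assert(wr_id >= 0 && wr_id < SENDQ_SIZE);
	assert(tail_index >= 0 && tail_index < SENDQ_SIZE);

	n = wr_id - tail_index + 1;
	if (n <= 0)		/* wr_id has wrapped past the end of the ring */
		n += SENDQ_SIZE;
	return n;
}

/* Advance a ring index by one slot, wrapping with a mask. */
static int next_index(int index)
{
	return (index + 1) & (SENDQ_SIZE - 1);
}
```

With a tail slot of 62 and a completion for wr_id 1, the slots handled are
62, 63, 0 and 1, which is why the negative difference is corrected by adding
the ring size.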

From krkumar2 at in.ibm.com  Fri Jul 20 03:28:49 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 15:58:49 +0530
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <46A08958.3090509@trash.net>
Message-ID: <OF3743242C.0BC81AC3-ON6525731E.00397CC1-6525731E.00399244@in.ibm.com>

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:37:20 PM:

> Krishna Kumar wrote:
> > Support to turn on/off batching from /sys.
>
>
> rtnetlink support seems more important than sysfs to me.

Thanks, I will add that as a patch. The reason to add to sysfs is that
it is easier to change for a user (and similar to tx_queue_len).

- KK


From krkumar2 at in.ibm.com  Fri Jul 20 03:27:37 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 15:57:37 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <46A088AE.1090702@trash.net>
Message-ID: <OFC586F180.42DDCDC5-ON6525731E.0038D792-6525731E.00397615@in.ibm.com>

Hi Patrick,

Thanks for your comments.

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:34:30 PM:

> The queue length can be changed through multiple interfaces, if that
> really is important you need to catch these cases too.

I have a TODO comment in net-sysfs.c which is to catch this case.

> > +      } else {
> > +         dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> > +                   GFP_KERNEL);
>
>
> Why not simply put the head in struct net_device? It seems to me that
> this could also be used for gso_skb.

Without going into GSO, it is wasting some 32 bytes on i386 since most
drivers don't export this API.

> Queue purging should be done in dev_deactivate.

I originally had it in dev_deactivate, but when I did an ifdown eth0, ifup
eth0, the system panic'd. The first solution I thought of was to initialize
the skb_blist in dev_change_flags() rather than in register_netdev(), but
then felt that a series of ifup/ifdown will unnecessarily check
stuff/malloc/free/initialize stuff, and so thought of putting it in
unregister_netdev (where it is balanced with register_netdev).

Is there any reason to move this ?

Thanks,

- KK


From krkumar2 at in.ibm.com  Fri Jul 20 03:32:42 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 16:02:42 +0530
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <46A08A35.5090104@trash.net>
Message-ID: <OF9D893096.26B48226-ON6525731E.003999A1-6525731E.0039ED23@in.ibm.com>

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:41:01 PM:

> Krishna Kumar wrote:
> > diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> > --- org/net/sched/sch_generic.c   2007-07-20 07:49:28.000000000 +0530
> > +++ new/net/sched/sch_generic.c   2007-07-20 08:30:22.000000000 +0530
> > @@ -9,6 +9,11 @@
> >   * Authors:   Alexey Kuznetsov, <kuznet at ms2.inr.ac.ru>
> >   *              Jamal Hadi Salim, <hadi at cyberus.ca> 990601
> >   *              - Ingress support
> > + *
> > + * New functionality:
> > + *      Krishna Kumar, <krkumar2 at in.ibm.com>, July 2007
> > + *      - Support for sending multiple skbs to devices that support
> > + *        new api - dev->hard_start_xmit_batch()
>
>
> No new changelogs in source code please, git keeps track of that.

Ah, didn't know this, thanks for letting me know.

> > -static inline int qdisc_restart(struct net_device *dev)
> > +static inline int qdisc_restart(struct net_device *dev,
> > +            struct sk_buff_head *blist)
> >  {
> >     struct Qdisc *q = dev->qdisc;
> >     struct sk_buff *skb;
> > -   unsigned lockless;
> > +   unsigned getlock;      /* whether we need to get lock or not */
>
>
> Unrelated rename, please get rid of this to reduce the noise.

OK, I guess I should have sent that change earlier :) The reason to change
the name is to avoid (double-negative) checks like :

      if (!lockless)
to
      if (getlock).

I will remove these changes.

thanks,

- KK


From kaber at trash.net  Fri Jul 20 04:20:37 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 13:20:37 +0200
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <OFC586F180.42DDCDC5-ON6525731E.0038D792-6525731E.00397615@in.ibm.com>
References: <OFC586F180.42DDCDC5-ON6525731E.0038D792-6525731E.00397615@in.ibm.com>
Message-ID: <46A09A85.7020500@trash.net>

Krishna Kumar2 wrote:
> Hi Patrick,
>
> Thanks for your comments.
>
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:34:30 PM:
>
>   
>> The queue length can be changed through multiple interfaces, if that
>> really is important you need to catch these cases too.
>>     
>
> I have a TODO comment in net-sysfs.c which is to catch this case.
>   

I noticed that. Still wondering why it is important at all though.

>   
>>> +      } else {
>>> +         dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
>>> +                   GFP_KERNEL);
>>>       
>> Why not simply put the head in struct net_device? It seems to me that
>> this could also be used for gso_skb.
>>     
>
> Without going into GSO, it is wasting some 32 bytes on i386 since most
> drivers don't export this API.
>   

32 bytes? I count 16, - 4 for the pointer, so it's 12 bytes of waste.
If you'd use it for gso_skb it would come down to 8 bytes. struct
net_device is a pig already, and there are better ways to reduce this
than starting to allocate single members with a few bytes IMO.
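
The arithmetic above can be checked with a small sketch. It assumes an
i386-like layout in which pointers and the spinlock are 4 bytes each (no lock
debugging options), modelled with fixed-width fields so the result does not
depend on the build host. The struct name and helpers are made up for
illustration; the struct is only a stand-in for struct sk_buff_head, and the
8-byte figure presumably comes from also dropping the separate gso_skb
pointer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for struct sk_buff_head on a 32-bit build (assumed layout:
 * two list pointers, a length counter and a 4-byte spinlock).
 */
struct blist_head_i386 {
	uint32_t next;	/* models struct sk_buff *next */
	uint32_t prev;	/* models struct sk_buff *prev */
	uint32_t qlen;	/* models __u32 qlen */
	uint32_t lock;	/* models spinlock_t without debug options */
};

/* Size of a pointer on the modelled 32-bit system. */
#define PTR_SIZE_I386 4u

/* Extra bytes in net_device when the head is embedded instead of pointed to. */
static size_t embedded_overhead(void)
{
	return sizeof(struct blist_head_i386) - PTR_SIZE_I386;
}

/* Same, if the embedded head also replaces the gso_skb pointer. */
static size_t embedded_overhead_replacing_gso_skb(void)
{
	return embedded_overhead() - PTR_SIZE_I386;
}
```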

>   
>> Queue purging should be done in dev_deactivate.
>>     
>
> I originally had it in dev_deactivate, but when I did an ifdown eth0, ifup
> eth0, the system panic'd. The first solution I thought of was to initialize
> the skb_blist in dev_change_flags() rather than in register_netdev(), but
> then felt that a series of ifup/ifdown will unnecessarily check
> stuff/malloc/free/initialize stuff, and so thought of putting it in
> unregister_netdev (where it is balanced with register_netdev).
>
> Is there any reason to move this ?
>   

Yes, packets can be holding references to various stuff and
these should be released on device down. As I said above I
don't really like the allocation, but even if you want to
keep it, just do the purging in dev_deactivate and keep the
freeing in unregister_netdev (actually I guess it should be
free_netdev to handle register_netdevice errors).
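
The ordering suggested above can be sketched in a small userspace model:
queued packets are dropped when the device goes down, while the list head
itself lives from registration until the device is freed. Every type and
function name here is hypothetical and only mirrors the roles of
register_netdev, dev_deactivate and free_netdev; it is not kernel code.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for an skb, the batch list and the device. */
struct pkt_sketch { struct pkt_sketch *next; };
struct blist_sketch { struct pkt_sketch *head; int qlen; };
struct netdev_life_sketch { struct blist_sketch *skb_blist; };

/* Registration: allocate the (empty) batch list once. */
static int sketch_register(struct netdev_life_sketch *dev)
{
	dev->skb_blist = calloc(1, sizeof(*dev->skb_blist));
	return dev->skb_blist ? 0 : -1;
}

/* Queue one packet on the batch list. */
static int sketch_queue(struct netdev_life_sketch *dev)
{
	struct pkt_sketch *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->next = dev->skb_blist->head;
	dev->skb_blist->head = p;
	dev->skb_blist->qlen++;
	return 0;
}

/* Device down: purge queued packets so their references are released. */
static void sketch_deactivate(struct netdev_life_sketch *dev)
{
	struct blist_sketch *l = dev->skb_blist;

	while (l->head) {
		struct pkt_sketch *p = l->head;

		l->head = p->next;
		free(p);
		l->qlen--;
	}
}

/* Final teardown: only here is the list head itself released. */
static void sketch_free(struct netdev_life_sketch *dev)
{
	free(dev->skb_blist);
	dev->skb_blist = NULL;
}
```

A repeated down/up cycle then only purges and refills the list; nothing is
allocated or freed per cycle, which was the concern with initializing in
dev_change_flags().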


From kaber at trash.net  Fri Jul 20 04:21:51 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 13:21:51 +0200
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <OF3743242C.0BC81AC3-ON6525731E.00397CC1-6525731E.00399244@in.ibm.com>
References: <OF3743242C.0BC81AC3-ON6525731E.00397CC1-6525731E.00399244@in.ibm.com>
Message-ID: <46A09ACF.20805@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:37:20 PM:
>
>
>   
>> rtnetlink support seems more important than sysfs to me.
>>     
>
> Thanks, I will add that as a patch. The reason to add to sysfs is that
> it is easier to change for a user (and similar to tx_queue_len).
>   

Thanks.


From kaber at trash.net  Fri Jul 20 04:24:01 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 13:24:01 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <OF9D893096.26B48226-ON6525731E.003999A1-6525731E.0039ED23@in.ibm.com>
References: <OF9D893096.26B48226-ON6525731E.003999A1-6525731E.0039ED23@in.ibm.com>
Message-ID: <46A09B51.6030301@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:41:01 PM:
>   
>>> -static inline int qdisc_restart(struct net_device *dev)
>>> +static inline int qdisc_restart(struct net_device *dev,
>>> +            struct sk_buff_head *blist)
>>>  {
>>>     struct Qdisc *q = dev->qdisc;
>>>     struct sk_buff *skb;
>>> -   unsigned lockless;
>>> +   unsigned getlock;      /* whether we need to get lock or not */
>>>       
>> Unrelated rename, please get rid of this to reduce the noise.
>>     
>
> OK, I guess I should have sent that change earlier :) The reason to change
> the name is to avoid (double-negative) checks like :
>
>       if (!lockless)
> to
>       if (getlock).
>
> I will remove these changes.
>   

I guess you could put it in another patch. But frankly, I think
the biggest ugliness is the conditional locking, not naming or
double negation, so it won't really make the code any nicer :)


From krkumar2 at in.ibm.com  Fri Jul 20 04:52:05 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 17:22:05 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <46A09A85.7020500@trash.net>
Message-ID: <OF316A971D.F3CBB5DC-ON6525731E.003F6683-6525731E.004131B7@in.ibm.com>

Hi Patrick,

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 04:50:37 PM:

> > I have a TODO comment in net-sysfs.c which is to catch this case.
> >
>
> I noticed that. Still wondering why it is important at all though.

I saw another mail of yours on the marc list on this same topic (which
still hasn't come to me in the mail), so I will answer both :

> Is there any downside in using batching with smaller queue sizes?

I think there is, but as yet I don't have any data to show it (and 16 is
probably higher than required). If the queue size is very small (like 4),
the extra processing to maintain this list may take more cycles than the
performance gains from sending out a few skbs, especially since most xmits
will send out 1 skb and skb batching takes place less often (when the tx
lock fails or the queue gets full).

OTOH, there might be a gain to even send out 2 skbs; the problem is in
doing the extra processing before xmit and not at the time of xmit.

Does this sound OK ? If so, I will add the code to implement the TODO for
tx_queue_len checking too.

> > Without going into GSO, it is wasting some 32 bytes on i386 since most
> > drivers don't export this API.
>
> 32 bytes? I count 16, - 4 for the pointer, so it's 12 bytes of waste.
> If you'd use it for gso_skb it would come down to 8 bytes. struct
> net_device is a pig already, and there are better ways to reduce this
> than starting to allocate single members with a few bytes IMO.

Sorry, I wanted to say 12 bytes on 32 bit system but mixed it up and
said 32 bytes. So I guess static allocation is better then, and it will
also help in performance as memory access is not required (offsetof
should work).

> Yes, packets can be holding references to various stuff and
> these should be released on device down. As I said above I
> don't really like the allocation, but even if you want to
> keep it, just do the purging in dev_deactivate and keep the
> freeing in unregister_netdev (actually I guess it should be
> free_netdev to handle register_netdevice errors).

Right, that makes it clean to do (and avoid stale packets on down).
I will make both these changes now.

Thanks for these suggestions,

- KK


From kaber at trash.net  Fri Jul 20 04:55:37 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 13:55:37 +0200
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <OF316A971D.F3CBB5DC-ON6525731E.003F6683-6525731E.004131B7@in.ibm.com>
References: <OF316A971D.F3CBB5DC-ON6525731E.003F6683-6525731E.004131B7@in.ibm.com>
Message-ID: <46A0A2B9.1050504@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 04:50:37 PM
>> Is there any downside in using batching with smaller queue sizes?
>>     
>
> I think there is, but as yet I don't have any data to show it (and 16 is
> probably higher than required). If the queue size is very small (like 4),
> the extra processing to maintain this list may take more cycles than the
> performance gains from sending out a few skbs, especially since most xmits
> will send out 1 skb and skb batching takes place less often (when the tx
> lock fails or the queue gets full).
>
> OTOH, there might be a gain to even send out 2 skbs; the problem is in
> doing the extra processing before xmit and not at the time of xmit.
>
> Does this sound OK ? If so, I will add the code to implement the TODO for
> tx_queue_len checking too.
>   

I can't really argue about the numbers, but it seems to me that only
devices which *usually* have a sufficient queue length will support
this, and anyone setting the queue length of a gbit device to <16 is
begging for trouble anyway. So it doesn't really seem worth to bloat
the code for handling an insane configuration as long as it doesn't
break.


From krkumar2 at in.ibm.com  Fri Jul 20 05:09:18 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 17:39:18 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <46A09A85.7020500@trash.net>
Message-ID: <OFA7D22363.1BE2EB19-ON6525731E.0042AA56-6525731E.0042C514@in.ibm.com>

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007:

> I can't really argue about the numbers, but it seems to me that only
> devices which *usually* have a sufficient queue length will support
> this, and anyone setting the queue length of a gbit device to <16 is
> begging for trouble anyway. So it doesn't really seem worth to bloat
> the code for handling an insane configuration as long as it doesn't
> break.

Ah, I get your point now. So if the driver sets BATCHING and the user then
sets queue_len to (say) 4, then poor results are expected (and the kernel
doesn't need to try to fix it). Same for a driver setting BATCHING when its
queue is small in the first place, which no driver writer should do anyway.
I think it makes the code a lot easier too. Will update.

thanks,

- KK


From krkumar2 at in.ibm.com  Fri Jul 20 05:25:18 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 17:55:18 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <46A09A85.7020500@trash.net>
Message-ID: <OF5FB5DB1E.711307CE-ON6525731E.0043AC4C-6525731E.00443C1B@in.ibm.com>

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 04:50:37 PM:

> 32 bytes? I count 16, - 4 for the pointer, so it's 12 bytes of waste.
> If you'd use it for gso_skb it would come down to 8 bytes. struct
> net_device is a pig already, and there are better ways to reduce this
> than starting to allocate single members with a few bytes IMO.

Currently, this allocated pointer is an indication to let kernel users
(qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
is enabled or disabled. Removing the pointer and making it static means
those users cannot figure out this information. Adding another field to
netdev may be a bad idea, so I am thinking of overloading dev->features
to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS
bit. Does this approach sound OK ?

Thanks,

- KK


From kaber at trash.net  Fri Jul 20 05:37:06 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 14:37:06 +0200
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <OF5FB5DB1E.711307CE-ON6525731E.0043AC4C-6525731E.00443C1B@in.ibm.com>
References: <OF5FB5DB1E.711307CE-ON6525731E.0043AC4C-6525731E.00443C1B@in.ibm.com>
Message-ID: <46A0AC72.7090707@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 04:50:37 PM:
>
>   
>> 32 bytes? I count 16, - 4 for the pointer, so it's 12 bytes of waste.
>> If you'd use it for gso_skb it would come down to 8 bytes. struct
>> net_device is a pig already, and there are better ways to reduce this
>> than starting to allocate single members with a few bytes IMO.
>>     
>
> Currently, this allocated pointer is an indication to let kernel users
> (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
> is enabled or disabled. Removing the pointer and making it static means
> those users cannot figure out this information. Adding another field to
> netdev may be a bad idea, so I am thinking of overloading dev->features
> to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
> capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS
> bit. Does this approach sound OK ?
>   

I guess so. It would be more consistent with things like HW checksumming
etc. though to handle this through ethtool and have the ethtool callbacks
set or clear just the one feature bit. That would mean you don't need
to provide further indication of the device's capabilities to the stack
since only the driver enables or disables the feature.
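
A minimal sketch of the single-feature-bit idea, assuming an ethtool-style
"set" callback implemented by the driver. The bit value, the struct and the
function names are hypothetical (the real NETIF_F_BATCH_SKBS value comes from
the patch set, and no actual ethtool operation is shown here); the point is
only that the stack tests one bit and needs no second indicator.

```c
#include <stdbool.h>

/* Hypothetical bit value standing in for NETIF_F_BATCH_SKBS. */
#define NETIF_F_BATCH_SKBS_SKETCH (1u << 20)

/* Minimal stand-in for the part of struct net_device used here. */
struct netdev_sketch {
	unsigned int features;
	bool can_batch;	/* whether the driver implements the batching xmit hook */
};

/* What an ethtool-style "set" callback in the driver might do. */
static int set_batching(struct netdev_sketch *dev, bool enable)
{
	if (enable && !dev->can_batch)
		return -1;	/* driver cannot batch: refuse the request */
	if (enable)
		dev->features |= NETIF_F_BATCH_SKBS_SKETCH;
	else
		dev->features &= ~NETIF_F_BATCH_SKBS_SKETCH;
	return 0;
}

/* The stack only needs to test the one bit. */
static bool batching_enabled(const struct netdev_sketch *dev)
{
	return (dev->features & NETIF_F_BATCH_SKBS_SKETCH) != 0;
}
```

Because only a driver that supports batching can ever set the bit, a set bit
implies both "capable" and "enabled".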


From krkumar2 at in.ibm.com  Fri Jul 20 05:33:56 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 18:03:56 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
Message-ID: <OF36BA88CF.246FB4F2-ON6525731E.0044EE77-6525731E.004506A3@in.ibm.com>


(My Notes crashed when I hit the Send button, so not sure if this went
out).

__________________

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 04:50:37 PM:

> 32 bytes? I count 16, - 4 for the pointer, so it's 12 bytes of waste.
> If you'd use it for gso_skb it would come down to 8 bytes. struct
> net_device is a pig already, and there are better ways to reduce this
> than starting to allocate single members with a few bytes IMO.

Currently, this allocated pointer is an indication to let kernel users
(qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
is enabled or disabled. Removing the pointer and making it static means
those users cannot figure out this information. Adding another field to
netdev may be a bad idea, so I am thinking of overloading dev->features
to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS
bit. Does this approach sound OK ?

Thanks,

- KK


From johnpol at 2ka.mipt.ru  Fri Jul 20 05:54:23 2007
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Fri, 20 Jul 2007 16:54:23 +0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720125423.GB13468@2ka.mipt.ru>

Hi Krishna.

On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote:
> After fine-tuning qdisc and other changes, I modified IPoIB to use this API,
> and now get good gains. Summary for TCP & No Delay: 1 process improves for
> all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> was tested with 1 process netperf with small increase in BW but big
> improvement in Service Demand. Netperf latency tests show small drop in
> transaction rate (results in separate attachment).

What about round-robin TCP time and latency tests? In theory such a batching
mode should not change those timings, but practice can reveal new aspects.
I will review the code later this week (likely tomorrow) and get back to you
if there are any issues.

-- 
	Evgeniy Polyakov


From krkumar2 at in.ibm.com  Fri Jul 20 06:02:50 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 18:32:50 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720125423.GB13468@2ka.mipt.ru>
Message-ID: <OF7D4124DB.BCA4AB16-ON6525731E.00475350-6525731E.0047AC06@in.ibm.com>

Hi Evgeniy,

Evgeniy Polyakov <johnpol at 2ka.mipt.ru> wrote on 07/20/2007 06:24:23 PM:

> > After fine-tuning qdisc and other changes, I modified IPoIB to use
> > this API, and now get good gains. Summary for TCP & No Delay: 1 process
> > improves for all cases from 1.4% to 49.5%; 4 process has almost
> > identical improvements from -1.7% to 59.1%; 16 process case also
> > improves in the range of -1.2% to 33.4%; while 64 process doesn't have
> > much improvement (-3.3% to 12.4%). UDP was tested with 1 process
> > netperf with small increase in BW but big improvement in Service
> > Demand. Netperf latency tests show small drop in transaction rate
> > (results in separate attachment).
>
> What about round-robin TCP time and latency tests? In theory such a batching
> mode should not change those timings, but practice can reveal new aspects.
> I will review the code later this week (likely tomorrow) and get back to you
> if there are any issues.

I had run the RR test quite some time back and don't have the results at
this time, other than remembering that they were almost the same as the
original. As I am running some tests on those systems at this time, I can
send the RR results tomorrow.

Thanks,

- KK


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 06:48:35 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 15:48:35 +0200
Subject: [ofa-general] [PATCH 0/5] ehca: MR large page, small queue and fixes
Message-ID: <200707201548.36047.hnguyen@linux.vnet.ibm.com>

Here is a patch set for ehca against Roland's git, branch for-2.6.23.
It adds support for MR large pages and small queues. In addition, it
contains various small fixes addressing earlier review comments and
issues we found ourselves.

In detail:
[1/5] adds support for MR large page
[2/5] generates event when SRQ limit reached
[3/5] makes ehca2ib_return_code() non inline
[4/5] makes internal_create/destroy_qp() static
[5/5] adds support for small queues

The patches should apply cleanly, in order, against Roland's git. Please
review the changes and apply the patches if they are okay.

Regards,
Nam & Stefan


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 07:01:51 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 16:01:51 +0200
Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs
Message-ID: <200707201601.52277.hnguyen@linux.vnet.ibm.com>

From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
Date: Thu, 19 Jul 2007 20:48:04 +0200
Subject: [PATCH 1/5] IB/ehca: Support large page MRs

Add support for MR pages larger than 4K on eHCA2. This reduces firmware
memory consumption. If enabled via the mr_largepage module parameter, the MR
page size will be determined based on the MR length and the hardware
capabilities - if the MR is >= 16M, 16M pages are used, for example.
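
As an illustration only (not part of the patch), the selection rule can be
sketched in standalone C. The helper names pick_hwpage_size(),
encode_hwpage_size() and num_hwpages() are hypothetical. The thresholds
follow the ehca_reg_user_mr() hunk in the diff, where any MR longer than 1M
gets 16M pages, and the round-up division stands in for the driver's
NUM_CHUNKS(), whose definition is assumed here.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes mirroring the EHCA_MR_PGSIZE* values used in the patch. */
enum {
	PGSIZE_4K  = 0x1000,
	PGSIZE_64K = 0x10000,
	PGSIZE_1M  = 0x100000,
	PGSIZE_16M = 0x1000000
};

/*
 * Hypothetical helper: choose the hw page size for a user MR.
 * largepage_enabled models the mr_largepage module parameter,
 * hca_supports_16m models the HCA_CAP_MR_PGSIZE_16M capability bit.
 */
static uint64_t pick_hwpage_size(uint64_t length, uint64_t kernel_page_size,
				 int largepage_enabled, int hca_supports_16m)
{
	if (!largepage_enabled || !hca_supports_16m)
		return PGSIZE_4K;
	if (length <= PGSIZE_4K && kernel_page_size == PGSIZE_4K)
		return PGSIZE_4K;
	if (length <= PGSIZE_64K)
		return PGSIZE_64K;
	if (length <= PGSIZE_1M)
		return PGSIZE_1M;
	return PGSIZE_16M;
}

/*
 * Map a page size to the hw code 0, 1, 2, 3 for 4K, 64K, 1M, 16M.
 * Each supported size is 16 times the previous one, hence the shift by 4.
 */
static uint32_t encode_hwpage_size(uint64_t pgsize)
{
	uint32_t idx = 0;

	/* Any other value would make the loop below spin forever. */
	assert(pgsize == PGSIZE_4K || pgsize == PGSIZE_64K ||
	       pgsize == PGSIZE_1M || pgsize == PGSIZE_16M);

	pgsize >>= 12;
	while (!(pgsize & 1)) {
		idx++;
		pgsize >>= 4;
	}
	return idx;
}

/* Assumed equivalent of NUM_CHUNKS((virt % pgsize) + length, pgsize). */
static uint64_t num_hwpages(uint64_t virt, uint64_t length, uint64_t pgsize)
{
	return ((virt % pgsize) + length + pgsize - 1) / pgsize;
}
```

Under these assumptions, a 2M MR starting at address 0x1234 needs 513 hw
pages at 4K but only one at 16M, which is presumably where the firmware
memory saving comes from.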

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |    9 +
 drivers/infiniband/hw/ehca/ehca_main.c    |   18 ++-
 drivers/infiniband/hw/ehca/ehca_mrmw.c    |  371 ++++++++++++++++++++++++-----
 drivers/infiniband/hw/ehca/ehca_mrmw.h    |    2 +-
 drivers/infiniband/hw/ehca/hcp_if.c       |   20 ++-
 5 files changed, 357 insertions(+), 63 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 043e4fb..63b8b9f 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -100,6 +100,11 @@ struct ehca_sport {
 	struct ehca_sma_attr saved_attr;
 };
 
+#define HCA_CAP_MR_PGSIZE_4K  1
+#define HCA_CAP_MR_PGSIZE_64K 2
+#define HCA_CAP_MR_PGSIZE_1M  4
+#define HCA_CAP_MR_PGSIZE_16M 8
+
 struct ehca_shca {
 	struct ib_device ib_device;
 	struct ibmebus_dev *ibmebus_dev;
@@ -115,6 +120,8 @@ struct ehca_shca {
 	struct h_galpas galpas;
 	struct mutex modify_mutex;
 	u64 hca_cap;
+	/* MR pgsize: bit 0-3 means 4K, 64K, 1M, 16M respectively */
+	u32 hca_cap_mr_pgsize;
 	int max_mtu;
 };
 
@@ -206,6 +213,7 @@ struct ehca_mr {
 	enum ehca_mr_flag flags;
 	u32 num_kpages;		/* number of kernel pages */
 	u32 num_hwpages;	/* number of hw pages to form MR */
+	u64 hwpage_size;	/* hw page size used for this MR */
 	int acl;		/* ACL (stored here for usage in reregister) */
 	u64 *start;		/* virtual start address (stored here for */
 				/* usage in reregister) */
@@ -240,6 +248,7 @@ struct ehca_mr_pginfo {
 	enum ehca_mr_pgi_type type;
 	u64 num_kpages;
 	u64 kpage_cnt;
+	u64 hwpage_size;     /* hw page size used for this MR */
 	u64 num_hwpages;     /* number of hw pages */
 	u64 hwpage_cnt;      /* counter for hw pages */
 	u64 next_hwpage;     /* next hw page in buffer/chunk/listelem */
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 36377c6..34661c3 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -63,6 +63,7 @@ int ehca_port_act_time = 30;
 int ehca_poll_all_eqs  = 1;
 int ehca_static_rate   = -1;
 int ehca_scaling_code  = 0;
+int ehca_mr_largepage  = 0;
 
 module_param_named(open_aqp1,     ehca_open_aqp1,     int, 0);
 module_param_named(debug_level,   ehca_debug_level,   int, 0);
@@ -72,7 +73,8 @@ module_param_named(use_hp_mr,     ehca_use_hp_mr,     int, 0);
 module_param_named(port_act_time, ehca_port_act_time, int, 0);
 module_param_named(poll_all_eqs,  ehca_poll_all_eqs,  int, 0);
 module_param_named(static_rate,   ehca_static_rate,   int, 0);
-module_param_named(scaling_code,   ehca_scaling_code,   int, 0);
+module_param_named(scaling_code,  ehca_scaling_code,  int, 0);
+module_param_named(mr_largepage,  ehca_mr_largepage,  int, 0);
 
 MODULE_PARM_DESC(open_aqp1,
 		 "AQP1 on startup (0: no (default), 1: yes)");
@@ -95,6 +97,9 @@ MODULE_PARM_DESC(static_rate,
 		 "set permanent static rate (default: disabled)");
 MODULE_PARM_DESC(scaling_code,
 		 "set scaling code (0: disabled/default, 1: enabled)");
+MODULE_PARM_DESC(mr_largepage,
+		 "use large page for MR (0: use PAGE_SIZE (default), "
+		 "1: use large page depending on MR size)");
 
 DEFINE_RWLOCK(ehca_qp_idr_lock);
 DEFINE_RWLOCK(ehca_cq_idr_lock);
@@ -295,6 +300,8 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
 			ehca_gen_dbg("   %s", hca_cap_descr[i].descr);
 
+	shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported;
+
 	port = (struct hipz_query_port *)rblock;
 	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
 	if (h_ret != H_SUCCESS) {
@@ -590,6 +597,14 @@ static ssize_t ehca_show_adapter_handle(struct device *dev,
 }
 static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL);
 
+static ssize_t ehca_show_mr_largepage(struct device *dev,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	return sprintf(buf, "%d\n", ehca_mr_largepage);
+}
+static DEVICE_ATTR(mr_largepage, S_IRUGO, ehca_show_mr_largepage, NULL);
+
 static struct attribute *ehca_dev_attrs[] = {
 	&dev_attr_adapter_handle.attr,
 	&dev_attr_num_ports.attr,
@@ -606,6 +621,7 @@ static struct attribute *ehca_dev_attrs[] = {
 	&dev_attr_cur_mw.attr,
 	&dev_attr_max_pd.attr,
 	&dev_attr_max_ah.attr,
+	&dev_attr_mr_largepage.attr,
 	NULL
 };
 
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 6262c54..ba28783 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -5,6 +5,7 @@
  *
  *  Authors: Dietmar Decker <ddecker at de.ibm.com>
  *           Christoph Raisch <raisch at de.ibm.com>
+ *           Hoang-Nam Nguyen <hnguyen at de.ibm.com>
  *
  *  Copyright (c) 2005 IBM Corporation
  *
@@ -56,6 +57,37 @@
 static struct kmem_cache *mr_cache;
 static struct kmem_cache *mw_cache;
 
+enum ehca_mr_pgsize {
+	EHCA_MR_PGSIZE4K  = 0x1000L,
+	EHCA_MR_PGSIZE64K = 0x10000L,
+	EHCA_MR_PGSIZE1M  = 0x100000L,
+	EHCA_MR_PGSIZE16M = 0x1000000L
+};
+
+extern int ehca_mr_largepage;
+
+static u32 ehca_encode_hwpage_size(u32 pgsize)
+{
+	u32 idx = 0;
+	pgsize >>= 12;
+	/*
+	 * map mr page size into hw code:
+	 * 0, 1, 2, 3 for 4K, 64K, 1M, 16M
+	 */
+	while (!(pgsize & 1)) {
+		idx++;
+		pgsize >>= 4;
+	}
+	return idx;
+}
+
+static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca)
+{
+	if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)
+		return EHCA_MR_PGSIZE16M;
+	return EHCA_MR_PGSIZE4K;
+}
+
 static struct ehca_mr *ehca_mr_new(void)
 {
 	struct ehca_mr *me;
@@ -207,19 +239,23 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd,
 		struct ehca_mr_pginfo pginfo;
 		u32 num_kpages;
 		u32 num_hwpages;
+		u64 hw_pgsize;
 
 		num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size,
 					PAGE_SIZE);
-		num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) +
-					 size, EHCA_PAGESIZE);
+		/* for kernel space we try the largest possible pgsize */
+		hw_pgsize = ehca_get_max_hwpage_size(shca);
+		num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size,
+					 hw_pgsize);
 		memset(&pginfo, 0, sizeof(pginfo));
 		pginfo.type = EHCA_MR_PGI_PHYS;
 		pginfo.num_kpages = num_kpages;
+		pginfo.hwpage_size = hw_pgsize;
 		pginfo.num_hwpages = num_hwpages;
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
-		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
-				      EHCA_PAGESIZE);
+		pginfo.next_hwpage =
+			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags,
 				  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
@@ -259,6 +295,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	int ret;
 	u32 num_kpages;
 	u32 num_hwpages;
+	u64 hwpage_size;
 
 	if (!pd) {
 		ehca_gen_err("bad pd=%p", pd);
@@ -309,16 +346,32 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 	/* determine number of MR pages */
 	num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE);
-	num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length,
-				 EHCA_PAGESIZE);
+	/* select proper hw_pgsize */
+	if (ehca_mr_largepage &&
+	    (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) {
+		if (length <= EHCA_MR_PGSIZE4K
+		    && PAGE_SIZE == EHCA_MR_PGSIZE4K)
+			hwpage_size = EHCA_MR_PGSIZE4K;
+		else if (length <= EHCA_MR_PGSIZE64K)
+			hwpage_size = EHCA_MR_PGSIZE64K;
+		else if (length <= EHCA_MR_PGSIZE1M)
+			hwpage_size = EHCA_MR_PGSIZE1M;
+		else
+			hwpage_size = EHCA_MR_PGSIZE16M;
+	} else
+		hwpage_size = EHCA_MR_PGSIZE4K;
+	ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size);
 
+reg_user_mr_fallback:
+	num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size);
 	/* register MR on HCA */
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_USER;
+	pginfo.hwpage_size = hwpage_size;
 	pginfo.num_kpages = num_kpages;
 	pginfo.num_hwpages = num_hwpages;
 	pginfo.u.usr.region = e_mr->umem;
-	pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE;
+	pginfo.next_hwpage = e_mr->umem->offset / hwpage_size;
 	pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk,
 						     (&e_mr->umem->chunk_list),
 						     list);
@@ -326,6 +379,18 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
 			  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
 			  &e_mr->ib.ib_mr.rkey);
+	if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) {
+		ehca_warn(pd->device, "failed to register mr "
+			  "with hwpage_size=%lx", hwpage_size);
+		ehca_info(pd->device, "try to register mr with "
+			  "kpage_size=%lx", PAGE_SIZE);
+		/*
+		 * this means kpages are not contiguous for a hw page;
+		 * try kernel page size as fallback solution
+		 */
+		hwpage_size = PAGE_SIZE;
+		goto reg_user_mr_fallback;
+	}
 	if (ret) {
 		ib_mr = ERR_PTR(ret);
 		goto reg_user_mr_exit2;
@@ -452,6 +517,8 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 	new_pd = container_of(mr->pd, struct ehca_pd, ib_pd);
 
 	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		u64 hw_pgsize = ehca_get_max_hwpage_size(shca);
+
 		new_start = iova_start;	/* change address */
 		/* check physical buffer list and calculate size */
 		ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array,
@@ -468,16 +535,17 @@ int ehca_rereg_phys_mr(struct ib_mr *mr,
 		}
 		num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) +
 					new_size, PAGE_SIZE);
-		num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) +
-					 new_size, EHCA_PAGESIZE);
+		num_hwpages = NUM_CHUNKS(((u64)new_start % hw_pgsize) +
+					 new_size, hw_pgsize);
 		memset(&pginfo, 0, sizeof(pginfo));
 		pginfo.type = EHCA_MR_PGI_PHYS;
 		pginfo.num_kpages = num_kpages;
+		pginfo.hwpage_size = hw_pgsize;
 		pginfo.num_hwpages = num_hwpages;
 		pginfo.u.phy.num_phys_buf = num_phys_buf;
 		pginfo.u.phy.phys_buf_array = phys_buf_array;
-		pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) /
-				      EHCA_PAGESIZE);
+		pginfo.next_hwpage =
+			((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize;
 	}
 	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
 		new_acl = mr_access_flags;
@@ -709,6 +777,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	int ret;
 	u32 tmp_lkey, tmp_rkey;
 	struct ehca_mr_pginfo pginfo;
+	u64 hw_pgsize;
 
 	/* check other parameters */
 	if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) &&
@@ -738,8 +807,8 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 		ib_fmr = ERR_PTR(-EINVAL);
 		goto alloc_fmr_exit0;
 	}
-	if (((1 << fmr_attr->page_shift) != EHCA_PAGESIZE) &&
-	    ((1 << fmr_attr->page_shift) != PAGE_SIZE)) {
+	hw_pgsize = ehca_get_max_hwpage_size(shca);
+	if ((1 << fmr_attr->page_shift) != hw_pgsize) {
 		ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x",
 			 fmr_attr->page_shift);
 		ib_fmr = ERR_PTR(-EINVAL);
@@ -755,6 +824,10 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 
 	/* register MR on HCA */
 	memset(&pginfo, 0, sizeof(pginfo));
+	/*
+	 * pginfo.num_hwpages==0, ie register_rpages() will not be called
+	 * but deferred to map_phys_fmr()
+	 */
 	ret = ehca_reg_mr(shca, e_fmr, NULL,
 			  fmr_attr->max_pages * (1 << fmr_attr->page_shift),
 			  mr_access_flags, e_pd, &pginfo,
@@ -765,6 +838,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd,
 	}
 
 	/* successful */
+	e_fmr->hwpage_size = hw_pgsize;
 	e_fmr->fmr_page_size = 1 << fmr_attr->page_shift;
 	e_fmr->fmr_max_pages = fmr_attr->max_pages;
 	e_fmr->fmr_max_maps = fmr_attr->max_maps;
@@ -822,10 +896,12 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr,
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_FMR;
 	pginfo.num_kpages = list_len;
-	pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE);
+	pginfo.hwpage_size = e_fmr->hwpage_size;
+	pginfo.num_hwpages =
+		list_len * e_fmr->fmr_page_size / pginfo.hwpage_size;
 	pginfo.u.fmr.page_list = page_list;
-	pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) /
-			      EHCA_PAGESIZE);
+	pginfo.next_hwpage =
+		(iova & (e_fmr->fmr_page_size-1)) / pginfo.hwpage_size;
 	pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size;
 
 	ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova,
@@ -964,7 +1040,7 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl);
 	if (ehca_use_hp_mr == 1)
 		hipz_acl |= 0x00000001;
 
@@ -987,6 +1063,7 @@ int ehca_reg_mr(struct ehca_shca *shca,
 	/* successful registration */
 	e_mr->num_kpages = pginfo->num_kpages;
 	e_mr->num_hwpages = pginfo->num_hwpages;
+	e_mr->hwpage_size = pginfo->hwpage_size;
 	e_mr->start = iova_start;
 	e_mr->size = size;
 	e_mr->acl = acl;
@@ -1029,6 +1106,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 	u32 i;
 	u64 *kpage;
 
+	if (!pginfo->num_hwpages) /* in case of fmr */
+		return 0;
+
 	kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!kpage) {
 		ehca_err(&shca->ib_device, "kpage alloc failed");
@@ -1036,7 +1116,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		goto ehca_reg_mr_rpages_exit0;
 	}
 
-	/* max 512 pages per shot */
+	/* max MAX_RPAGES ehca mr pages per register call */
 	for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) {
 
 		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
@@ -1049,8 +1129,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		ret = ehca_set_pagebuf(pginfo, rnum, kpage);
 		if (ret) {
 			ehca_err(&shca->ib_device, "ehca_set_pagebuf "
-					 "bad rc, ret=%x rnum=%x kpage=%p",
-					 ret, rnum, kpage);
+				 "bad rc, ret=%x rnum=%x kpage=%p",
+				 ret, rnum, kpage);
 			goto ehca_reg_mr_rpages_exit1;
 		}
 
@@ -1065,9 +1145,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca,
 		} else
 			rpage = *kpage;
 
-		h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr,
-						 0, /* pagesize 4k */
-						 0, rpage, rnum);
+		h_ret = hipz_h_register_rpage_mr(
+			shca->ipz_hca_handle, e_mr,
+			ehca_encode_hwpage_size(pginfo->hwpage_size),
+			0, rpage, rnum);
 
 		if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) {
 			/*
@@ -1131,7 +1212,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl);
 
 	kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!kpage) {
@@ -1182,6 +1263,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca,
 		 */
 		e_mr->num_kpages = pginfo->num_kpages;
 		e_mr->num_hwpages = pginfo->num_hwpages;
+		e_mr->hwpage_size = pginfo->hwpage_size;
 		e_mr->start = iova_start;
 		e_mr->size = size;
 		e_mr->acl = acl;
@@ -1268,13 +1350,14 @@ int ehca_rereg_mr(struct ehca_shca *shca,
 
 		/* set some MR values */
 		e_mr->flags = save_mr.flags;
+		e_mr->hwpage_size = save_mr.hwpage_size;
 		e_mr->fmr_page_size = save_mr.fmr_page_size;
 		e_mr->fmr_max_pages = save_mr.fmr_max_pages;
 		e_mr->fmr_max_maps = save_mr.fmr_max_maps;
 		e_mr->fmr_map_cnt = save_mr.fmr_map_cnt;
 
 		ret = ehca_reg_mr(shca, e_mr, iova_start, size, acl,
-				      e_pd, pginfo, lkey, rkey);
+				  e_pd, pginfo, lkey, rkey);
 		if (ret) {
 			u32 offset = (u64)(&e_mr->flags) - (u64)e_mr;
 			memcpy(&e_mr->flags, &(save_mr.flags),
@@ -1355,6 +1438,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 
 	/* set some MR values */
 	e_fmr->flags = save_fmr.flags;
+	e_fmr->hwpage_size = save_fmr.hwpage_size;
 	e_fmr->fmr_page_size = save_fmr.fmr_page_size;
 	e_fmr->fmr_max_pages = save_fmr.fmr_max_pages;
 	e_fmr->fmr_max_maps = save_fmr.fmr_max_maps;
@@ -1363,8 +1447,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_FMR;
-	pginfo.num_kpages = 0;
-	pginfo.num_hwpages = 0;
 	ret = ehca_reg_mr(shca, e_fmr, NULL,
 			  (e_fmr->fmr_max_pages * e_fmr->fmr_page_size),
 			  e_fmr->acl, e_pd, &pginfo, &tmp_lkey,
@@ -1373,7 +1455,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca,
 		u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr;
 		memcpy(&e_fmr->flags, &(save_mr.flags),
 		       sizeof(struct ehca_mr) - offset);
-		goto ehca_unmap_one_fmr_exit0;
 	}
 
 ehca_unmap_one_fmr_exit0:
@@ -1401,7 +1482,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl);
 
 	h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr,
 				    (u64)iova_start, hipz_acl, e_pd->fw_pd,
@@ -1420,6 +1501,7 @@ int ehca_reg_smr(struct ehca_shca *shca,
 	/* successful registration */
 	e_newmr->num_kpages = e_origmr->num_kpages;
 	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->hwpage_size   = e_origmr->hwpage_size;
 	e_newmr->start = iova_start;
 	e_newmr->size = e_origmr->size;
 	e_newmr->acl = acl;
@@ -1452,6 +1534,7 @@ int ehca_reg_internal_maxmr(
 	struct ib_phys_buf ib_pbuf;
 	u32 num_kpages;
 	u32 num_hwpages;
+	u64 hw_pgsize;
 
 	e_mr = ehca_mr_new();
 	if (!e_mr) {
@@ -1468,13 +1551,15 @@ int ehca_reg_internal_maxmr(
 	ib_pbuf.size = size_maxmr;
 	num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr,
 				PAGE_SIZE);
-	num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr,
-				 EHCA_PAGESIZE);
+	hw_pgsize = ehca_get_max_hwpage_size(shca);
+	num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size_maxmr,
+				 hw_pgsize);
 
 	memset(&pginfo, 0, sizeof(pginfo));
 	pginfo.type = EHCA_MR_PGI_PHYS;
 	pginfo.num_kpages = num_kpages;
 	pginfo.num_hwpages = num_hwpages;
+	pginfo.hwpage_size = hw_pgsize;
 	pginfo.u.phy.num_phys_buf = 1;
 	pginfo.u.phy.phys_buf_array = &ib_pbuf;
 
@@ -1523,7 +1608,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 	struct ehca_mr_hipzout_parms hipzout;
 
 	ehca_mrmw_map_acl(acl, &hipz_acl);
-	ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl);
+	ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl);
 
 	h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr,
 				    (u64)iova_start, hipz_acl, e_pd->fw_pd,
@@ -1539,6 +1624,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca,
 	/* successful registration */
 	e_newmr->num_kpages = e_origmr->num_kpages;
 	e_newmr->num_hwpages = e_origmr->num_hwpages;
+	e_newmr->hwpage_size = e_origmr->hwpage_size;
 	e_newmr->start = iova_start;
 	e_newmr->size = e_origmr->size;
 	e_newmr->acl = acl;
@@ -1684,6 +1770,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 	u64 pgaddr;
 	u32 i = 0;
 	u32 j = 0;
+	int hwpages_per_kpage = PAGE_SIZE / pginfo->hwpage_size;
 
 	/* loop over desired chunk entries */
 	chunk      = pginfo->u.usr.next_chunk;
@@ -1695,7 +1782,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 				<< PAGE_SHIFT ;
 			*kpage = phys_to_abs(pgaddr +
 					     (pginfo->next_hwpage *
-					      EHCA_PAGESIZE));
+					      pginfo->hwpage_size));
 			if ( !(*kpage) ) {
 				ehca_gen_err("pgaddr=%lx "
 					     "chunk->page_list[i]=%lx "
@@ -1708,8 +1795,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
 			kpage++;
-			if (pginfo->next_hwpage %
-			    (PAGE_SIZE / EHCA_PAGESIZE) == 0) {
+			if (pginfo->next_hwpage % hwpages_per_kpage == 0) {
 				(pginfo->kpage_cnt)++;
 				(pginfo->u.usr.next_nmap)++;
 				pginfo->next_hwpage = 0;
@@ -1738,6 +1824,143 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo,
 	return ret;
 }
 
+/*
+ * check given pages for contiguous layout
+ * last page addr is returned in prev_pgaddr for further check
+ */
+static int ehca_check_kpages_per_ate(struct scatterlist *page_list,
+				     int start_idx, int end_idx,
+				     u64 *prev_pgaddr)
+{
+	int t;
+	for (t = start_idx; t <= end_idx; t++) {
+		u64 pgaddr = page_to_pfn(page_list[t].page) << PAGE_SHIFT;
+		ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr,
+			     *(u64 *)abs_to_virt(phys_to_abs(pgaddr)));
+		if (pgaddr - PAGE_SIZE != *prev_pgaddr) {
+			ehca_gen_err("uncontiguous page found pgaddr=%lx "
+				     "prev_pgaddr=%lx page_list_i=%x",
+				     pgaddr, *prev_pgaddr, t);
+			return -EINVAL;
+		}
+		*prev_pgaddr = pgaddr;
+	}
+	return 0;
+}
+
+/* PAGE_SIZE < pginfo->hwpage_size */
+static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo,
+				  u32 number,
+				  u64 *kpage)
+{
+	int ret = 0;
+	struct ib_umem_chunk *prev_chunk;
+	struct ib_umem_chunk *chunk;
+	u64 pgaddr, prev_pgaddr;
+	u32 i = 0;
+	u32 j = 0;
+	int kpages_per_hwpage = pginfo->hwpage_size / PAGE_SIZE;
+	int nr_kpages = kpages_per_hwpage;
+
+	/* loop over desired chunk entries */
+	chunk      = pginfo->u.usr.next_chunk;
+	prev_chunk = pginfo->u.usr.next_chunk;
+	list_for_each_entry_continue(
+		chunk, (&(pginfo->u.usr.region->chunk_list)), list) {
+		for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) {
+			if (nr_kpages == kpages_per_hwpage) {
+				pgaddr = ( page_to_pfn(chunk->page_list[i].page)
+					   << PAGE_SHIFT );
+				*kpage = phys_to_abs(pgaddr);
+				if ( !(*kpage) ) {
+					ehca_gen_err("pgaddr=%lx i=%x",
+						     pgaddr, i);
+					ret = -EFAULT;
+					return ret;
+				}
+				/*
+				 * The first page in a hwpage must be aligned;
+				 * the first MR page is exempt from this rule.
+				 */
+				if (pgaddr & (pginfo->hwpage_size - 1)) {
+					if (pginfo->hwpage_cnt) {
+						ehca_gen_err(
+							"invalid alignment "
+							"pgaddr=%lx i=%x "
+							"mr_pgsize=%lx",
+							pgaddr, i,
+							pginfo->hwpage_size);
+						ret = -EFAULT;
+						return ret;
+					}
+					/* first MR page */
+					pginfo->kpage_cnt =
+						(pgaddr &
+						 (pginfo->hwpage_size - 1)) >>
+						PAGE_SHIFT;
+					nr_kpages -= pginfo->kpage_cnt;
+					*kpage = phys_to_abs(
+						pgaddr &
+						~(pginfo->hwpage_size - 1));
+				}
+				ehca_gen_dbg("kpage=%lx chunk_page=%lx "
+					     "value=%016lx", *kpage, pgaddr,
+					     *(u64 *)abs_to_virt(
+						     phys_to_abs(pgaddr)));
+				prev_pgaddr = pgaddr;
+				i++;
+				pginfo->kpage_cnt++;
+				pginfo->u.usr.next_nmap++;
+				nr_kpages--;
+				if (!nr_kpages)
+					goto next_kpage;
+				continue;
+			}
+			if (i + nr_kpages > chunk->nmap) {
+				ret = ehca_check_kpages_per_ate(
+					chunk->page_list, i,
+					chunk->nmap - 1, &prev_pgaddr);
+				if (ret) return ret;
+				pginfo->kpage_cnt += chunk->nmap - i;
+				pginfo->u.usr.next_nmap += chunk->nmap - i;
+				nr_kpages -= chunk->nmap - i;
+				break;
+			}
+
+			ret = ehca_check_kpages_per_ate(chunk->page_list, i,
+							i + nr_kpages - 1,
+							&prev_pgaddr);
+			if (ret) return ret;
+			i += nr_kpages;
+			pginfo->kpage_cnt += nr_kpages;
+			pginfo->u.usr.next_nmap += nr_kpages;
+next_kpage:
+			nr_kpages = kpages_per_hwpage;
+			(pginfo->hwpage_cnt)++;
+			kpage++;
+			j++;
+			if (j >= number) break;
+		}
+		if ((pginfo->u.usr.next_nmap >= chunk->nmap) &&
+		    (j >= number)) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+			break;
+		} else if (pginfo->u.usr.next_nmap >= chunk->nmap) {
+			pginfo->u.usr.next_nmap = 0;
+			prev_chunk = chunk;
+		} else if (j >= number)
+			break;
+		else
+			prev_chunk = chunk;
+	}
+	pginfo->u.usr.next_chunk =
+		list_prepare_entry(prev_chunk,
+				   (&(pginfo->u.usr.region->chunk_list)),
+				   list);
+	return ret;
+}
+
 int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 			  u32 number,
 			  u64 *kpage)
@@ -1750,9 +1973,10 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 	/* loop over desired phys_buf_array entries */
 	while (i < number) {
 		pbuf   = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf;
-		num_hw  = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) +
-				     pbuf->size, EHCA_PAGESIZE);
-		offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE;
+		num_hw  = NUM_CHUNKS((pbuf->addr % pginfo->hwpage_size) +
+				     pbuf->size, pginfo->hwpage_size);
+		offs_hw = (pbuf->addr & ~(pginfo->hwpage_size - 1)) /
+			pginfo->hwpage_size;
 		while (pginfo->next_hwpage < offs_hw + num_hw) {
 			/* sanity check */
 			if ((pginfo->kpage_cnt >= pginfo->num_kpages) ||
@@ -1768,21 +1992,23 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo,
 				return -EFAULT;
 			}
 			*kpage = phys_to_abs(
-				(pbuf->addr & EHCA_PAGEMASK)
-				+ (pginfo->next_hwpage * EHCA_PAGESIZE));
+				(pbuf->addr & ~(pginfo->hwpage_size - 1)) +
+				(pginfo->next_hwpage * pginfo->hwpage_size));
 			if ( !(*kpage) && pbuf->addr ) {
-				ehca_gen_err("pbuf->addr=%lx "
-					     "pbuf->size=%lx "
+				ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx "
 					     "next_hwpage=%lx", pbuf->addr,
-					     pbuf->size,
-					     pginfo->next_hwpage);
+					     pbuf->size, pginfo->next_hwpage);
 				return -EFAULT;
 			}
 			(pginfo->hwpage_cnt)++;
 			(pginfo->next_hwpage)++;
-			if (pginfo->next_hwpage %
-			    (PAGE_SIZE / EHCA_PAGESIZE) == 0)
-				(pginfo->kpage_cnt)++;
+			if (PAGE_SIZE >= pginfo->hwpage_size) {
+				if (pginfo->next_hwpage %
+				    (PAGE_SIZE / pginfo->hwpage_size) == 0)
+					(pginfo->kpage_cnt)++;
+			} else
+				pginfo->kpage_cnt += pginfo->hwpage_size /
+					PAGE_SIZE;
 			kpage++;
 			i++;
 			if (i >= number) break;
@@ -1806,8 +2032,8 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo,
 	/* loop over desired page_list entries */
 	fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem;
 	for (i = 0; i < number; i++) {
-		*kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) +
-				     pginfo->next_hwpage * EHCA_PAGESIZE);
+		*kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) +
+				     pginfo->next_hwpage * pginfo->hwpage_size);
 		if ( !(*kpage) ) {
 			ehca_gen_err("*fmrlist=%lx fmrlist=%p "
 				     "next_listelem=%lx next_hwpage=%lx",
@@ -1817,15 +2043,38 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo,
 			return -EFAULT;
 		}
 		(pginfo->hwpage_cnt)++;
-		(pginfo->next_hwpage)++;
-		kpage++;
-		if (pginfo->next_hwpage %
-		    (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) {
-			(pginfo->kpage_cnt)++;
-			(pginfo->u.fmr.next_listelem)++;
-			fmrlist++;
-			pginfo->next_hwpage = 0;
+		if (pginfo->u.fmr.fmr_pgsize >= pginfo->hwpage_size) {
+			if (pginfo->next_hwpage %
+			    (pginfo->u.fmr.fmr_pgsize /
+			     pginfo->hwpage_size) == 0) {
+				(pginfo->kpage_cnt)++;
+				(pginfo->u.fmr.next_listelem)++;
+				fmrlist++;
+				pginfo->next_hwpage = 0;
+			} else
+				(pginfo->next_hwpage)++;
+		} else {
+			unsigned int cnt_per_hwpage = pginfo->hwpage_size /
+				pginfo->u.fmr.fmr_pgsize;
+			unsigned int j;
+			u64 prev = *kpage;
+			/* check if addresses are contiguous */
+			for (j = 1; j < cnt_per_hwpage; j++) {
+				u64 p = phys_to_abs(fmrlist[j] &
+						    ~(pginfo->hwpage_size - 1));
+				if (prev + pginfo->u.fmr.fmr_pgsize != p) {
+					ehca_gen_err("uncontiguous fmr pages "
+						     "found prev=%lx p=%lx "
+						     "idx=%x", prev, p, i + j);
+					return -EINVAL;
+				}
+				prev = p;
+			}
+			pginfo->kpage_cnt += cnt_per_hwpage;
+			pginfo->u.fmr.next_listelem += cnt_per_hwpage;
+			fmrlist += cnt_per_hwpage;
 		}
+		kpage++;
 	}
 	return ret;
 }
@@ -1842,7 +2091,9 @@ int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo,
 		ret = ehca_set_pagebuf_phys(pginfo, number, kpage);
 		break;
 	case EHCA_MR_PGI_USER:
-		ret = ehca_set_pagebuf_user1(pginfo, number, kpage);
+		ret = PAGE_SIZE >= pginfo->hwpage_size ?
+			ehca_set_pagebuf_user1(pginfo, number, kpage) :
+			ehca_set_pagebuf_user2(pginfo, number, kpage);
 		break;
 	case EHCA_MR_PGI_FMR:
 		ret = ehca_set_pagebuf_fmr(pginfo, number, kpage);
@@ -1895,9 +2146,9 @@ void ehca_mrmw_map_acl(int ib_acl,
 /*----------------------------------------------------------------------*/
 
 /* sets page size in hipz access control for MR/MW. */
-void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/
+void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl) /*INOUT*/
 {
-	return; /* HCA supports only 4k */
+	*hipz_acl |= (ehca_encode_hwpage_size(pgsize) << 24);
 } /* end ehca_mrmw_set_pgsize_hipz_acl() */
 
 /*----------------------------------------------------------------------*/
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h
index 24f13fe..bc8f4e3 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.h
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h
@@ -111,7 +111,7 @@ int ehca_mr_is_maxmr(u64 size,
 void ehca_mrmw_map_acl(int ib_acl,
 		       u32 *hipz_acl);
 
-void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl);
+void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl);
 
 void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl,
 			       int *ib_acl);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 3394e05..358796c 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -427,7 +427,8 @@ u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle,
 {
 	return ehca_plpar_hcall_norets(H_REGISTER_RPAGES,
 				       adapter_handle.handle,      /* r4  */
-				       queue_type | pagesize << 8, /* r5  */
+				       (u64)queue_type | ((u64)pagesize) << 8,
+				       /* r5  */
 				       resource_handle,	           /* r6  */
 				       logical_address_of_page,    /* r7  */
 				       count,	                   /* r8  */
@@ -724,6 +725,9 @@ u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle,
 	u64 ret;
 	u64 outs[PLPAR_HCALL9_BUFSIZE];
 
+	ehca_gen_dbg("kernel PAGE_SIZE=%x access_ctrl=%016x "
+		     "vaddr=%lx length=%lx",
+		     (u32)PAGE_SIZE, access_ctrl, vaddr, length);
 	ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs,
 				adapter_handle.handle,            /* r4 */
 				5,                                /* r5 */
@@ -746,8 +750,22 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle,
 			     const u64 logical_address_of_page,
 			     const u64 count)
 {
+	extern int ehca_debug_level;
 	u64 ret;
 
+	if (unlikely(ehca_debug_level >= 2)) {
+		if (count > 1) {
+			u64 *kpage;
+			int i;
+			kpage = (u64 *)abs_to_virt(logical_address_of_page);
+			for (i = 0; i < count; i++)
+				ehca_gen_dbg("kpage[%d]=%p",
+					     i, (void *)kpage[i]);
+		} else
+			ehca_gen_dbg("kpage=%p",
+				     (void *)logical_address_of_page);
+	}
+
 	if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) {
 		ehca_gen_err("logical_address_of_page not on a 4k boundary "
 			     "adapter_handle=%lx mr=%p mr_handle=%lx "
-- 
1.5.2


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 07:02:46 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 16:02:46 +0200
Subject: [ofa-general] [PATCH 3/5] ehca: Make ehca2ib_return_code()
	non-inline
Message-ID: <200707201602.46415.hnguyen@linux.vnet.ibm.com>

From: Joachim Fenkes <fenkes at de.ibm.com>
Date: Thu, 19 Jul 2007 21:13:57 +0200
Subject: [PATCH 3/5] IB/ehca: Make ehca2ib_return_code() non-inline

It isn't called anywhere in the main path, and making it non-inline saves
~1.5K of code.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c  |   17 +++++++++++++++++
 drivers/infiniband/hw/ehca/ehca_tools.h |   19 +------------------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 34661c3..3bd7afb 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -130,6 +130,23 @@ void ehca_free_fw_ctrlblock(void *ptr)
 }
 #endif
 
+int ehca2ib_return_code(u64 ehca_rc)
+{
+	switch (ehca_rc) {
+	case H_SUCCESS:
+		return 0;
+	case H_RESOURCE:             /* Resource in use */
+	case H_BUSY:
+		return -EBUSY;
+	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
+	case H_CONSTRAINED:          /* resource constraint */
+	case H_NO_MEM:
+		return -ENOMEM;
+	default:
+		return -EINVAL;
+	}
+}
+
 static int ehca_create_slab_caches(void)
 {
 	int ret;
diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h
index 678b813..57c77a7 100644
--- a/drivers/infiniband/hw/ehca/ehca_tools.h
+++ b/drivers/infiniband/hw/ehca/ehca_tools.h
@@ -154,24 +154,7 @@ extern int ehca_debug_level;
 #define EHCA_BMASK_GET(mask, value) \
 	(EHCA_BMASK_MASK(mask) & (((u64)(value)) >> EHCA_BMASK_SHIFTPOS(mask)))
 
-
 /* Converts ehca to ib return code */
-static inline int ehca2ib_return_code(u64 ehca_rc)
-{
-	switch (ehca_rc) {
-	case H_SUCCESS:
-		return 0;
-	case H_RESOURCE:             /* Resource in use */
-	case H_BUSY:
-		return -EBUSY;
-	case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */
-	case H_CONSTRAINED:          /* resource constraint */
-	case H_NO_MEM:
-		return -ENOMEM;
-	default:
-		return -EINVAL;
-	}
-}
-
+int ehca2ib_return_code(u64 ehca_rc);
 
 #endif /* EHCA_TOOLS_H */
-- 
1.5.2


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 07:02:18 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 16:02:18 +0200
Subject: [ofa-general] [PATCH 2/5] ehca: Generate event when SRQ limit
	reached
Message-ID: <200707201602.19142.hnguyen@linux.vnet.ibm.com>

From: Joachim Fenkes <fenkes at de.ibm.com>
Date: Thu, 19 Jul 2007 20:51:43 +0200
Subject: [PATCH 2/5] IB/ehca: Generate event when SRQ limit reached

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_irq.c |   42 ++++++++++++++++++++++-----------
 1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 4fb01fc..71c0799 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -175,9 +175,8 @@ error_data1:
 
 }
 
-static void qp_event_callback(struct ehca_shca *shca,
-			      u64 eqe,
-			      enum ib_event_type event_type)
+static void qp_event_callback(struct ehca_shca *shca, u64 eqe,
+			      enum ib_event_type event_type, int fatal)
 {
 	struct ib_event event;
 	struct ehca_qp *qp;
@@ -191,16 +190,26 @@ static void qp_event_callback(struct ehca_shca *shca,
 	if (!qp)
 		return;
 
-	ehca_error_data(shca, qp, qp->ipz_qp_handle.handle);
+	if (fatal)
+		ehca_error_data(shca, qp, qp->ipz_qp_handle.handle);
 
-	if (!qp->ib_qp.event_handler)
-		return;
+	event.device = &shca->ib_device;
 
-	event.device     = &shca->ib_device;
-	event.event      = event_type;
-	event.element.qp = &qp->ib_qp;
+	if (qp->ext_type == EQPT_SRQ) {
+		if (!qp->ib_srq.event_handler)
+			return;
 
-	qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context);
+		event.event = fatal ? IB_EVENT_SRQ_ERR : event_type;
+		event.element.srq = &qp->ib_srq;
+		qp->ib_srq.event_handler(&event, qp->ib_srq.srq_context);
+	} else {
+		if (!qp->ib_qp.event_handler)
+			return;
+
+		event.event = event_type;
+		event.element.qp = &qp->ib_qp;
+		qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context);
+	}
 
 	return;
 }
@@ -234,17 +243,17 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe)
 
 	switch (identifier) {
 	case 0x02: /* path migrated */
-		qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG);
+		qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG, 0);
 		break;
 	case 0x03: /* communication established */
-		qp_event_callback(shca, eqe, IB_EVENT_COMM_EST);
+		qp_event_callback(shca, eqe, IB_EVENT_COMM_EST, 0);
 		break;
 	case 0x04: /* send queue drained */
-		qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED);
+		qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED, 0);
 		break;
 	case 0x05: /* QP error */
 	case 0x06: /* QP error */
-		qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL);
+		qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL, 1);
 		break;
 	case 0x07: /* CQ error */
 	case 0x08: /* CQ error */
@@ -278,6 +287,11 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe)
 		ehca_err(&shca->ib_device, "Interface trace stopped.");
 		break;
 	case 0x14: /* first error capture info available */
+		ehca_info(&shca->ib_device, "First error capture available");
+		break;
+	case 0x15: /* SRQ limit reached */
+		qp_event_callback(shca, eqe, IB_EVENT_SRQ_LIMIT_REACHED, 0);
+		break;
 	default:
 		ehca_err(&shca->ib_device, "Unknown identifier: %x on %s.",
 			 identifier, shca->ib_device.name);
-- 
1.5.2


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 07:03:09 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 16:03:09 +0200
Subject: [ofa-general] [PATCH 4/5] ehca: Make internal_create/destroy_qp()
	static
Message-ID: <200707201603.10321.hnguyen@linux.vnet.ibm.com>

From: Joachim Fenkes <fenkes at de.ibm.com>
Date: Thu, 19 Jul 2007 21:40:00 +0200
Subject: [PATCH 4/5] IB/ehca: Make internal_{create,destroy}_qp() static

They're only used in ehca_qp.c.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_qp.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 48e9cea..b916d9c 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -363,10 +363,11 @@ init_qp_queue1:
  * the value of the is_srq parameter. If init_attr and srq_init_attr share
  * fields, the field out of init_attr is used.
  */
-struct ehca_qp *internal_create_qp(struct ib_pd *pd,
-				   struct ib_qp_init_attr *init_attr,
-				   struct ib_srq_init_attr *srq_init_attr,
-				   struct ib_udata *udata, int is_srq)
+static struct ehca_qp *internal_create_qp(
+	struct ib_pd *pd,
+	struct ib_qp_init_attr *init_attr,
+	struct ib_srq_init_attr *srq_init_attr,
+	struct ib_udata *udata, int is_srq)
 {
 	struct ehca_qp *my_qp;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
@@ -752,8 +753,8 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	return IS_ERR(ret) ? (struct ib_qp *)ret : &ret->ib_qp;
 }
 
-int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
-			struct ib_uobject *uobject);
+static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
+			       struct ib_uobject *uobject);
 
 struct ib_srq *ehca_create_srq(struct ib_pd *pd,
 			       struct ib_srq_init_attr *srq_init_attr,
@@ -1669,8 +1670,8 @@ query_srq_exit1:
 	return ret;
 }
 
-int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
-			struct ib_uobject *uobject)
+static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
+			       struct ib_uobject *uobject)
 {
 	struct ehca_shca *shca = container_of(dev, struct ehca_shca, ib_device);
 	struct ehca_pd *my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd,
-- 
1.5.2


From hnguyen at linux.vnet.ibm.com  Fri Jul 20 07:04:17 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 20 Jul 2007 16:04:17 +0200
Subject: [ofa-general] [PATCH 5/5] ehca: Support small QP queues
Message-ID: <200707201604.17991.hnguyen@linux.vnet.ibm.com>

From: Stefan Roscher <stefan.roscher at de.ibm.com>
Date: Fri, 20 Jul 2007 13:59:14 +0200
Subject: [PATCH 5/5] IB/ehca: Small QP queues

eHCA2 supports QP queues that can be as small as 512 bytes. This greatly
reduces memory overhead for consumers that use lots of QPs with small queues
(e.g. RDMA-only QPs). Apart from dealing with firmware, this code needs to
manage bite-sized chunks of kernel pages, making sure that no kernel page is
shared between different protection domains.
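
As a rough illustration of the sizing rule this patch applies in
ehca_determine_small_queue(), the following user-space sketch rounds the
requested SGE count, computes the queue size and picks the 512-byte or
1024-byte class. All sketch_* names are hypothetical. The 64-byte WQE header
and 16-byte SGE size are assumptions inferred from the "power of 2" comment
in the patch; they are not taken from struct ehca_wqe.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed WQE layout (not taken from struct ehca_wqe): 64-byte header
 * followed by 16-byte scatter/gather entries. */
enum { SKETCH_WQE_HDR = 64, SKETCH_SGE_SIZE = 16 };

/* Round the requested SGE count up along 4, 12, 28, 60, 124, 252,
 * mirroring the loop in the patch. */
static int sketch_round_sges(int req_nr_sge)
{
	int n;

	for (n = 4; n <= 252; n = 4 + 2 * n)
		if (n >= req_nr_sge)
			break;
	return n;
}

/* WQE size in bytes: message size for low-latency QPs, otherwise
 * header plus SGE list under the assumed layout. */
static uint32_t sketch_wqe_size(int nr_sge, int is_llqp)
{
	if (is_llqp)
		return 128u << nr_sge;
	return SKETCH_WQE_HDR + SKETCH_SGE_SIZE * (uint32_t)nr_sge;
}

/* Return the page size code: 2 for 512 bytes, 3 for 1024 bytes,
 * 0 if the queue is too large to be a small queue. */
static int sketch_small_page_code(int max_wr, int req_nr_sge, int is_llqp)
{
	int nr_sge;
	uint32_t q_size;

	assert(max_wr >= 0 && req_nr_sge >= 0);
	nr_sge = is_llqp ? req_nr_sge : sketch_round_sges(req_nr_sge);
	q_size = sketch_wqe_size(nr_sge, is_llqp) * (uint32_t)(max_wr + 1);

	if (q_size <= 512)
		return 2;
	if (q_size <= 1024)
		return 3;
	return 0;
}

/* Decode a page size code the way init_qp_queue() does: 128 << code. */
static uint32_t sketch_page_bytes(int code)
{
	return 128u << code;
}
```

Under these assumptions, a queue with max_wr = 3 and 4 SGEs needs
4 * 128 = 512 bytes and falls into the 512-byte class, while max_wr = 8
with the same SGE count needs 1152 bytes and is not a small queue.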

Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |   41 ++++--
 drivers/infiniband/hw/ehca/ehca_cq.c      |    8 +-
 drivers/infiniband/hw/ehca/ehca_eq.c      |    8 +-
 drivers/infiniband/hw/ehca/ehca_main.c    |   14 ++-
 drivers/infiniband/hw/ehca/ehca_pd.c      |   25 +++-
 drivers/infiniband/hw/ehca/ehca_qp.c      |  163 +++++++++++++---------
 drivers/infiniband/hw/ehca/ehca_uverbs.c  |    2 +-
 drivers/infiniband/hw/ehca/hcp_if.c       |   30 +++--
 drivers/infiniband/hw/ehca/ipz_pt_fn.c    |  222 ++++++++++++++++++++++-------
 drivers/infiniband/hw/ehca/ipz_pt_fn.h    |   26 +++-
 10 files changed, 379 insertions(+), 160 deletions(-)
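
The "bite-sized chunks of kernel pages" bookkeeping can be pictured with a
small user-space sketch: one bitmap bit per chunk and a fill counter, roughly
what alloc_small_queue_page() does with page->bitmap and page->fill. This is
a simplified sketch, not the driver code. The sketch_* names are
hypothetical, the 4096-byte kernel page is an assumption, and the per-PD
free/full lists and the mutex are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4096-byte kernel page, 512-byte minimum chunk. */
enum { SKETCH_KPAGE_SIZE = 4096, SKETCH_MIN_CHUNK = 512 };

struct sketch_small_page {
	uint32_t bitmap;   /* one bit per chunk, set = chunk in use */
	unsigned fill;     /* number of chunks currently in use */
};

/* Chunks per kernel page: order 0 = 512-byte chunks, order 1 = 1024. */
static unsigned sketch_chunks(int order)
{
	return (SKETCH_KPAGE_SIZE / SKETCH_MIN_CHUNK) >> order;
}

/* Claim the first free chunk; return its byte offset within the
 * kernel page, or -1 if every chunk is taken. */
static int sketch_chunk_alloc(struct sketch_small_page *p, int order)
{
	unsigned bit, n = sketch_chunks(order);

	for (bit = 0; bit < n; bit++) {
		if (!(p->bitmap & (1u << bit))) {
			p->bitmap |= 1u << bit;
			p->fill++;
			return (int)(bit * (unsigned)(SKETCH_MIN_CHUNK << order));
		}
	}
	return -1;
}

/* Release the chunk that starts at the given byte offset. */
static void sketch_chunk_free(struct sketch_small_page *p, int order,
			      int offset)
{
	unsigned bit = (unsigned)offset / (unsigned)(SKETCH_MIN_CHUNK << order);

	assert(p->bitmap & (1u << bit));
	p->bitmap &= ~(1u << bit);
	p->fill--;
}

/* A full page would move from the PD's free list to its full list. */
static int sketch_page_is_full(const struct sketch_small_page *p, int order)
{
	return p->fill == sketch_chunks(order);
}
```

Because each such page hangs off exactly one protection domain in the patch,
chunks handed out from it never end up shared between PDs; the sketch only
models the per-page part of that scheme.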

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 63b8b9f..3725aa8 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -43,7 +43,6 @@
 #ifndef __EHCA_CLASSES_H__
 #define __EHCA_CLASSES_H__
 
-
 struct ehca_module;
 struct ehca_qp;
 struct ehca_cq;
@@ -129,6 +128,10 @@ struct ehca_pd {
 	struct ib_pd ib_pd;
 	struct ipz_pd fw_pd;
 	u32 ownpid;
+	/* small queue mgmt */
+	struct mutex lock;
+	struct list_head free[2];
+	struct list_head full[2];
 };
 
 enum ehca_ext_qp_type {
@@ -307,6 +310,8 @@ int ehca_init_av_cache(void);
 void ehca_cleanup_av_cache(void);
 int ehca_init_mrmw_cache(void);
 void ehca_cleanup_mrmw_cache(void);
+int ehca_init_small_qp_cache(void);
+void ehca_cleanup_small_qp_cache(void);
 
 extern rwlock_t ehca_qp_idr_lock;
 extern rwlock_t ehca_cq_idr_lock;
@@ -324,7 +329,7 @@ struct ipzu_queue_resp {
 	u32 queue_length; /* queue length allocated in bytes */
 	u32 pagesize;
 	u32 toggle_state;
-	u32 dummy; /* padding for 8 byte alignment */
+	u32 offset; /* save offset within a page for small_qp */
 };
 
 struct ehca_create_cq_resp {
@@ -366,15 +371,29 @@ enum ehca_ll_comp_flags {
 	LLQP_COMP_MASK = 0x60,
 };
 
+struct ehca_alloc_queue_parms {
+	/* input parameters */
+	int max_wr;
+	int max_sge;
+	int page_size;
+	int is_small;
+
+	/* output parameters */
+	u16 act_nr_wqes;
+	u8  act_nr_sges;
+	u32 queue_size; /* bytes for small queues, pages otherwise */
+};
+
 struct ehca_alloc_qp_parms {
-/* input parameters */
+	struct ehca_alloc_queue_parms squeue;
+	struct ehca_alloc_queue_parms rqueue;
+
+	/* input parameters */
 	enum ehca_service_type servicetype;
+	int qp_storage;
 	int sigtype;
 	enum ehca_ext_qp_type ext_type;
 	enum ehca_ll_comp_flags ll_comp_flags;
-
-	int max_send_wr, max_recv_wr;
-	int max_send_sge, max_recv_sge;
 	int ud_av_l_key_ctl;
 
 	u32 token;
@@ -384,18 +403,10 @@ struct ehca_alloc_qp_parms {
 
 	u32 srq_qpn, srq_token, srq_limit;
 
-/* output parameters */
+	/* output parameters */
 	u32 real_qp_num;
 	struct ipz_qp_handle qp_handle;
 	struct h_galpas galpas;
-
-	u16 act_nr_send_wqes;
-	u16 act_nr_recv_wqes;
-	u8  act_nr_recv_sges;
-	u8  act_nr_send_sges;
-
-	u32 nr_rq_pages;
-	u32 nr_sq_pages;
 };
 
 int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp);
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 9e87883..5746787 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -190,8 +190,8 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 		goto create_cq_exit2;
 	}
 
-	ipz_rc = ipz_queue_ctor(&my_cq->ipz_queue, param.act_pages,
-				EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0);
+	ipz_rc = ipz_queue_ctor(NULL, &my_cq->ipz_queue, param.act_pages,
+				EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0, 0);
 	if (!ipz_rc) {
 		ehca_err(device, "ipz_queue_ctor() failed ipz_rc=%x device=%p",
 			 ipz_rc, device);
@@ -285,7 +285,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	return cq;
 
 create_cq_exit4:
-	ipz_queue_dtor(&my_cq->ipz_queue);
+	ipz_queue_dtor(NULL, &my_cq->ipz_queue);
 
 create_cq_exit3:
 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1);
@@ -359,7 +359,7 @@ int ehca_destroy_cq(struct ib_cq *cq)
 			 "ehca_cq=%p cq_num=%x", h_ret, my_cq, cq_num);
 		return ehca2ib_return_code(h_ret);
 	}
-	ipz_queue_dtor(&my_cq->ipz_queue);
+	ipz_queue_dtor(NULL, &my_cq->ipz_queue);
 	kmem_cache_free(cq_cache, my_cq);
 
 	return 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c
index 4825975..1d41faa 100644
--- a/drivers/infiniband/hw/ehca/ehca_eq.c
+++ b/drivers/infiniband/hw/ehca/ehca_eq.c
@@ -86,8 +86,8 @@ int ehca_create_eq(struct ehca_shca *shca,
 		return -EINVAL;
 	}
 
-	ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages,
-			     EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0);
+	ret = ipz_queue_ctor(NULL, &eq->ipz_queue, nr_pages,
+			     EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0, 0);
 	if (!ret) {
 		ehca_err(ib_dev, "Can't allocate EQ pages eq=%p", eq);
 		goto create_eq_exit1;
@@ -145,7 +145,7 @@ int ehca_create_eq(struct ehca_shca *shca,
 	return 0;
 
 create_eq_exit2:
-	ipz_queue_dtor(&eq->ipz_queue);
+	ipz_queue_dtor(NULL, &eq->ipz_queue);
 
 create_eq_exit1:
 	hipz_h_destroy_eq(shca->ipz_hca_handle, eq);
@@ -181,7 +181,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq)
 		ehca_err(&shca->ib_device, "Can't free EQ resources.");
 		return -EINVAL;
 	}
-	ipz_queue_dtor(&eq->ipz_queue);
+	ipz_queue_dtor(NULL, &eq->ipz_queue);
 
 	return 0;
 }
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 3bd7afb..e09a2ae 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -181,6 +181,12 @@ static int ehca_create_slab_caches(void)
 		goto create_slab_caches5;
 	}
 
+	ret = ehca_init_small_qp_cache();
+	if (ret) {
+		ehca_gen_err("Cannot create small queue SLAB cache.");
+		goto create_slab_caches6;
+	}
+
 #ifdef CONFIG_PPC_64K_PAGES
 	ctblk_cache = kmem_cache_create("ehca_cache_ctblk",
 					EHCA_PAGESIZE, H_CB_ALIGNMENT,
@@ -188,12 +194,15 @@ static int ehca_create_slab_caches(void)
 					NULL, NULL);
 	if (!ctblk_cache) {
 		ehca_gen_err("Cannot create ctblk SLAB cache.");
-		ehca_cleanup_mrmw_cache();
-		goto create_slab_caches5;
+		ehca_cleanup_small_qp_cache();
+		goto create_slab_caches6;
 	}
 #endif
 	return 0;
 
+create_slab_caches6:
+	ehca_cleanup_mrmw_cache();
+
 create_slab_caches5:
 	ehca_cleanup_av_cache();
 
@@ -211,6 +220,7 @@ create_slab_caches2:
 
 static void ehca_destroy_slab_caches(void)
 {
+	ehca_cleanup_small_qp_cache();
 	ehca_cleanup_mrmw_cache();
 	ehca_cleanup_av_cache();
 	ehca_cleanup_qp_cache();
diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c
index 79d0591..79d5bc8 100644
--- a/drivers/infiniband/hw/ehca/ehca_pd.c
+++ b/drivers/infiniband/hw/ehca/ehca_pd.c
@@ -49,6 +49,7 @@ struct ib_pd *ehca_alloc_pd(struct ib_device *device,
 			    struct ib_ucontext *context, struct ib_udata *udata)
 {
 	struct ehca_pd *pd;
+	int i;
 
 	pd = kmem_cache_zalloc(pd_cache, GFP_KERNEL);
 	if (!pd) {
@@ -58,6 +59,11 @@ struct ib_pd *ehca_alloc_pd(struct ib_device *device,
 	}
 
 	pd->ownpid = current->tgid;
+	for (i = 0; i < 2; i++) {
+		INIT_LIST_HEAD(&pd->free[i]);
+		INIT_LIST_HEAD(&pd->full[i]);
+	}
+	mutex_init(&pd->lock);
 
 	/*
 	 * Kernel PD: when device = -1, 0
@@ -81,6 +87,9 @@ int ehca_dealloc_pd(struct ib_pd *pd)
 {
 	u32 cur_pid = current->tgid;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
+	int i, leftovers = 0;
+	extern struct kmem_cache *small_qp_cache;
+	struct ipz_small_queue_page *page, *tmp;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
 	    my_pd->ownpid != cur_pid) {
@@ -89,8 +98,20 @@ int ehca_dealloc_pd(struct ib_pd *pd)
 		return -EINVAL;
 	}
 
-	kmem_cache_free(pd_cache,
-			container_of(pd, struct ehca_pd, ib_pd));
+	for (i = 0; i < 2; i++) {
+		list_splice(&my_pd->full[i], &my_pd->free[i]);
+		list_for_each_entry_safe(page, tmp, &my_pd->free[i], list) {
+			leftovers = 1;
+			free_page(page->page);
+			kmem_cache_free(small_qp_cache, page);
+		}
+	}
+
+	if (leftovers)
+		ehca_warn(pd->device,
+			  "Some small queue pages were not freed");
+
+	kmem_cache_free(pd_cache, my_pd);
 
 	return 0;
 }
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index b916d9c..6c6f9d9 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -275,34 +275,39 @@ static inline void queue2resp(struct ipzu_queue_resp *resp,
 	resp->toggle_state = queue->toggle_state;
 }
 
-static inline int ll_qp_msg_size(int nr_sge)
-{
-	return 128 << nr_sge;
-}
-
 /*
  * init_qp_queue initializes/constructs r/squeue and registers queue pages.
  */
 static inline int init_qp_queue(struct ehca_shca *shca,
+				struct ehca_pd *pd,
 				struct ehca_qp *my_qp,
 				struct ipz_queue *queue,
 				int q_type,
 				u64 expected_hret,
-				int nr_q_pages,
-				int wqe_size,
-				int nr_sges)
+				struct ehca_alloc_queue_parms *parms,
+				int wqe_size)
 {
-	int ret, cnt, ipz_rc;
+	int ret, cnt, ipz_rc, nr_q_pages;
 	void *vpage;
 	u64 rpage, h_ret;
 	struct ib_device *ib_dev = &shca->ib_device;
 	struct ipz_adapter_handle ipz_hca_handle = shca->ipz_hca_handle;
 
-	if (!nr_q_pages)
+	if (!parms->queue_size)
 		return 0;
 
-	ipz_rc = ipz_queue_ctor(queue, nr_q_pages, EHCA_PAGESIZE,
-				wqe_size, nr_sges);
+	if (parms->is_small) {
+		nr_q_pages = 1;
+		ipz_rc = ipz_queue_ctor(pd, queue, nr_q_pages,
+					128 << parms->page_size,
+					wqe_size, parms->act_nr_sges, 1);
+	} else {
+		nr_q_pages = parms->queue_size;
+		ipz_rc = ipz_queue_ctor(pd, queue, nr_q_pages,
+					EHCA_PAGESIZE, wqe_size,
+					parms->act_nr_sges, 0);
+	}
+
 	if (!ipz_rc) {
 		ehca_err(ib_dev, "Cannot allocate page for queue. ipz_rc=%x",
 			 ipz_rc);
@@ -323,7 +328,7 @@ static inline int init_qp_queue(struct ehca_shca *shca,
 		h_ret = hipz_h_register_rpage_qp(ipz_hca_handle,
 						 my_qp->ipz_qp_handle,
 						 NULL, 0, q_type,
-						 rpage, 1,
+						 rpage, parms->is_small ? 0 : 1,
 						 my_qp->galpas.kernel);
 		if (cnt == (nr_q_pages - 1)) {	/* last page! */
 			if (h_ret != expected_hret) {
@@ -354,10 +359,45 @@ static inline int init_qp_queue(struct ehca_shca *shca,
 	return 0;
 
 init_qp_queue1:
-	ipz_queue_dtor(queue);
+	ipz_queue_dtor(pd, queue);
 	return ret;
 }
 
+static inline int ehca_calc_wqe_size(int act_nr_sge, int is_llqp)
+{
+	if (is_llqp)
+		return 128 << act_nr_sge;
+	else
+		return offsetof(struct ehca_wqe,
+				u.nud.sg_list[act_nr_sge]);
+}
+
+static void ehca_determine_small_queue(struct ehca_alloc_queue_parms *queue,
+				       int req_nr_sge, int is_llqp)
+{
+	u32 wqe_size, q_size;
+	int act_nr_sge = req_nr_sge;
+
+	if (!is_llqp)
+		/* round up #SGEs so WQE size is a power of 2 */
+		for (act_nr_sge = 4; act_nr_sge <= 252;
+		     act_nr_sge = 4 + 2 * act_nr_sge)
+			if (act_nr_sge >= req_nr_sge)
+				break;
+
+	wqe_size = ehca_calc_wqe_size(act_nr_sge, is_llqp);
+	q_size = wqe_size * (queue->max_wr + 1);
+
+	if (q_size <= 512)
+		queue->page_size = 2;
+	else if (q_size <= 1024)
+		queue->page_size = 3;
+	else
+		queue->page_size = 0;
+
+	queue->is_small = (queue->page_size != 0);
+}
+
 /*
  * Create an ib_qp struct that is either a QP or an SRQ, depending on
  * the value of the is_srq parameter. If init_attr and srq_init_attr share
@@ -553,10 +593,20 @@ static struct ehca_qp *internal_create_qp(
 	if (my_qp->recv_cq)
 		parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle;
 
-	parms.max_send_wr = init_attr->cap.max_send_wr;
-	parms.max_recv_wr = init_attr->cap.max_recv_wr;
-	parms.max_send_sge = max_send_sge;
-	parms.max_recv_sge = max_recv_sge;
+	parms.squeue.max_wr = init_attr->cap.max_send_wr;
+	parms.rqueue.max_wr = init_attr->cap.max_recv_wr;
+	parms.squeue.max_sge = max_send_sge;
+	parms.rqueue.max_sge = max_recv_sge;
+
+	if (EHCA_BMASK_GET(HCA_CAP_MINI_QP, shca->hca_cap)
+	    && !(context && udata)) { /* no small QP support in userspace ATM */
+		ehca_determine_small_queue(
+			&parms.squeue, max_send_sge, is_llqp);
+		ehca_determine_small_queue(
+			&parms.rqueue, max_recv_sge, is_llqp);
+		parms.qp_storage =
+			(parms.squeue.is_small || parms.rqueue.is_small);
+	}
 
 	h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms);
 	if (h_ret != H_SUCCESS) {
@@ -570,50 +620,33 @@ static struct ehca_qp *internal_create_qp(
 	my_qp->ipz_qp_handle = parms.qp_handle;
 	my_qp->galpas = parms.galpas;
 
+	swqe_size = ehca_calc_wqe_size(parms.squeue.act_nr_sges, is_llqp);
+	rwqe_size = ehca_calc_wqe_size(parms.rqueue.act_nr_sges, is_llqp);
+
 	switch (qp_type) {
 	case IB_QPT_RC:
-		if (!is_llqp) {
-			swqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
-					     (parms.act_nr_send_sges)]);
-			rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
-					     (parms.act_nr_recv_sges)]);
-		} else { /* for LLQP we need to use msg size, not wqe size */
-			swqe_size = ll_qp_msg_size(max_send_sge);
-			rwqe_size = ll_qp_msg_size(max_recv_sge);
-			parms.act_nr_send_sges = 1;
-			parms.act_nr_recv_sges = 1;
-		}
-		break;
-	case IB_QPT_UC:
-		swqe_size = offsetof(struct ehca_wqe,
-				     u.nud.sg_list[parms.act_nr_send_sges]);
-		rwqe_size = offsetof(struct ehca_wqe,
-				     u.nud.sg_list[parms.act_nr_recv_sges]);
+		if (is_llqp) {
+			parms.squeue.act_nr_sges = 1;
+			parms.rqueue.act_nr_sges = 1;
+		}
 		break;
-
 	case IB_QPT_UD:
 	case IB_QPT_GSI:
 	case IB_QPT_SMI:
+		/* UD circumvention */
 		if (is_llqp) {
-			swqe_size = ll_qp_msg_size(parms.act_nr_send_sges);
-			rwqe_size = ll_qp_msg_size(parms.act_nr_recv_sges);
-			parms.act_nr_send_sges = 1;
-			parms.act_nr_recv_sges = 1;
+			parms.squeue.act_nr_sges = 1;
+			parms.rqueue.act_nr_sges = 1;
 		} else {
-			/* UD circumvention */
-			parms.act_nr_send_sges -= 2;
-			parms.act_nr_recv_sges -= 2;
-			swqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[
-						     parms.act_nr_send_sges]);
-			rwqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[
-						     parms.act_nr_recv_sges]);
+			parms.squeue.act_nr_sges -= 2;
+			parms.rqueue.act_nr_sges -= 2;
 		}
 
 		if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) {
-			parms.act_nr_send_wqes = init_attr->cap.max_send_wr;
-			parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr;
-			parms.act_nr_send_sges = init_attr->cap.max_send_sge;
-			parms.act_nr_recv_sges = init_attr->cap.max_recv_sge;
+			parms.squeue.act_nr_wqes = init_attr->cap.max_send_wr;
+			parms.rqueue.act_nr_wqes = init_attr->cap.max_recv_wr;
+			parms.squeue.act_nr_sges = init_attr->cap.max_send_sge;
+			parms.rqueue.act_nr_sges = init_attr->cap.max_recv_sge;
 			ib_qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1;
 		}
 
@@ -626,10 +659,9 @@ static struct ehca_qp *internal_create_qp(
 	/* initialize r/squeue and register queue pages */
 	if (HAS_SQ(my_qp)) {
 		ret = init_qp_queue(
-			shca, my_qp, &my_qp->ipz_squeue, 0,
+			shca, my_pd, my_qp, &my_qp->ipz_squeue, 0,
 			HAS_RQ(my_qp) ? H_PAGE_REGISTERED : H_SUCCESS,
-			parms.nr_sq_pages, swqe_size,
-			parms.act_nr_send_sges);
+			&parms.squeue, swqe_size);
 		if (ret) {
 			ehca_err(pd->device, "Couldn't initialize squeue "
 				 "and pages  ret=%x", ret);
@@ -639,9 +671,8 @@ static struct ehca_qp *internal_create_qp(
 
 	if (HAS_RQ(my_qp)) {
 		ret = init_qp_queue(
-			shca, my_qp, &my_qp->ipz_rqueue, 1,
-			H_SUCCESS, parms.nr_rq_pages, rwqe_size,
-			parms.act_nr_recv_sges);
+			shca, my_pd, my_qp, &my_qp->ipz_rqueue, 1,
+			H_SUCCESS, &parms.rqueue, rwqe_size);
 		if (ret) {
 			ehca_err(pd->device, "Couldn't initialize rqueue "
 				 "and pages ret=%x", ret);
@@ -671,10 +702,10 @@ static struct ehca_qp *internal_create_qp(
 	}
 
 	init_attr->cap.max_inline_data = 0; /* not supported yet */
-	init_attr->cap.max_recv_sge = parms.act_nr_recv_sges;
-	init_attr->cap.max_recv_wr = parms.act_nr_recv_wqes;
-	init_attr->cap.max_send_sge = parms.act_nr_send_sges;
-	init_attr->cap.max_send_wr = parms.act_nr_send_wqes;
+	init_attr->cap.max_recv_sge = parms.rqueue.act_nr_sges;
+	init_attr->cap.max_recv_wr = parms.rqueue.act_nr_wqes;
+	init_attr->cap.max_send_sge = parms.squeue.act_nr_sges;
+	init_attr->cap.max_send_wr = parms.squeue.act_nr_wqes;
 	my_qp->init_attr = *init_attr;
 
 	/* NOTE: define_apq0() not supported yet */
@@ -708,6 +739,8 @@ static struct ehca_qp *internal_create_qp(
 		resp.ext_type = my_qp->ext_type;
 		resp.qkey = my_qp->qkey;
 		resp.real_qp_num = my_qp->real_qp_num;
+		resp.ipz_rqueue.offset = my_qp->ipz_rqueue.offset;
+		resp.ipz_squeue.offset = my_qp->ipz_squeue.offset;
 		if (HAS_SQ(my_qp))
 			queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue);
 		if (HAS_RQ(my_qp))
@@ -724,11 +757,11 @@ static struct ehca_qp *internal_create_qp(
 
 create_qp_exit4:
 	if (HAS_RQ(my_qp))
-		ipz_queue_dtor(&my_qp->ipz_rqueue);
+		ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue);
 
 create_qp_exit3:
 	if (HAS_SQ(my_qp))
-		ipz_queue_dtor(&my_qp->ipz_squeue);
+		ipz_queue_dtor(my_pd, &my_qp->ipz_squeue);
 
 create_qp_exit2:
 	hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
@@ -1735,9 +1768,9 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
 	}
 
 	if (HAS_RQ(my_qp))
-		ipz_queue_dtor(&my_qp->ipz_rqueue);
+		ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue);
 	if (HAS_SQ(my_qp))
-		ipz_queue_dtor(&my_qp->ipz_squeue);
+		ipz_queue_dtor(my_pd, &my_qp->ipz_squeue);
 	kmem_cache_free(qp_cache, my_qp);
 	return 0;
 }
diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c
index 05c4157..4bc687f 100644
--- a/drivers/infiniband/hw/ehca/ehca_uverbs.c
+++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c
@@ -149,7 +149,7 @@ static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
 			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
 			return ret;
 		}
-		start +=  PAGE_SIZE;
+		start += PAGE_SIZE;
 	}
 	vma->vm_private_data = mm_count;
 	(*mm_count)++;
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index 358796c..fdbfebe 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -52,10 +52,13 @@
 #define H_ALL_RES_QP_ENHANCED_OPS       EHCA_BMASK_IBM(9, 11)
 #define H_ALL_RES_QP_PTE_PIN            EHCA_BMASK_IBM(12, 12)
 #define H_ALL_RES_QP_SERVICE_TYPE       EHCA_BMASK_IBM(13, 15)
+#define H_ALL_RES_QP_STORAGE            EHCA_BMASK_IBM(16, 17)
 #define H_ALL_RES_QP_LL_RQ_CQE_POSTING  EHCA_BMASK_IBM(18, 18)
 #define H_ALL_RES_QP_LL_SQ_CQE_POSTING  EHCA_BMASK_IBM(19, 21)
 #define H_ALL_RES_QP_SIGNALING_TYPE     EHCA_BMASK_IBM(22, 23)
 #define H_ALL_RES_QP_UD_AV_LKEY_CTRL    EHCA_BMASK_IBM(31, 31)
+#define H_ALL_RES_QP_SMALL_SQ_PAGE_SIZE EHCA_BMASK_IBM(32, 35)
+#define H_ALL_RES_QP_SMALL_RQ_PAGE_SIZE EHCA_BMASK_IBM(36, 39)
 #define H_ALL_RES_QP_RESOURCE_TYPE      EHCA_BMASK_IBM(56, 63)
 
 #define H_ALL_RES_QP_MAX_OUTST_SEND_WR  EHCA_BMASK_IBM(0, 15)
@@ -299,6 +302,11 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 		| EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, parms->servicetype)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, parms->sigtype)
+		| EHCA_BMASK_SET(H_ALL_RES_QP_STORAGE, parms->qp_storage)
+		| EHCA_BMASK_SET(H_ALL_RES_QP_SMALL_SQ_PAGE_SIZE,
+				 parms->squeue.page_size)
+		| EHCA_BMASK_SET(H_ALL_RES_QP_SMALL_RQ_PAGE_SIZE,
+				 parms->rqueue.page_size)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING,
 				 !!(parms->ll_comp_flags & LLQP_RECV_COMP))
 		| EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING,
@@ -309,13 +317,13 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 
 	max_r10_reg =
 		EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR,
-			       parms->max_send_wr + 1)
+			       parms->squeue.max_wr + 1)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR,
-				 parms->max_recv_wr + 1)
+				 parms->rqueue.max_wr + 1)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE,
-				 parms->max_send_sge)
+				 parms->squeue.max_sge)
 		| EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE,
-				 parms->max_recv_sge);
+				 parms->rqueue.max_sge);
 
 	r11 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QP_TOKEN, parms->srq_token);
 
@@ -335,17 +343,17 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
 
 	parms->qp_handle.handle = outs[0];
 	parms->real_qp_num = (u32)outs[1];
-	parms->act_nr_send_wqes =
+	parms->squeue.act_nr_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]);
-	parms->act_nr_recv_wqes =
+	parms->rqueue.act_nr_wqes =
 		(u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]);
-	parms->act_nr_send_sges =
+	parms->squeue.act_nr_sges =
 		(u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_SEND_SGE, outs[3]);
-	parms->act_nr_recv_sges =
+	parms->rqueue.act_nr_sges =
 		(u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_RECV_SGE, outs[3]);
-	parms->nr_sq_pages =
+	parms->squeue.queue_size =
 		(u32)EHCA_BMASK_GET(H_ALL_RES_QP_SQUEUE_SIZE_PAGES, outs[4]);
-	parms->nr_rq_pages =
+	parms->rqueue.queue_size =
 		(u32)EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, outs[4]);
 
 	if (ret == H_SUCCESS)
@@ -497,7 +505,7 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle,
 			     const u64 count,
 			     const struct h_galpa galpa)
 {
-	if (count != 1) {
+	if (count > 1) {
 		ehca_gen_err("Page counter=%lx", count);
 		return H_PARAMETER;
 	}
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
index 9606f13..6506501 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
@@ -40,6 +40,11 @@
 
 #include "ehca_tools.h"
 #include "ipz_pt_fn.h"
+#include "ehca_classes.h"
+
+#define PAGES_PER_KPAGE (PAGE_SIZE >> EHCA_PAGESHIFT)
+
+struct kmem_cache *small_qp_cache;
 
 void *ipz_qpageit_get_inc(struct ipz_queue *queue)
 {
@@ -49,7 +54,7 @@ void *ipz_qpageit_get_inc(struct ipz_queue *queue)
 		queue->current_q_offset -= queue->pagesize;
 		ret = NULL;
 	}
-	if (((u64)ret) % EHCA_PAGESIZE) {
+	if (((u64)ret) % queue->pagesize) {
 		ehca_gen_err("ERROR!! not at PAGE-Boundary");
 		return NULL;
 	}
@@ -83,80 +88,195 @@ int ipz_queue_abs_to_offset(struct ipz_queue *queue, u64 addr, u64 *q_offset)
 	return -EINVAL;
 }
 
-int ipz_queue_ctor(struct ipz_queue *queue,
-		   const u32 nr_of_pages,
-		   const u32 pagesize, const u32 qe_size, const u32 nr_of_sg)
+#if PAGE_SHIFT < EHCA_PAGESHIFT
+#error Kernel pages must be at least as large as eHCA pages (4K)!
+#endif
+
+/*
+ * allocate pages for queue:
+ * outer loop allocates whole kernel pages (page aligned) and
+ * inner loop divides a kernel page into smaller hca queue pages
+ */
+static int alloc_queue_pages(struct ipz_queue *queue, const u32 nr_of_pages)
 {
-	int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT;
-	int f;
+	int k, f = 0;
+	u8 *kpage;
 
-	if (pagesize > PAGE_SIZE) {
-		ehca_gen_err("FATAL ERROR: pagesize=%x is greater "
-			     "than kernel page size", pagesize);
-		return 0;
-	}
-	if (!pages_per_kpage) {
-		ehca_gen_err("FATAL ERROR: invalid kernel page size. "
-			     "pages_per_kpage=%x", pages_per_kpage);
-		return 0;
-	}
-	queue->queue_length = nr_of_pages * pagesize;
-	queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
-	if (!queue->queue_pages) {
-		ehca_gen_err("ERROR!! didn't get the memory");
-		return 0;
-	}
-	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
-	/*
-	 * allocate pages for queue:
-	 * outer loop allocates whole kernel pages (page aligned) and
-	 * inner loop divides a kernel page into smaller hca queue pages
-	 */
-	f = 0;
 	while (f < nr_of_pages) {
-		u8 *kpage = (u8 *)get_zeroed_page(GFP_KERNEL);
-		int k;
+		kpage = (u8 *)get_zeroed_page(GFP_KERNEL);
 		if (!kpage)
-			goto ipz_queue_ctor_exit0; /*NOMEM*/
-		for (k = 0; k < pages_per_kpage && f < nr_of_pages; k++) {
-			(queue->queue_pages)[f] = (struct ipz_page *)kpage;
+			goto out;
+
+		for (k = 0; k < PAGES_PER_KPAGE && f < nr_of_pages; k++) {
+			queue->queue_pages[f] = (struct ipz_page *)kpage;
 			kpage += EHCA_PAGESIZE;
 			f++;
 		}
 	}
+	return 1;
 
-	queue->current_q_offset = 0;
+out:
+	for (f = 0; f < nr_of_pages && queue->queue_pages[f];
+	     f += PAGES_PER_KPAGE)
+		free_page((unsigned long)(queue->queue_pages)[f]);
+	return 0;
+}
+
+static int alloc_small_queue_page(struct ipz_queue *queue, struct ehca_pd *pd)
+{
+	int order = ilog2(queue->pagesize) - 9;
+	struct ipz_small_queue_page *page;
+	unsigned long bit;
+
+	mutex_lock(&pd->lock);
+
+	if (!list_empty(&pd->free[order]))
+		page = list_entry(pd->free[order].next,
+				  struct ipz_small_queue_page, list);
+	else {
+		page = kmem_cache_zalloc(small_qp_cache, GFP_KERNEL);
+		if (!page)
+			goto out;
+
+		page->page = get_zeroed_page(GFP_KERNEL);
+		if (!page->page) {
+			kmem_cache_free(small_qp_cache, page);
+			goto out;
+		}
+
+		list_add(&page->list, &pd->free[order]);
+	}
+
+	bit = find_first_zero_bit(page->bitmap, IPZ_SPAGE_PER_KPAGE >> order);
+	__set_bit(bit, page->bitmap);
+	page->fill++;
+
+	if (page->fill == IPZ_SPAGE_PER_KPAGE >> order)
+		list_move(&page->list, &pd->full[order]);
+
+	mutex_unlock(&pd->lock);
+
+	queue->queue_pages[0] = (void *)(page->page | (bit << (order + 9)));
+	queue->small_page = page;
+	return 1;
+
+out:
+	ehca_err(pd->ib_pd.device, "failed to allocate small queue page");
+	return 0;
+}
+
+static void free_small_queue_page(struct ipz_queue *queue, struct ehca_pd *pd)
+{
+	int order = ilog2(queue->pagesize) - 9;
+	struct ipz_small_queue_page *page = queue->small_page;
+	unsigned long bit;
+	int free_page = 0;
+
+	bit = ((unsigned long)queue->queue_pages[0] & ~PAGE_MASK)
+		>> (order + 9);
+
+	mutex_lock(&pd->lock);
+
+	__clear_bit(bit, page->bitmap);
+	page->fill--;
+
+	if (page->fill == 0) {
+		list_del(&page->list);
+		free_page = 1;
+	}
+
+	if (page->fill == (IPZ_SPAGE_PER_KPAGE >> order) - 1)
+		/* the page was full until we freed the chunk */
+		list_move_tail(&page->list, &pd->free[order]);
+
+	mutex_unlock(&pd->lock);
+
+	if (free_page) {
+		free_page(page->page);
+		kmem_cache_free(small_qp_cache, page);
+	}
+}
+
+int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
+		   const u32 nr_of_pages, const u32 pagesize,
+		   const u32 qe_size, const u32 nr_of_sg,
+		   int is_small)
+{
+	if (pagesize > PAGE_SIZE) {
+		ehca_gen_err("FATAL ERROR: pagesize=%x "
+			     "is greater than kernel page size", pagesize);
+		return 0;
+	}
+
+	/* init queue fields */
+	queue->queue_length = nr_of_pages * pagesize;
+	queue->pagesize = pagesize;
 	queue->qe_size = qe_size;
 	queue->act_nr_of_sg = nr_of_sg;
-	queue->pagesize = pagesize;
+	queue->current_q_offset = 0;
 	queue->toggle_state = 1;
-	return 1;
+	queue->small_page = NULL;
 
- ipz_queue_ctor_exit0:
-	ehca_gen_err("Couldn't get alloc pages queue=%p f=%x nr_of_pages=%x",
-		     queue, f, nr_of_pages);
-	for (f = 0; f < nr_of_pages; f += pages_per_kpage) {
-		if (!(queue->queue_pages)[f])
-			break;
-		free_page((unsigned long)(queue->queue_pages)[f]);
+	/* allocate queue page pointers */
+	queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
+	if (!queue->queue_pages) {
+		ehca_gen_err("Couldn't allocate queue page list");
+		return 0;
 	}
+	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
+
+	/* allocate actual queue pages */
+	if (is_small) {
+		if (!alloc_small_queue_page(queue, pd))
+			goto ipz_queue_ctor_exit0;
+	} else
+		if (!alloc_queue_pages(queue, nr_of_pages))
+			goto ipz_queue_ctor_exit0;
+
+	return 1;
+
+ipz_queue_ctor_exit0:
+	ehca_gen_err("Couldn't alloc pages queue=%p "
+		 "nr_of_pages=%x",  queue, nr_of_pages);
+	vfree(queue->queue_pages);
+
 	return 0;
 }
 
-int ipz_queue_dtor(struct ipz_queue *queue)
+int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue)
 {
-	int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT;
-	int g;
-	int nr_pages;
+	int i, nr_pages;
 
 	if (!queue || !queue->queue_pages) {
 		ehca_gen_dbg("queue or queue_pages is NULL");
 		return 0;
 	}
-	nr_pages = queue->queue_length / queue->pagesize;
-	for (g = 0; g < nr_pages; g += pages_per_kpage)
-		free_page((unsigned long)(queue->queue_pages)[g]);
+
+	if (queue->small_page)
+		free_small_queue_page(queue, pd);
+	else {
+		nr_pages = queue->queue_length / queue->pagesize;
+		for (i = 0; i < nr_pages; i += PAGES_PER_KPAGE)
+			free_page((unsigned long)queue->queue_pages[i]);
+	}
+
 	vfree(queue->queue_pages);
 
 	return 1;
 }
+
+int ehca_init_small_qp_cache(void)
+{
+	small_qp_cache = kmem_cache_create("ehca_cache_small_qp",
+					   sizeof(struct ipz_small_queue_page),
+					   0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+	if (!small_qp_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void ehca_cleanup_small_qp_cache(void)
+{
+	kmem_cache_destroy(small_qp_cache);
+}
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index 39a4f64..c6937a0 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -51,11 +51,25 @@
 #include "ehca_tools.h"
 #include "ehca_qes.h"
 
+struct ehca_pd;
+struct ipz_small_queue_page;
+
 /* struct generic ehca page */
 struct ipz_page {
 	u8 entries[EHCA_PAGESIZE];
 };
 
+#define IPZ_SPAGE_PER_KPAGE (PAGE_SIZE / 512)
+
+struct ipz_small_queue_page {
+	unsigned long page;
+	unsigned long bitmap[IPZ_SPAGE_PER_KPAGE / BITS_PER_LONG];
+	int fill;
+	void *mapped_addr;
+	u32 mmap_count;
+	struct list_head list;
+};
+
 /* struct generic queue in linux kernel virtual memory (kv) */
 struct ipz_queue {
 	u64 current_q_offset;	/* current queue entry */
@@ -66,7 +80,8 @@ struct ipz_queue {
 	u32 queue_length;	/* queue length allocated in bytes */
 	u32 pagesize;
 	u32 toggle_state;	/* toggle flag - per page */
-	u32 dummy3;		/* 64 bit alignment */
+	u32 offset; /* save offset within page for small_qp */
+	struct ipz_small_queue_page *small_page;
 };
 
 /*
@@ -188,9 +203,10 @@ struct ipz_qpt {
  * see ipz_qpt_ctor()
  * returns true if ok, false if out of memory
  */
-int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages,
-		   const u32 pagesize, const u32 qe_size,
-		   const u32 nr_of_sg);
+int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
+		   const u32 nr_of_pages, const u32 pagesize,
+		   const u32 qe_size, const u32 nr_of_sg,
+		   int is_small);
 
 /*
  * destructor for a ipz_queue_t
@@ -198,7 +214,7 @@ int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages,
  *  see ipz_queue_ctor()
  *  returns true if ok, false if queue was NULL-ptr of free failed
  */
-int ipz_queue_dtor(struct ipz_queue *queue);
+int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue);
 
 /*
  * constructor for a ipz_qpt_t,
-- 
1.5.2
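
Editorial sketch, not part of the patch: the small-queue code above carves one kernel page into chunks of 512 << order bytes and tracks them in a per-page bitmap. The user-space illustration below assumes a 4 KiB page; the names struct small_page, chunk_alloc, chunk_free and the SKETCH_* macros are hypothetical and do not exist in the ehca driver. It leaves out the locking and the free/full lists that the driver keeps per protection domain.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u                             /* assumed page size */
#define SKETCH_SPAGE_PER_KPAGE (SKETCH_PAGE_SIZE / 512u)   /* 8 slots of 512 bytes */

/* Hypothetical stand-in for the per-page bookkeeping structure. */
struct small_page {
	uintptr_t base;    /* page-aligned base address */
	uint32_t bitmap;   /* one bit per chunk currently in use */
	int fill;          /* number of chunks currently in use */
};

/* order 0 = 512-byte chunks, 1 = 1024, 2 = 2048. Returns 0 if the page is full. */
static uintptr_t chunk_alloc(struct small_page *p, int order)
{
	unsigned int nchunks, bit;

	assert(order >= 0 && order <= 2);
	nchunks = SKETCH_SPAGE_PER_KPAGE >> order;
	for (bit = 0; bit < nchunks; bit++) {
		if (!(p->bitmap & (1u << bit))) {
			p->bitmap |= 1u << bit;
			p->fill++;
			/* chunk address = page base plus bit * chunk size */
			return p->base | ((uintptr_t)bit << (order + 9));
		}
	}
	return 0;
}

/* Release a chunk previously returned by chunk_alloc() with the same order. */
static void chunk_free(struct small_page *p, int order, uintptr_t addr)
{
	/* the bit index comes from the offset inside the page, not the page address */
	uintptr_t offset = addr & (SKETCH_PAGE_SIZE - 1);
	unsigned int bit = (unsigned int)(offset >> (order + 9));

	assert(bit < (SKETCH_SPAGE_PER_KPAGE >> order));
	assert(p->bitmap & (1u << bit));
	p->bitmap &= ~(1u << bit);
	p->fill--;
}
```

With order 1 (1024-byte chunks) a page holds four chunks, so the second allocation lands 0x400 bytes after the first, and freeing the first chunk makes its slot the next one handed out.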


From mst at dev.mellanox.co.il  Fri Jul 20 07:17:59 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 20 Jul 2007 17:17:59 +0300
Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch
In-Reply-To: <469FB3AB.6080304@linux.vnet.ibm.com>
References: <469FB3AB.6080304@linux.vnet.ibm.com>
Message-ID: <20070720141759.GF31246@mellanox.co.il>

> @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>  	attr.recv_cq = priv->cq;
>  	attr.srq = priv->cm.srq;
>  	attr.cap.max_send_wr = ipoib_sendq_size;
> +	attr.cap.max_recv_wr = 1;
>  	attr.cap.max_send_sge = 1;
> +	attr.cap.max_recv_sge = 1;
>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>  	attr.qp_type = IB_QPT_RC;
>  	attr.send_cq = cq;

You never post a receive WR on this QP, do you?
So
1. What's magic about 1 as max recv wr? Why not 0?
2. If the remote sends a packet on this QP, it'll get closed,
   won't it? Looks like a spec violation.


-- 
MST


From davem at systemfabricworks.com  Fri Jul 20 08:09:50 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Fri, 20 Jul 2007 10:09:50 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/scripts: Handle new and old
 topology file format
Message-ID: <46A0D03E.mail35T1S2JP6@systemfabricworks.com>


   Fix infiniband-diags scripts to handle changed ibnetdiscover topology file
   format and remain backward compatible with old file format.

Signed-off-by: David A. McMillen <davem at systemfabricworks.com>
---
 infiniband-diags/scripts/ibcheckerrors.in      |    4 +++-
 infiniband-diags/scripts/ibchecknet.in         |    4 +++-
 infiniband-diags/scripts/ibcheckstate.in       |    4 +++-
 infiniband-diags/scripts/ibcheckwidth.in       |    4 +++-
 infiniband-diags/scripts/ibclearcounters.in    |    4 +++-
 infiniband-diags/scripts/ibclearerrors.in      |    4 +++-
 infiniband-diags/scripts/ibdatacounters.in     |    4 +++-
 7 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in
index 8a6c012..e08eba3 100644
--- a/infiniband-diags/scripts/ibcheckerrors.in
+++ b/infiniband-diags/scripts/ibcheckerrors.in
@@ -91,13 +91,15 @@ function check_node(lid)
 		nports++
 		port = $1
 		if (!nodechecked) {
-			lid = $5
+			lid = substr($0, index($0, " lid ") + 5)
+			lid = substr(lid, 1, index(lid, " ") - 1)
 			check_node(lid)
 		}
 		if (badnode) {
 			print "\n# " ntype ": nodeguid 0x" nodeguid " failed"
 			next
 		}
+		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (nodeerr)
 			if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " " port)) {
diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in
index 3154d9e..9f36742 100644
--- a/infiniband-diags/scripts/ibchecknet.in
+++ b/infiniband-diags/scripts/ibchecknet.in
@@ -84,13 +84,15 @@ function check_node(lid)
 		nports++
 		port = $1
 		if (!nodechecked) {
-			lid = $5
+			lid = substr($0, index($0, " lid ") + 5)
+			lid = substr(lid, 1, index(lid, " ") - 1)
 			check_node(lid)
 		}
 		if (badnode) {
 			print "\n# " ntype ": nodeguid 0x" nodeguid " failed"
 			next
 		}
+		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (system("'$IBPATH'/ibcheckport '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in
index 9268670..30b5513 100644
--- a/infiniband-diags/scripts/ibcheckstate.in
+++ b/infiniband-diags/scripts/ibcheckstate.in
@@ -83,13 +83,15 @@ function check_node(lid)
 		nports++
 		port = $1
 		if (!nodechecked) {
-			lid = $5
+			lid = substr($0, index($0, " lid ") + 5)
+			lid = substr(lid, 1, index(lid, " ") - 1)
 			check_node(lid)
 		}
 		if (badnode) {
 			print "\n# " ntype ": nodeguid 0x" nodeguid " failed"
 			next
 		}
+		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (system("'$IBPATH'/ibcheckportstate '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in
index 7a8e7e0..072d433 100644
--- a/infiniband-diags/scripts/ibcheckwidth.in
+++ b/infiniband-diags/scripts/ibcheckwidth.in
@@ -83,13 +83,15 @@ function check_node(lid)
 		nports++
 		port = $1
 		if (!nodechecked) {
-			lid = $5
+			lid = substr($0, index($0, " lid ") + 5)
+			lid = substr(lid, 1, index(lid, " ") - 1)
 			check_node(lid)
 		}
 		if (badnode) {
 			print "\n# " ntype ": nodeguid 0x" nodeguid " failed"
 			next
 		}
+		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (system("'$IBPATH'/ibcheckportwidth '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in
index fa6ab83..54551b3 100644
--- a/infiniband-diags/scripts/ibclearcounters.in
+++ b/infiniband-diags/scripts/ibclearcounters.in
@@ -73,9 +73,11 @@ function clear_port_counters(lid, port)
 
 /^\[/   {
 			port = $1
+			sub("\\(.*\\)", "", port)
 			gsub("[\\[\\]]", "", port)
 			if (!nodecleared) {
-				lid = $5
+				lid = substr($0, index($0, " lid ") + 5)
+				lid = substr(lid, 1, index(lid, " ") - 1)
 				clear_port_counters(lid, port)
 			}
 		}
diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in
index bce8f83..4a086ae 100644
--- a/infiniband-diags/scripts/ibclearerrors.in
+++ b/infiniband-diags/scripts/ibclearerrors.in
@@ -66,9 +66,11 @@ function clear_errors(lid, port)
 
 /^\[/   {
 			port = $1
+			sub("\\(.*\\)", "", port)
 			gsub("[\\[\\]]", "", port)
 			if (!nodecleared) {
-				lid = $5
+				lid = substr($0, index($0, " lid ") + 5)
+				lid = substr(lid, 1, index(lid, " ") - 1)
 				clear_errors(lid, port)
 			}
 		}
diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in
index ce8c71a..d27149e 100644
--- a/infiniband-diags/scripts/ibdatacounters.in
+++ b/infiniband-diags/scripts/ibdatacounters.in
@@ -91,13 +91,15 @@ function check_node(lid)
 		nports++
 		port = $1
 		if (!nodechecked) {
-			lid = $5
+			lid = substr($0, index($0, " lid ") + 5)
+			lid = substr(lid, 1, index(lid, " ") - 1)
 			check_node(lid)
 		}
 		if (badnode) {
 			print "\n# " ntype ": nodeguid 0x" nodeguid " failed"
 			next
 		}
+		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (nodeerr)
 			if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " " port)) {

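Editorial sketch, not part of the patch: the awk changes above stop relying on a fixed field number and instead locate the " lid " keyword, then strip an optional parenthesised part and the square brackets from the port field. The same two steps are shown here in C for illustration. The function names extract_lid and clean_port are hypothetical, and the sample line used in the usage note is made up, not copied from real ibnetdiscover output. One small difference from the awk code: if nothing follows the LID on the line, this sketch still returns the token, whereas the awk version would yield an empty string.

```c
#include <stddef.h>
#include <string.h>

/* Copy the token following " lid " into out. Returns 0 on success, -1 if absent. */
static int extract_lid(const char *line, char *out, size_t outsz)
{
	const char *p = strstr(line, " lid ");
	size_t len;

	if (!p || outsz == 0)
		return -1;
	p += 5;                    /* skip over " lid " */
	len = strcspn(p, " ");     /* token ends at the next space or end of line */
	if (len >= outsz)
		len = outsz - 1;   /* truncate rather than overflow */
	memcpy(out, p, len);
	out[len] = '\0';
	return 0;
}

/* Reduce "[1](something)" or "[1]" to "1", in place. */
static void clean_port(char *port)
{
	char *open = strchr(port, '(');
	char *close = strrchr(port, ')');
	char *src, *dst;

	/* drop everything from the first '(' to the last ')' */
	if (open && close && close > open)
		memmove(open, close + 1, strlen(close + 1) + 1);

	/* drop the square brackets */
	for (src = dst = port; *src; src++)
		if (*src != '[' && *src != ']')
			*dst++ = *src;
	*dst = '\0';
}
```

For a made-up line such as `[1](abcdef) "H-1"[2] # lid 4 lmc 0`, extract_lid yields "4" and clean_port turns the first field into "1", whether or not the parenthesised part is present.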

From davem at systemfabricworks.com  Fri Jul 20 08:12:03 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Fri, 20 Jul 2007 10:12:03 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover: Fix DDR link
	speed decode
Message-ID: <46A0D0C3.mail37511HD5H@systemfabricworks.com>


   Fix ibnetdiscover DDR link speed decode by moving string from [3] to [2].

Signed-off-by: David A. McMillen <davem at systemfabricworks.com>
---
 infiniband-diags/src/ibnetdiscover.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index c321d59..ccd70cb 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -78,8 +78,8 @@ static char *linkwidth_str[] = {
 static char *linkspeed_str[] = {
 	"???",
 	"SDR",
-	"???",
 	"DDR",
+	"???",
 	"QDR"
 };
 

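Editorial sketch, not part of the patch: the link speed field is bit-encoded (1 = SDR at 2.5 Gbps, 2 = DDR at 5 Gbps, 4 = QDR at 10 Gbps), so a string table indexed directly by the raw value needs "DDR" at index 2 and a placeholder at index 3. The helper name linkspeed_name is hypothetical; the bounds check is an addition for illustration and is not claimed to match what ibnetdiscover does.

```c
#include <stddef.h>

/* Index is the raw bit-encoded speed value; 0 and 3 are not valid single speeds. */
static const char *const linkspeed_names[] = {
	"???",  /* 0: no state change / unknown */
	"SDR",  /* 1: 2.5 Gbps */
	"DDR",  /* 2: 5.0 Gbps */
	"???",  /* 3: not a single active speed */
	"QDR"   /* 4: 10.0 Gbps */
};

/* Map a raw speed value to a label, falling back to "???" when out of range. */
static const char *linkspeed_name(unsigned int speed)
{
	size_t n = sizeof(linkspeed_names) / sizeof(linkspeed_names[0]);

	return speed < n ? linkspeed_names[speed] : "???";
}
```

With this layout a DDR port (value 2) decodes as "DDR", which is the result the one-line swap in the patch is after.
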
From shemminger at linux-foundation.org  Fri Jul 20 09:22:03 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Fri, 20 Jul 2007 17:22:03 +0100
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <46A09ACF.20805@trash.net>
References: <OF3743242C.0BC81AC3-ON6525731E.00397CC1-6525731E.00399244@in.ibm.com>
	<46A09ACF.20805@trash.net>
Message-ID: <20070720172203.0eaeea86@oldman>

On Fri, 20 Jul 2007 13:21:51 +0200
Patrick McHardy <kaber at trash.net> wrote:

> Krishna Kumar2 wrote:
> > Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:37:20 PM:
> >
> >
> >   
> >> rtnetlink support seems more important than sysfs to me.
> >>     
> >
> > Thanks, I will add that as a patch. The reason to add to sysfs is that
> > it is easier to change for a user (and similar to tx_queue_len).
> >   
> 

But since batching is so similar to TSO, it really should be part of the
flags and controlled by ethtool like other offload flags.


From pradeeps at linux.vnet.ibm.com  Fri Jul 20 09:31:12 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 20 Jul 2007 09:31:12 -0700
Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch
In-Reply-To: <20070720141759.GF31246@mellanox.co.il>
References: <469FB3AB.6080304@linux.vnet.ibm.com>
	<20070720141759.GF31246@mellanox.co.il>
Message-ID: <46A0E350.5060207@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>  	attr.recv_cq = priv->cq;
>>  	attr.srq = priv->cm.srq;
>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>> +	attr.cap.max_recv_wr = 1;
>>  	attr.cap.max_send_sge = 1;
>> +	attr.cap.max_recv_sge = 1;
>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>  	attr.qp_type = IB_QPT_RC;
>>  	attr.send_cq = cq;
> 
> You never post a receive WR on this QP, do you?
> So
> 1. What's magic about 1 as max recv wr? Why not 0?
> 2. If the remote sends a packet on this QP, it'll get closed,
>    won't it? Looks like a spec violation.
> 
> 
Good catch. I can probably set max_recv_sge to 0 too, right?
I can do that in a separate patch later on.
However, I see nothing in table 46 of the IB spec that tells me
that it is a violation of the spec. Which section are you
referring to?

Pradeep


From xma at us.ibm.com  Fri Jul 20 09:39:24 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 20 Jul 2007 09:39:24 -0700
Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2?
Message-ID: <OFCBCCD0A4.5D427786-ON8725731E.005B401C-8825731E.002F8BA2@us.ibm.com>


      I downloaded OFED-1.2.tgz. It doesn't include the source code openib-*tgz
as OFED-1.1 did. Where can I find the source code without installing any RPMs?

Thanks
Shirley

From sri at us.ibm.com  Fri Jul 20 10:25:05 2007
From: sri at us.ibm.com (Sridhar Samudrala)
Date: Fri, 20 Jul 2007 10:25:05 -0700
Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <20070720063216.26341.80316.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063216.26341.80316.sendpatchset@localhost.localdomain>
Message-ID: <1184952305.12431.16.camel@localhost.localdomain>

On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote:
> Networking include file changes for batching.
> 
> Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
> ---
>  linux/netdevice.h |   10 ++++++++++
>  net/pkt_sched.h   |    6 +++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> --- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
> @@ -264,6 +264,8 @@ enum netdev_state_t
>  	__LINK_STATE_QDISC_RUNNING,
>  };
> 
> +/* Minimum length of device hardware queue for batching to work */
> +#define MIN_QUEUE_LEN_BATCH	16
> 
>  /*
>   * This structure holds at boot time configured netdevice settings. They
> @@ -340,6 +342,7 @@ struct net_device
>  #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
>  #define NETIF_F_GSO		2048	/* Enable software GSO. */
>  #define NETIF_F_LLTX		4096	/* LockLess TX */
> +#define NETIF_F_BATCH_SKBS	8192	/* Driver supports batch skbs API */
>  #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
> 
>  	/* Segmentation offload features */
> @@ -452,6 +455,8 @@ struct net_device
>  	struct Qdisc		*qdisc_sleeping;
>  	struct list_head	qdisc_list;
>  	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
> +	unsigned long		xmit_slots;	/* Device free slots */
> +	struct sk_buff_head	*skb_blist;	/* List of batch skbs */
> 
>  	/* Partially transmitted GSO packet. */
>  	struct sk_buff		*gso_skb;
> @@ -472,6 +477,9 @@ struct net_device
>  	void			*priv;	/* pointer to private data	*/
>  	int			(*hard_start_xmit) (struct sk_buff *skb,
>  						    struct net_device *dev);
> +	int			(*hard_start_xmit_batch) (struct net_device
> +							  *dev);
> +
>  	/* These may be needed for future network-power-down code. */
>  	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
> 
> @@ -832,6 +840,8 @@ extern int		dev_set_mac_address(struct n
>  					    struct sockaddr *);
>  extern int		dev_hard_start_xmit(struct sk_buff *skb,
>  					    struct net_device *dev);
> +extern int		dev_add_skb_to_blist(struct sk_buff *skb,
> +					     struct net_device *dev);
> 
>  extern void		dev_init(void);
> 
> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> --- org/include/net/pkt_sched.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/net/pkt_sched.h	2007-07-20 08:30:22.000000000 +0530
> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
>  		struct rtattr *tab);
>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> 
> -extern void __qdisc_run(struct net_device *dev);
> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);

Why do we need this additional 'blist' argument?
Is this different from dev->skb_blist?

> 
> -static inline void qdisc_run(struct net_device *dev)
> +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
>  {
>  	if (!netif_queue_stopped(dev) &&
>  	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> -		__qdisc_run(dev);
> +		__qdisc_run(dev, blist);
>  }
> 
>  extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,


From sri at us.ibm.com  Fri Jul 20 10:44:19 2007
From: sri at us.ibm.com (Sridhar Samudrala)
Date: Fri, 20 Jul 2007 10:44:19 -0700
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <20070720063227.26341.91868.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063227.26341.91868.sendpatchset@localhost.localdomain>
Message-ID: <1184953459.12431.21.camel@localhost.localdomain>

On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote:
> Changes in dev.c to support batching : add dev_add_skb_to_blist,
> register_netdev recognizes batch aware drivers, and net_tx_action is
> the sole user of batching.
> 
> Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
> ---
>  dev.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 74 insertions(+), 3 deletions(-)
> 
> diff -ruNp org/net/core/dev.c new/net/core/dev.c
> --- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
> +++ new/net/core/dev.c	2007-07-20 08:31:35.000000000 +0530

<snip>

> @@ -1566,7 +1605,7 @@ gso:
>  			/* reset queue_mapping to zero */
>  			skb->queue_mapping = 0;
>  			rc = q->enqueue(skb, q);
> -			qdisc_run(dev);
> +			qdisc_run(dev, NULL);

OK. So you are passing a NULL blist here. However, I am
not sure why batching is not used in this situation.

Thanks
Sridhar

>  			spin_unlock(&dev->queue_lock);
> 
>  			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
> @@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq
>  			clear_bit(__LINK_STATE_SCHED, &dev->state);
> 
>  			if (spin_trylock(&dev->queue_lock)) {
> -				qdisc_run(dev);
> +				/*
> +				 * Try to send out all skbs if batching is
> +				 * enabled.
> +				 */
> +				qdisc_run(dev, dev->skb_blist);
>  				spin_unlock(&dev->queue_lock);
>  			} else {
>  				netif_schedule(dev);
> @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
>  		}
>  	}
> 
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		if (!dev->hard_start_xmit_batch ||
> +		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
> +			/*
> +			 * Batch TX requires API support in driver plus have
> +			 * a minimum sized queue.
> +			 */
> +			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
> +					"since no API support or queue len "
> +					"is smaller than %d.\n",
> +					dev->name, MIN_QUEUE_LEN_BATCH);
> +			dev->features &= ~NETIF_F_BATCH_SKBS;
> +		} else {
> +			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> +						 GFP_KERNEL);
> +			if (dev->skb_blist) {
> +				skb_queue_head_init(dev->skb_blist);
> +				dev->tx_queue_len >>= 1;
> +			}
> +		}
> +	}
> +
>  	/*
>  	 *	nil rebuild_header routine,
>  	 *	that should be never called and used as just bug trap.
> @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev
> 
>  	synchronize_net();
> 
> +	/* Deallocate batching structure */
> +	if (dev->skb_blist) {
> +		skb_queue_purge(dev->skb_blist);
> +		kfree(dev->skb_blist);
> +		dev->skb_blist = NULL;
> +	}
> +
>  	/* Shutdown queueing discipline. */
>  	dev_shutdown(dev);
> 
> -
>  	/* Notify protocols, that we are about to destroy
>  	   this device. They should clean all the things.
>  	*/


From ttelford.groups at gmail.com  Fri Jul 20 10:51:49 2007
From: ttelford.groups at gmail.com (Troy Telford)
Date: Fri, 20 Jul 2007 11:51:49 -0600
Subject: [ofa-general] OFED Release tarballs
Message-ID: <200707201151.50178.ttelford.groups@gmail.com>

I've been tracking OFED development for quite a while, pulling the occasional 
snapshot from the git repositories.

And after looking at what is in the git repositories, and then comparing it to 
what is in the OFED release (in this case the 1.2 release), I've been unable 
to find how one gets from what's released in the various git repositories to 
what you get in the OFED release.

The main reason is because before OFED 1.2 was 'released', I was building 
experimental RPMs from the git sources, and more than anything, I'd like to 
know how the OFED distribution went from what's in the git repositories to 
its final release state; simply checking out the OFED-1.2 tag doesn't seem to 
be sufficient to get everything I see in the official release.

Obviously, I've already looked inside the src.rpm's; many of the files that 
are 'missing' in git are generated by ./autogen.sh in the various git 
repositories, so I'm less concerned with those differences.

But there are a few things that I haven't been able to find so far (at least 
in the repositories named 'ofed_1_2/<foo>')

Are there other repositories that have 'stuff' that made it into the OFED 1.2 
distribution that aren't from the 'ofed_1_2/*' repositories?
-- 
Troy Telford


From kaber at trash.net  Fri Jul 20 11:16:36 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 20:16:36 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <20070720063249.26341.125.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063249.26341.125.sendpatchset@localhost.localdomain>
Message-ID: <46A0FC04.1000006@trash.net>

Krishna Kumar wrote:
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> +			  struct sk_buff_head *blist,
> +			  struct sk_buff **skbp)
> +{
> +	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
> +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> +	} else {
> +		int max = dev->tx_queue_len - skb_queue_len(blist);


I'm assuming the driver will simply leave excess packets in the
blist for the next run. The check for tx_queue_len is wrong though;
it's only a default which can be overridden, and some qdiscs don't
care about it at all.

> +		struct sk_buff *skb;
> +
> +		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> +			max -= dev_add_skb_to_blist(skb, dev);
> +
> +		*skbp = NULL;
> +		return 1;	/* we have atleast one skb in blist */
> +	}
> +}


> -void __qdisc_run(struct net_device *dev)
> +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)


And the patches should really be restructured so this change is
in the same patch changing the header and the caller, for example.

>  {
>  	do {
> -		if (!qdisc_restart(dev))
> +		if (!qdisc_restart(dev, blist))
>  			break;
>  	} while (!netif_queue_stopped(dev));
>  


From pradeeps at linux.vnet.ibm.com  Fri Jul 20 12:00:33 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 20 Jul 2007 12:00:33 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit
In-Reply-To: <469EFCD5.5050800@linux.vnet.ibm.com>
References: <469E4CA2.2040708@linux.vnet.ibm.com> <adahco1i9yk.fsf@cisco.com>
	<469EB694.7040408@linux.vnet.ibm.com> <adaps2pgdxx.fsf@cisco.com>
	<469EFCD5.5050800@linux.vnet.ibm.com>
Message-ID: <46A10651.1060205@linux.vnet.ibm.com>

Pradeep Satyanarayana wrote:
> Roland Dreier wrote:
>>  > They are not quite the same. How about:
>>  > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE))
>>
>> That makes sense.
>>
>>  > >  > -		.event_handler = ipoib_cm_rx_event_handler,
>>  > > 
>>  > > why?  seems harmless to just leave this alone for all QPs even if an
>>  > > SRQ isn't attached.
>>  > 
>>  > If memory serves me right, I tried that and ran into some inexplicable problems.
>>  > Maybe it was a hang or no traffic went through; I don't exactly recollect what it was.
>>  > After this change the problem went away.
>>
>> Umm... I would like to get to the root cause of that.  Because as far
>> as I can see there is no problem if the event handler is called for a
>> non-SRQ QP.  The event will never be "last WQE reached" (since only a
>> QP attached to an SRQ can generate that) and so the event handler will
>> just return immediately and do nothing.
> 
> Since I do not recollect what the issue was, it might require some investigation,
> especially since we have a short window for the merge. Would it be okay if I submit a
> patch without this for the merge? Subsequently I will submit a patch to address this issue.
> 
> Pradeep
> 

There appear to be no problems with the 2.6.22 git tree if I leave the event_handler the same
for all QPs. However, I see some ehca initialization errors with a slightly older kernel.
I will work with the ehca folks (in Germany) to track this down and let you know.

Pradeep


From sashak at voltaire.com  Fri Jul 20 14:11:06 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 21 Jul 2007 00:11:06 +0300
Subject: [ofa-general] latest libipathverbs.git tree
In-Reply-To: <000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com>
References: <20070719220905.GN12489@bauxite.pathscale.com>
	<000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com>
Message-ID: <20070720211106.GN16597@sashak.voltaire.com>

On 15:13 Thu 19 Jul     , Sean Hefty wrote:
> Jeff/Vlad,
> 
> Do either of you know the missing step to adding Ralph's git tree to the http
> view?  (See below.)

I did. Actually, a symbolic link to Ralph's scm directory was needed (for
gitweb):

  ln -s ~ralphc/scm /pub/scm/'~ralphc'

Sasha

> 
> - Sean
> 
> >> I believe if you create /home/ralphc/public_html directory, and place
> >symbolic
> >> links in it to the git tree, then it will be visible on
> >> http://www.openfabrics.org/git.  I don't remember if additional setup on the
> >> server is required.
> >
> >thanks, i tried it, but it doesn't seem to be sufficient...
> >
> >arthur


From sweitzen at cisco.com  Fri Jul 20 15:20:03 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 20 Jul 2007 15:20:03 -0700
Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2?
In-Reply-To: <OFCBCCD0A4.5D427786-ON8725731E.005B401C-8825731E.002F8BA2@us.ibm.com>
References: <OFCBCCD0A4.5D427786-ON8725731E.005B401C-8825731E.002F8BA2@us.ibm.com>
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303E34766@xmb-sjc-216.amer.cisco.com>

Use rpm2cpio | cpio -iv to extract the source tarballs from the source
RPMs.
 
The OFED 1.2 structure is less confusing, because the 1.1 openib-*.tgz
file was never actually used to compile the code; it was only a
redundant copy of the code.
 
Scott


________________________________

	From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma
	Sent: Friday, July 20, 2007 9:39 AM
	To: openib-general at openib.org
	Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2?
	
	
	I downloaded OFED-1.2.tgz. It doesn't include the openib-*.tgz source
code as OFED-1.1 did. Where can I find the source code without
installing any RPMs?
	
	Thanks
	Shirley


From xma at us.ibm.com  Fri Jul 20 15:56:24 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 20 Jul 2007 15:56:24 -0700
Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2?
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA303E34766@xmb-sjc-216.amer.cisco.com>
Message-ID: <OF2791B478.A19E1131-ON8725731E.007E5791-8825731E.00520FB4@us.ibm.com>


Thanks, Scott, for the tip.

Shirley

From rdreier at cisco.com  Fri Jul 20 20:32:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 20:32:52 -0700
Subject: [ofa-general] is ipath_layer.c dead code?
In-Reply-To: <20070719183249.GA20240@bauxite.pathscale.com> (Arthur Jones's
	message of "Thu, 19 Jul 2007 11:32:49 -0700")
References: <ada1wf8w8gp.fsf@cisco.com>
	<20070719183249.GA20240@bauxite.pathscale.com>
Message-ID: <adawswu8m8r.fsf@cisco.com>

thanks, applied.  I did indeed miss the header file being dead too.

BTW...

 > The failed attempt to get ipath_ether upstream was the final nail in the coffin

I don't think that the attempt to get ipath_ether upstream was ever
that vigorous -- I don't see much demand for it, but if you guys feel
that it has advantages for users then I wouldn't rule out merging it.

 - R.


From rdreier at cisco.com  Fri Jul 20 20:41:23 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 20:41:23 -0700
Subject: [ofa-general] Re: [PATCH] IB/mthca: change command token on timeout
In-Reply-To: <20070719112849.GK24018@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 19 Jul 2007 14:28:49 +0300")
References: <20070719112849.GK24018@mellanox.co.il>
Message-ID: <adamyxq8luk.fsf@cisco.com>

thanks, I applied this and also did the same thing for mlx4:

commit 8a7bc1f72356a1f7dc67a168067c3942e8db395a
Author: Roland Dreier <rolandd at cisco.com>
Date:   Fri Jul 20 20:39:31 2007 -0700

    mlx4_core: Change command token on timeout
    
    The FW command token is currently only updated on a command completion
    event. This means that on command timeout, the same token will be
    reused for new command, which results in a mess if the timed out
    command *does* eventually complete.
    
    This is the same change as the patch for mthca from Michael
    S. Tsirkin <mst at dev.mellanox.co.il> that was just merged.  It seems
    sensible to avoid gratuitous differences in FW command processing
    between mthca and mlx4.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/net/mlx4/cmd.c b/drivers/net/mlx4/cmd.c
index c1f81a9..5d791e4 100644
--- a/drivers/net/mlx4/cmd.c
+++ b/drivers/net/mlx4/cmd.c
@@ -246,8 +246,6 @@ void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param)
 	context->result    = mlx4_status_to_errno(status);
 	context->out_param = out_param;
 
-	context->token += priv->cmd.token_mask + 1;
-
 	complete(&context->done);
 }
 
@@ -264,6 +262,7 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 	spin_lock(&cmd->context_lock);
 	BUG_ON(cmd->free_head < 0);
 	context = &cmd->context[cmd->free_head];
+	context->token += priv->cmd.token_mask + 1;
 	cmd->free_head = context->next;
 	spin_unlock(&cmd->context_lock);
 
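
For readers following the token change, here is a minimal user-space
sketch (not the mlx4 code) of why bumping the token when a context is
handed out, rather than when a completion arrives, protects against a
late completion of a timed-out command. The names cmd_ctx, ctx_init,
ctx_alloc and ctx_complete, and the 4-slot table, are assumptions made
up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CTX    4u               /* hypothetical table size (power of 2) */
#define TOKEN_MASK (NUM_CTX - 1u)   /* low bits of a token select the slot */

struct cmd_ctx {
	uint16_t token;   /* slot index in the low bits, generation above */
	bool     done;
};

static struct cmd_ctx ctx_table[NUM_CTX];

/* Give every slot a token whose low bits equal its index. */
void ctx_init(void)
{
	for (unsigned int i = 0; i < NUM_CTX; i++) {
		ctx_table[i].token = (uint16_t)i;
		ctx_table[i].done = false;
	}
}

/* Hand out a slot for a new command. The generation is bumped here, so a
 * slot reused after a timeout never carries the timed-out token. */
uint16_t ctx_alloc(unsigned int slot)
{
	struct cmd_ctx *c;

	assert(slot < NUM_CTX);
	c = &ctx_table[slot];
	c->token = (uint16_t)(c->token + TOKEN_MASK + 1u);
	c->done = false;
	return c->token;
}

/* Completion event: accept it only if it carries the current token. */
bool ctx_complete(uint16_t token)
{
	struct cmd_ctx *c = &ctx_table[token & TOKEN_MASK];

	if (c->token != token)
		return false;   /* stale completion of a timed-out command */
	c->done = true;
	return true;
}
```

In this model, if the first command on a slot times out and the slot
is reused, a late completion carrying the first token no longer
matches and is ignored.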

From rdreier at cisco.com  Fri Jul 20 20:55:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 20:55:39 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq
	case
In-Reply-To: <20070719161543.GC31246@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 19 Jul 2007 19:15:44 +0300")
References: <20070719094039.GF24018@mellanox.co.il>
	<adar6n4fo2w.fsf@cisco.com> <20070719161543.GC31246@mellanox.co.il>
Message-ID: <adavece76mc.fsf@cisco.com>

(BTW, the kmalloc(0) crash should be fixed in Linus's latest git)

 > the bug in error handling is real though, isn't it?

yes, quite right.  I queued this up:

commit 597869e4dafbb05a69f571e5109f06245807ed6c
Author: Roland Dreier <rolandd at cisco.com>
Date:   Fri Jul 20 20:54:30 2007 -0700

    IB/mlx4: Fix error path in create_qp_common()
    
    The error handling code at err_wrid in create_qp_common() does not
    handle a userspace QP attached to an SRQ correctly, since it ends up
    in the else clause of the if statement.  This means it tries to
    kfree() the uninitialized qp->sq.wrid and qp->rq.wrid pointers.  Fix
    this so we only free the wrid arrays for kernel QPs.
    
    Pointed out by Michael S. Tsirkin <mst at dev.mellanox.co.il>.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 5456bc4..f6315df 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -415,9 +415,11 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_wrid:
-	if (pd->uobject && !init_attr->srq)
-		mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db);
-	else {
+	if (pd->uobject) {
+		if (!init_attr->srq)
+			mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context),
+					      &qp->db);
+	} else {
 		kfree(qp->sq.wrid);
 		kfree(qp->rq.wrid);
 	}
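
As a way to check the cases by hand, the decision made at err_wrid
can be modelled as a small pure function. This is only a sketch of
the branch structure in the diff above; the flag names and the two
boolean parameters are illustrative stand-ins for pd->uobject and
init_attr->srq, not driver code.

```c
#include <stdbool.h>

/* Cleanup actions the err_wrid path may take (illustrative flags). */
enum {
	CLEAN_UNMAP_USER_DB = 1 << 0,   /* undo the userspace doorbell mapping */
	CLEAN_FREE_WRID     = 1 << 1    /* kfree() the sq/rq wrid arrays */
};

/* Logic after the fix.
 * is_user: QP was created from userspace (pd->uobject is set)
 * has_srq: QP is attached to an SRQ (init_attr->srq is set) */
int err_wrid_actions(bool is_user, bool has_srq)
{
	if (is_user)
		return has_srq ? 0 : CLEAN_UNMAP_USER_DB;
	return CLEAN_FREE_WRID;
}

/* Logic before the fix, kept for comparison. */
int err_wrid_actions_old(bool is_user, bool has_srq)
{
	if (is_user && !has_srq)
		return CLEAN_UNMAP_USER_DB;
	return CLEAN_FREE_WRID;
}
```

The two functions differ only for a userspace QP attached to an SRQ:
the old logic falls into the kfree() branch, the new one does nothing.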


From rdreier at cisco.com  Fri Jul 20 21:02:11 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:02:11 -0700
Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V8] patch
In-Reply-To: <469FB3AB.6080304@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Thu, 19 Jul 2007 11:55:39 -0700")
References: <469FB3AB.6080304@linux.vnet.ibm.com>
Message-ID: <adalkda76bg.fsf@cisco.com>

I just noticed another bug here I think:

here you search up to max_rc_qp:

 > +	for (index = 0; index < max_rc_qp; index++)
 > +		if (priv->cm.rx_index_table[index] == NULL)
 > +			break;

but here

 > +		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
 > +					 sizeof *priv->cm.rx_index_table,
 > +					 GFP_KERNEL);

the table is allocated with a fixed size of NOSRQ_INDEX_TABLE_SIZE.
(BTW, kcalloc might be slightly preferred here, since you are actually
allocating an array).

If max_rc_qp is going to be a module parameter:

 > +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);

(and, I just noticed, one you allow to be changed at runtime ?!)
then rx_index_table has to be allocated with the right size.
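
To make the sizing point concrete, here is a user-space sketch, with
calloc() standing in for kcalloc(). The struct and function names are
made up for illustration and are not from the patch; the idea is just
that the table remembers the size it was allocated with, and searches
are bounded by that size rather than by a tunable that may change
later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct rx_table {
	void  **slots;    /* one entry per receive QP, NULL when free */
	size_t  nslots;   /* size fixed at allocation time */
};

/* Allocate the table from a snapshot of the tunable. Returns 0 on
 * success, -1 on a bad size or allocation failure. */
int rx_table_init(struct rx_table *t, int max_rc_qp)
{
	assert(t != NULL);
	if (max_rc_qp <= 0)
		return -1;
	/* calloc() zeroes the array and checks the n * size multiplication */
	t->slots = calloc((size_t)max_rc_qp, sizeof *t->slots);
	if (!t->slots)
		return -1;
	t->nslots = (size_t)max_rc_qp;
	return 0;
}

/* Find a free index, bounded by the allocated size.
 * Returns -1 when the table is full. */
int rx_table_find_free(const struct rx_table *t)
{
	for (size_t i = 0; i < t->nslots; i++)
		if (t->slots[i] == NULL)
			return (int)i;
	return -1;
}

void rx_table_free(struct rx_table *t)
{
	free(t->slots);
	t->slots = NULL;
	t->nslots = 0;
}
```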

 - R.


From rdreier at cisco.com  Fri Jul 20 21:07:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:07:13 -0700
Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs
In-Reply-To: <200707201601.52277.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Fri, 20 Jul 2007 16:01:51 +0200")
References: <200707201601.52277.hnguyen@linux.vnet.ibm.com>
Message-ID: <adabqe67632.fsf@cisco.com>

I applied this, but I agree with checkpatch.pl:

 > WARNING: externs should be avoided in .c files
 > #227: FILE: drivers/infiniband/hw/ehca/ehca_mrmw.c:67:
 > +extern int ehca_mr_largepage;
 > 
 > WARNING: externs should be avoided in .c files
 > #949: FILE: drivers/infiniband/hw/ehca/hcp_if.c:753:
 > +	extern int ehca_debug_level;

if you need to use a variable in more than one .c file, put the extern
declaration in a common header that's included everywhere you use the
variable, including the .c file that it is defined in.  That way the
compiler can see if you get confused about the type of the variable.
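
A single-file sketch of that layout, with comments marking where each
piece would live. The header name "ehca_params.h" and the helper
debug_enabled() are hypothetical; only the variable name comes from
the warning above.

```c
/* ---- shared header, e.g. a hypothetical "ehca_params.h" ---- */
extern int ehca_debug_level;   /* declaration, seen by every user */

/* ---- the one .c file that owns the variable (includes the header) ---- */
int ehca_debug_level = 0;      /* definition; if its type disagreed with
                                * the declaration above, this would fail
                                * to compile */

/* ---- any other .c file that includes the header ---- */
int debug_enabled(void)
{
	return ehca_debug_level > 0;
}
```

With the extern written by hand inside a second .c file instead, the
compiler never sees both the declaration and the definition together,
so a type mismatch between them would go unnoticed.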

When you get a chance, please post a follow-on patch to fix this.

 - R.


From rdreier at cisco.com  Fri Jul 20 21:12:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:12:52 -0700
Subject: [ofa-general] Re: [PATCH 2/5] ehca: Generate event when SRQ limit
	reached
In-Reply-To: <200707201602.19142.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Fri, 20 Jul 2007 16:02:18 +0200")
References: <200707201602.19142.hnguyen@linux.vnet.ibm.com>
Message-ID: <ada7iou75tn.fsf@cisco.com>

thanks, applied.

BTW, does your SRQ-capable hardware support generating the "last WQE
reached" event?  There's not any reliable way to avoid problems when
destroying QPs attached to an SRQ without it, and the IB spec requires
CAs that support SRQs to generate it (o11-5.2.5 in chapter 11 of vol 1).

I don't see any code in ehca to generate the event, and IPoIB CM at
least will be very unhappy when using SRQs if the event is not
generated.

 - R.


From rdreier at cisco.com  Fri Jul 20 21:14:28 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:14:28 -0700
Subject: [ofa-general] Re: [PATCH 3/5] ehca: Make ehca2ib_return_code()
	non-inline
In-Reply-To: <200707201602.46415.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Fri, 20 Jul 2007 16:02:46 +0200")
References: <200707201602.46415.hnguyen@linux.vnet.ibm.com>
Message-ID: <ada3azi75qz.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Fri Jul 20 21:20:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:20:49 -0700
Subject: [ofa-general] [PATCH 5/5] ehca: Support small QP queues
In-Reply-To: <200707201604.17991.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Fri, 20 Jul 2007 16:04:17 +0200")
References: <200707201604.17991.hnguyen@linux.vnet.ibm.com>
Message-ID: <adawswu5qvy.fsf@cisco.com>

thanks, applied.  I fixed this up myself to work with commit 20c2df83,
which got rid of the destructor argument to kmem_cache_create() -- you
probably want to check my tree to make sure it's OK.

Also the same as I said before about checkpatch.pl's warning:

WARNING: externs should be avoided in .c files
#337: FILE: drivers/infiniband/hw/ehca/ehca_pd.c:91:
+	extern struct kmem_cache *small_qp_cache;

please fix that up when you get a chance


From kliteyn at mellanox.co.il  Fri Jul 20 21:43:03 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 21 Jul 2007 07:43:03 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-21:normal completion
Message-ID: <MTLEXCH01OAeNPHsWoc00001f49@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From rdreier at cisco.com  Fri Jul 20 21:54:27 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:54:27 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default
In-Reply-To: <20070719112155.GJ24018@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 19 Jul 2007 14:21:55 +0300")
References: <20070719112155.GJ24018@mellanox.co.il>
Message-ID: <adaejj24arg.fsf@cisco.com>

 > -	mlx4_enable_msi_x(dev);
 > -
 >  	if (mlx4_cmd_init(dev)) {
 >  		mlx4_err(dev, "Failed to init command interface, aborting.\n");
 >  		goto err_free_dev;
 >  	}
 >  
 > +	mlx4_enable_msi_x(dev);

Why this change?  I don't see anything in mlx4_cmd_init() that seems
to matter in terms of coming before or after enabling MSI-X.

 >  	err = mlx4_init_hca(dev);
 > +	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
 > +		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
 > +		dev->flags &= ~MLX4_FLAG_MSI_X;
 > +		pci_disable_msix(pdev);
 > +		err = mlx4_init_hca(dev);
 > +	}
 > +
 >  	if (err)
 >  		goto err_cmd;
 >  
 > +	mlx4_enable_msi_x(dev);
 > +
 >  	err = mlx4_setup_hca(dev);

Have you actually tested this on a system where MSI-X fails?  Because
I don't see how it could work-- we don't actually try interrupts until
mlx4_setup_hca() (in fact we don't even create any EQs until then).
So I don't see how mlx4_init_hca() could tell if MSI-X is OK...

 - R.


From rdreier at cisco.com  Fri Jul 20 21:56:20 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 20 Jul 2007 21:56:20 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <ada4pjy4aob.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get another small batch of changes for 2.6.23:

Arthur Jones (1):
      IB/ipath: Remove ipath_layer dead code

Florin Malita (1):
      IB/mlx4: Fix leaks in __mlx4_ib_modify_qp

Hoang-Nam Nguyen (3):
      IB/ehca: Support large page MRs
      IB/ehca: Generate async event when SRQ limit reached
      IB/ehca: Move ehca2ib_return_code() out of line

Joachim Fenkes (1):
      IB/ehca: Make internal_create/destroy_qp() static

Michael S. Tsirkin (1):
      IB/mthca: Change command token on timeout

Roland Dreier (2):
      mlx4_core: Change command token on timeout
      IB/mlx4: Fix error path in create_qp_common()

Stefan Roscher (1):
      IB/ehca: Support small QP queues

 drivers/infiniband/hw/ehca/ehca_classes.h |   50 +++--
 drivers/infiniband/hw/ehca/ehca_cq.c      |    8 +-
 drivers/infiniband/hw/ehca/ehca_eq.c      |    8 +-
 drivers/infiniband/hw/ehca/ehca_irq.c     |   42 +++-
 drivers/infiniband/hw/ehca/ehca_main.c    |   49 ++++-
 drivers/infiniband/hw/ehca/ehca_mrmw.c    |  371 ++++++++++++++++++++++++-----
 drivers/infiniband/hw/ehca/ehca_mrmw.h    |    2 +-
 drivers/infiniband/hw/ehca/ehca_pd.c      |   25 ++-
 drivers/infiniband/hw/ehca/ehca_qp.c      |  178 ++++++++------
 drivers/infiniband/hw/ehca/ehca_tools.h   |   19 +--
 drivers/infiniband/hw/ehca/ehca_uverbs.c  |    2 +-
 drivers/infiniband/hw/ehca/hcp_if.c       |   50 +++-
 drivers/infiniband/hw/ehca/ipz_pt_fn.c    |  222 +++++++++++++----
 drivers/infiniband/hw/ehca/ipz_pt_fn.h    |   26 ++-
 drivers/infiniband/hw/ipath/Makefile      |    1 -
 drivers/infiniband/hw/ipath/ipath_layer.c |  365 ----------------------------
 drivers/infiniband/hw/ipath/ipath_layer.h |   71 ------
 drivers/infiniband/hw/ipath/ipath_verbs.h |    2 -
 drivers/infiniband/hw/mlx4/qp.c           |   20 +-
 drivers/infiniband/hw/mthca/mthca_cmd.c   |    3 +-
 drivers/net/mlx4/cmd.c                    |    3 +-
 21 files changed, 802 insertions(+), 715 deletions(-)
 delete mode 100644 drivers/infiniband/hw/ipath/ipath_layer.c
 delete mode 100644 drivers/infiniband/hw/ipath/ipath_layer.h


From HNGUYEN at de.ibm.com  Sat Jul 21 01:22:54 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Sat, 21 Jul 2007 10:22:54 +0200
Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs
In-Reply-To: <adabqe67632.fsf@cisco.com>
Message-ID: <OFDBEFA17C.9FE35C61-ONC125731F.002DE742-C125731F.002E0843@de.ibm.com>

Hi Roland!
> I applied this, but I agree with checkpatch.pl:
>
>  > WARNING: externs should be avoided in .c files
>  > #227: FILE: drivers/infiniband/hw/ehca/ehca_mrmw.c:67:
>  > +extern int ehca_mr_largepage;
>  >
>  > WARNING: externs should be avoided in .c files
>  > #949: FILE: drivers/infiniband/hw/ehca/hcp_if.c:753:
>  > +   extern int ehca_debug_level;
>
> if you need to use a variable in more than one .c file, put the extern
> declaration in a common header that's included everywhere you use the
> variable, including the .c file that it is defined in.  That way the
> compiler can see if you get confused about the type of the variable.
That's true.
> When you get a chance, please post a follow-on patch to fix this.
Sure thing. Will do that for rc2.
Thanks!
Nam


From vlad at lists.openfabrics.org  Sat Jul 21 01:38:32 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 21 Jul 2007 01:38:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070721-0100 daily build status
Message-ID: <20070721083832.AFC96E60838@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From krkumar2 at in.ibm.com  Fri Jul 20 23:46:30 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sat, 21 Jul 2007 12:16:30 +0530
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <20070720172203.0eaeea86@oldman>
Message-ID: <OFFAA4E4C4.F1496A54-ON6525731F.0025194E-6525731F.0025379C@in.ibm.com>

Stephen Hemminger <shemminger at linux-foundation.org> wrote on 07/20/2007
09:52:03 PM:
> Patrick McHardy <kaber at trash.net> wrote:
>
> > Krishna Kumar2 wrote:
> > > Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:37:20 PM:
> > >
> > >
> > >
> > >> rtnetlink support seems more important than sysfs to me.
> > >>
> > >
> > > Thanks, I will add that as a patch. The reason to add to sysfs is that
> > > it is easier to change for a user (and similar to tx_queue_len).
> > >
> >
>
> But since batching is so similar to TSO, it really should be part of the
> flags and controlled by ethtool like other offload flags.

So should I add all three interfaces (or which ones) :

      1. /sys (like for tx_queue_len)
      2. netlink
      3. ethtool.

Or are only 2 & 3 enough?

thanks,

- KK


From krkumar2 at in.ibm.com  Fri Jul 20 23:44:12 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sat, 21 Jul 2007 12:14:12 +0530
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <1184953459.12431.21.camel@localhost.localdomain>
Message-ID: <OF2F4CD3AD.13BD72EC-ON6525731F.00242872-6525731F.002501AC@in.ibm.com>

Hi Sridhar,

Sridhar Samudrala <sri at us.ibm.com> wrote on 07/20/2007 11:14:19 PM:

> > @@ -1566,7 +1605,7 @@ gso:
> >           /* reset queue_mapping to zero */
> >           skb->queue_mapping = 0;
> >           rc = q->enqueue(skb, q);
> > -         qdisc_run(dev);
> > +         qdisc_run(dev, NULL);
>
> OK. So you are passing a NULL blist here. However, i am
> not sure why batching is not used in this situation.

Actually it could be used, but in most cases there will be only
one skb. If I pass the blist here, the result (for batching
case) will be to put one single skb into the blist and call
the new xmit API. That wastes cycles as we take a skb out
from the queue (as in regular code) and then add it to the
blist (different in the new code) and then the driver has to
remove this skb from the blist (different in the new code).
I could try batching but then require there are more than
1 skbs before adding to the blist (or the blist doesn't already
have skbs, in which case adding even one skb makes sense). Also,
it will have a slight impact for regular drivers where for each
xmit, one extra dereference for dev->skb_blist (which is always
NULL) is made, which was another reason to always pass NULL.

I will check what the results are by passing blist here
too and make the above change. I will run tests for that (as
well as NETPERF RR test as asked by Evgeniy).

Thanks,

- KK


From krkumar2 at in.ibm.com  Fri Jul 20 23:56:23 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sat, 21 Jul 2007 12:26:23 +0530
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <46A0FC04.1000006@trash.net>
Message-ID: <OF206CB1B7.31D1C39B-ON6525731F.00254E4C-6525731F.00261F1A@in.ibm.com>

Hi Patrick,

Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 11:46:36 PM:

> Krishna Kumar wrote:
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +           struct sk_buff_head *blist,
> > +           struct sk_buff **skbp)
> > +{
> > +   if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
> > +      return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +   } else {
> > +      int max = dev->tx_queue_len - skb_queue_len(blist);
>
>
> I'm assuming the driver will simply leave excess packets in the
> blist for the next run.

Yes, and the next run will be scheduled even if no more xmits are called,
because qdisc_restart()'s call to the driver returns either:
      BUSY : driver failed to send all, net_tx_action will handle this later
             (the case you mentioned)
      OK   : and qlen is > 0, return 1 and __qdisc_run() will retry (where
             blist len will become zero as driver processed EVERYTHING on
             blist)

> The check for tx_queue_len is wrong though,
> it's only a default which can be overridden and some qdiscs don't
> care for it at all.

I think it should not matter whether qdiscs use this or not, or even if it
is modified (unless it is made zero in which case this breaks). The
intention behind this check is to make sure that not more than tx_queue_len
skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
the blist can become too large and breaks the idea of tx_queue_len. Is that
a good justification ?
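
To spell out the arithmetic behind that check, here is a small
stand-alone sketch. It models only the "max" computation from the
quoted hunk, capped by what the qdisc actually holds; the function and
parameter names are illustrative, and the guard for a zero or
already-exceeded tx_queue_len is an assumption added for the sketch.

```c
/* How many skbs one pass may move from the qdisc onto the batch list so
 * that the batch list does not grow past tx_queue_len. */
unsigned int batch_budget(unsigned int tx_queue_len,
			  unsigned int blist_len,
			  unsigned int qdisc_len)
{
	unsigned int room;

	if (blist_len >= tx_queue_len)   /* also covers tx_queue_len == 0 */
		return 0;
	room = tx_queue_len - blist_len;
	return qdisc_len < room ? qdisc_len : room;
}
```

For example, with tx_queue_len = 10 and 8 skbs already on the batch
list, at most 2 more are moved regardless of how many are queued.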

> > -void __qdisc_run(struct net_device *dev)
> > +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
>
>
> And the patches should really be restructured so this change is
> in the same patch changing the header and the caller, for example.

Ah, OK.

Thanks,

- KK


From krkumar2 at in.ibm.com  Sat Jul 21 00:24:08 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sat, 21 Jul 2007 12:54:08 +0530
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <OF206CB1B7.31D1C39B-ON6525731F.00254E4C-6525731F.00261F1A@LocalDomain>
Message-ID: <OFF2224CCE.8045DF8B-ON6525731F.00287359-6525731F.0028A9BC@in.ibm.com>

Krishna Kumar2/India/IBM wrote on 07/21/2007 12:26:23 PM:

> Hi Patrick,
>
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 11:46:36 PM:
>
> > The check for tx_queue_len is wrong though,
> > it's only a default which can be overridden and some qdiscs don't
> > care for it at all.

> I think it should not matter whether qdiscs use this or not, or even if it
> is modified (unless it is made zero in which case this breaks). The
> intention behind this check is to make sure that not more than tx_queue_len
> skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
> the blist can become too large and breaks the idea of tx_queue_len. Is that
> a good justification ?

Also, if tx_queue_len is set to zero, I think my code will not execute and
the existing code will break at rc = q->enqueue() (for sched's checking queue
limits).


From krkumar2 at in.ibm.com  Fri Jul 20 23:30:10 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sat, 21 Jul 2007 12:00:10 +0530
Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <1184952305.12431.16.camel@localhost.localdomain>
Message-ID: <OF46F8CFAE.90A18F30-ON6525731F.0023182A-6525731F.0023B8DF@in.ibm.com>

Hi Sridhar,

Sridhar Samudrala <sri at us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
> > diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> > --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
> > +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
> > @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
> >        struct rtattr *tab);
> >  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> >
> > -extern void __qdisc_run(struct net_device *dev);
> > +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
>
> Why do we need this additional 'blist' argument?
> Is this different from dev->skb_blist?

It is the same, but I want to call it mostly with NULL and rarely with the
batch list pointer (so it is related to your other question). My original
code didn't have this and was trying batching in all cases. But in most
xmit's (probably almost all), there will be only one packet in the queue to
send and batching will never happen. When there is a lock contention or if
the queue is stopped, then the next iteration will find >1 packets. But I
still will try no batching for the lock failure case as there will probably be
2 packets (one from previous time and 1 from this time, or 3 if two failures,
etc), and try batching only when queue was stopped from net_tx_action (this
was based on Dave Miller's idea).

Thanks,

- KK


From vlad at lists.openfabrics.org  Sat Jul 21 02:43:54 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 21 Jul 2007 02:43:54 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070721-0200 daily build status
Message-ID: <20070721094354.1086BE60873@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From hadi at cyberus.ca  Sat Jul 21 06:18:41 2007
From: hadi at cyberus.ca (jamal)
Date: Sat, 21 Jul 2007 09:18:41 -0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <1185023921.5192.45.camel@localhost>


I am (have been) under extreme travel mode - so i will have high latency
in follow ups.

On Fri, 2007-20-07 at 12:01 +0530, Krishna Kumar wrote:
> Hi Dave, Roland, everyone,
> 
> In May, I had proposed creating an API for sending 'n' skbs to a driver to
> reduce lock overhead, DMA operations, and specific to drivers that have
> completion notification like IPoIB - reduce completion handling ("[RFC] New
> driver API to speed up small packets xmits" @
> http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
> initial test results for E1000 which showed minor improvements (but also
> got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
> 

Add to that context that i have been putting out patches on this over
the last 3+ years, as well as several public presentations, the last one
being: http://vger.kernel.org/jamal_netconf2006.sxi

My main problem (and the obstacle to submitting the patches) has been a
result of not doing the appropriate testing - i had been testing the
forwarding path (in all my results post the latest patches) when i
should really have been testing the improvement of the tx path.

> There is a parallel WIP by Jamal but the two implementations are completely
> different since the code bases from the start were separate. Key changes:
> 	- Use a single qdisc interface to avoid code duplication and reduce
> 	  maintainability (sch_generic.c size reduces by ~9%).
> 	- Has per device configurable parameter to turn on/off batching.
> 	- qdisc_restart gets slightly modified while looking simple without
> 	  any checks for batching vs regular code (infact only two lines have
> 	  changed - 1. instead of dev_dequeue_skb, a new batch-aware function
> 	  is called; and 2. an extra call to hard_start_xmit_batch.

> 	- No change in__qdisc_run other than a new argument (from DM's idea).
> 	- Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

All the above are cosmetic differences. To me the highest priority
is making sure that batching is useful and what the limitations are.
At some point, when all looks good - i dont mind adding an ethtool
interface to turn off/on batching, merge with the new qdisc restart path
instead of having a parallel path, solicit feedback on naming, where to
allocate structs etc etc. All that is low prio if batching across a
variety of hardware and applications doesnt prove useful. At the moment,
i am unsure theres enough consistency to justify pushing batching in.

Having said that, below are the main architectural differences we have,
which is what we really need to discuss to see what proves useful:

>         - Batching algo/processing is different (eg. if
>           qdisc_restart() finds
> 	  one skb in the batch list, it will try to batch more (upto a limit)
> 	  instead of sending that out and batching the rest in the next call.

This sounds a little more aggressive but maybe useful.
I have experimented with setting upper bound limits (current patches
have a pktgen interface to set the max to send) and have concluded that
it is unneeded. Probing by letting the driver tell you what space is
available has proven to be the best approach. I have been meaning to
remove the code in pktgen which allows these limits.
 
> 	- Jamal's code has a separate hw prep handler called from the stack,
> 	  and results are accessed in driver during xmit later.

I have explained the reasoning for this a few times. A recent response to
Michael Chan is here:
http://marc.info/?l=linux-netdev&m=118346921316657&w=2
And heres a response to you that i havent heard back on:
http://marc.info/?l=linux-netdev&m=118355539503924&w=2

My tests so far indicate this interface is useful. It doesnt apply well
to some drivers (for example i dont use it in tun) - which makes it
optional but useful nevertheless. I will be more than happy to kill this
if i can find cases where it proves to be a bad idea.

> 	- Jamal's code has dev->xmit_win which is cached by the driver. Mine
> 	  has dev->xmit_slots but this is used only by the driver while the
> 	  core has a different mechanism to find how many skbs to batch.

This is related to the first item.

> 	- Completely different structure/design & coding styles.
> (This patch will work with drivers updated by Jamal, Matt & Michael Chan with
> minor modifications - rename xmit_win to xmit_slots & rename batch handler)

Again, cosmetics (and indication you are morphing towards me).

So if i was to sum this up (it would be useful to have a discussion on
these), the real differences are:

a) you have an extra check on refilling the skb list when you find that
it has a single skb. I tagged this as being potentially useful.
b) You have a check for some upper bound on the number of skbs to send
to the driver. I tagged this as unnecessary - the interface is still on
in my current code, so it shouldnt be hard to show one way or other.
c) You dont have prep_xmit()

Add to that list any other architectural differences i may have missed
and lets discuss and hopefully make some good progress.

cheers,
jamal


From hadi at cyberus.ca  Sat Jul 21 06:46:19 2007
From: hadi at cyberus.ca (jamal)
Date: Sat, 21 Jul 2007 09:46:19 -0400
Subject: [ofa-general] TCP and batching WAS(Re: [PATCH 00/10] Implement
	batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720081848.7cc652fb@oldman>
Message-ID: <1185025579.5192.68.camel@localhost>

On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:

> You may see worse performance with batching in the real world when
> running over WAN's.  Like TSO, batching will generate back to back packet
> trains that are subject to multi-packet synchronized loss. 

Has someone done any study on TSO effect? Doesnt ECN with a RED router
help on something like this?
I find it surprising that a single flow doing TSO would overwhelm a
router's buffer. I actually think the value of batching as far as TCP is
concerned is proportional to the number of flows, i.e. the more flows you
have the more batching you will end up doing. And if TCP's fairness is
the legend talk it has been made to be, then i dont see this as
problematic.

BTW, something i noticed with regard to GSO when testing batching:
For TCP packets slightly above MDU (upto 2K), GSO gives worse
performance than non-GSO. This actually has nothing to do with batching;
it works the same way with or without the batching changes.

Another oddity:
Looking at the flow rate from a purely packets/second (I know thats a
router centric view, but i found it strange nevertheless) - you see that
as packet size goes up, the pps also goes up. I tried mucking around
with nagle etc, but saw no observable changes. Any insight?
My expectation was that the pps would stay at least the same or get
better with smaller packets (assuming theres less data to push around).

cheers,
jamal


From mst at dev.mellanox.co.il  Sat Jul 21 12:48:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 21 Jul 2007 22:48:51 +0300
Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch
In-Reply-To: <46A0E350.5060207@linux.vnet.ibm.com>
References: <469FB3AB.6080304@linux.vnet.ibm.com>
	<20070720141759.GF31246@mellanox.co.il>
	<46A0E350.5060207@linux.vnet.ibm.com>
Message-ID: <20070721194851.GA20438@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOB CM (NOSRQ) [PATCH V8] patch
> 
> Michael S. Tsirkin wrote:
> >> @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_
> >>  	attr.recv_cq = priv->cq;
> >>  	attr.srq = priv->cm.srq;
> >>  	attr.cap.max_send_wr = ipoib_sendq_size;
> >> +	attr.cap.max_recv_wr = 1;
> >>  	attr.cap.max_send_sge = 1;
> >> +	attr.cap.max_recv_sge = 1;
> >>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> >>  	attr.qp_type = IB_QPT_RC;
> >>  	attr.send_cq = cq;
> > 
> > You never post a receive WR on this QP, do you?
> > So
> > 1. What's magic about 1 as max recv wr? Why not 0?
> > 2. If the remote sends a packet on this QP, it'llget closed,
> >    won't it? Looks like a spec violation.
> > 
> > 
> Good catch. I can probably set max_recv_sge to 0 too -right?
> I can do that in a separate patch later on.
> However, I see nothing in table 46 of the IB spec that tells me
> that it is a violation of the spec. Which section are you
> referring to?

The IPoIB RFC.

-- 
MST


From kliteyn at dev.mellanox.co.il  Sat Jul 21 15:07:50 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 22 Jul 2007 01:07:50 +0300
Subject: [ofa-general] QoS RFC
Message-ID: <46A283B6.1070105@dev.mellanox.co.il>

Hi All

Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
Your comments are welcome.

-- Yevgeny

               RFC: OpenFabrics Enhancements for QoS Support
              ===============================================

Authors: . Eitan Zahavi <eitan at mellanox.co.il>
Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
Date: .... Jul 2007.
Revision:  0.2

Table of contents:
1. Overview
2. Architecture
3. Supported Policy
4. IPoIB
5. CMA features
6. SDP
7. SRP
8. iSER
9. OpenSM features

1. Overview
------------
Quality of Service requirements stem from I/O consolidation over an IB
network: as multiple applications and ULPs share the same fabric, a means
to control their use of the network resources becomes a must. The basic
need is to differentiate the service levels provided to different traffic
flows, so that a policy can be enforced to control each flow's utilization
of the fabric resources.

The IBTA specification defines several hardware features and management
interfaces to support QoS:
* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
* Arbitration between traffic of different VLs is performed by a weighted
   round robin arbiter with 2 priority levels. The arbiter is programmable
   with a sequence of (VL, weight) pairs and a maximal number of high
   priority credits to be processed before low priority is served
* Packets carry a class of service marking in the range 0 to 15 in their
   header SL field
* Each switch can map an incoming packet by its SL to a particular output
   VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
* The Subnet Administrator controls the parameters of each communication
   flow by providing them in response to Path Record (PR) or
   MultiPathRecord (MPR) queries

The IB QoS features provide the means to implement a DiffServ-like architecture.
The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in
highly dynamic fabrics.

This proposal provides the detailed functional definition for the various
software elements that are required to enable a DiffServ-like architecture over
the OpenFabrics software stack.


2. Architecture
----------------
This proposal splits the QoS functionality between the SM/SA, CMA and the various
ULPs. We take the "chronology approach" to describe how the overall system
works:

2.1. The network manager (human) provides a set of rules (policy) that defines
how the network is configured and how its resources are split among different
QoS-Levels. The policy also defines how to decide which QoS-Level each
application, ULP or service uses.

2.2. The SM analyzes the provided policy to see if it is realizable and performs
the necessary fabric setup. The SM may continuously monitor the policy and adapt
to changes in it. Part of this policy defines the default QoS-Level of each
partition. The SA is being enhanced to match the requested Source, Destination,
QoS-Class, Service-ID (and optionally SL and priority) against the policy, so
clients (ULPs, programs) can obtain a policy-enforced QoS. The SM is also
enhanced to support setting up partitions with an appropriate IPoIB broadcast
group. This broadcast group carries its QoS attributes: SL, MTU and
RATE.

2.3. IPoIB is set up. IPoIB uses the SL, MTU and RATE available on the
multicast group which forms the broadcast group of this partition.

2.4. MPI, which provides non-IB-based connection management, should be configured
to run using hard-coded SLs. It uses these SLs for every QP being opened.

2.5. ULPs that use the CM interface (like SRP) should have their own pre-assigned
Service-ID and use it while obtaining PR/MPR for establishing connections.
The SA receiving the PR/MPR should match it against the policy and return
the appropriate PR/MPR including SL, MTU and RATE.

2.6. ULPs and programs using CMA to establish an RC connection should provide the
CMA with the target IP and Service-ID. Some of the ULPs might also provide a
QoS-Class (e.g. for SDP sockets that are provided the TOS socket option). The CMA
should then use the provided Service-ID and optional QoS-Class and pass them in
the PR/MPR request. The resulting PR/MPR should be used for configuring the
connection QP.

PathRecord and MultiPathRecord enhancement for QoS:
As mentioned above, the PathRecord and MultiPathRecord attributes should be
enhanced to carry the Service-ID (a 64-bit value), which has been
standardized by the IBTA. A new QoS-Class field is also provided.
A new capability bit should describe the SM QoS support in the SA class port
info. This approach provides an easy migration path for existing access layer
and ULPs by not introducing a new set of PR/MPR attributes.


3. Supported Policy
--------------------

The QoS policy supported by this proposal is divided into 4 sub sections:

I) Port Group: a set of CAs, Routers or Switches that share the same settings.
A port group might be a partition defined by the partition manager policy in
terms of GUIDs. Future implementations might provide support for NodeDescription
based definition of port groups.

II) Fabric Setup:
Defines how the SL2VL and VLArb tables should be set up. This policy definition
assumes the computation of overall end-to-end network behavior is performed
outside of OpenSM.

III) QoS-Levels Definition:
This section defines the possible sets of parameters for QoS that a client
might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).

IV) Matching Rules:
A list of rules that match an incoming PR/MPR request to a QoS-Level. The
rules are processed in order, and the first match is applied. Each rule is
built out of a set of match expressions which should all match for the rule to
apply. The matching expressions are defined for the following fields:
** SRC and DST to lists of port groups
** Service-ID to a list of Service-ID or Service-ID ranges
** QoS-Class to a list of QoS-Class values or ranges

QoS Policy file syntax

* Leading and trailing blanks, as well as empty lines, are ignored, so the
   indentation in the example is just for better readability
* Comments start with the pound sign (#) and are terminated by EOL
* Comments may appear only on a separate line
* Keywords that denote section/subsection start have matching closing keywords
* Any keyword should be the first non-blank on the line
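
The syntax rules above could be applied to each input line before any
keyword is examined. The following is only a minimal sketch under that
assumption; qos_policy_strip_line() is a hypothetical helper name and is
not taken from OpenSM.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): strips leading and trailing
 * blanks from `line` in place. Returns a pointer to the first non-blank
 * character, or NULL when the line is empty or is a comment line.
 */
char *qos_policy_strip_line(char *line)
{
	char *end;

	/* Skip leading blanks, so indentation carries no meaning. */
	while (*line && isspace((unsigned char)*line))
		line++;

	/* Empty lines and whole-line comments are ignored. */
	if (*line == '\0' || *line == '#')
		return NULL;

	/* Trim trailing blanks, including a newline left by fgets(). */
	end = line + strlen(line);
	while (end > line && isspace((unsigned char)end[-1]))
		end--;
	*end = '\0';

	return line;
}
```

Since a keyword is the first non-blank on its line, a parser could compare
the start of the returned string against its keyword table.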

QoS Policy file example

     # Port Groups define sets of ports to be used later in the settings
     port-groups
         # using port GUIDs
         port-group
             name: Storage
             # "use" is just a description that is used for logging.
             #  Other than that, it is just a commentary
             use: our SRP storage targets
             port-guid: 0x1000000000000001
             port-guid: 0x1000000000000002
         end-port-group

         port-group
             name: Virtual Servers
             use: node desc and IB port num
             # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
             # "hostname" and "CA-num" are compared to the first 2 words of
             # NodeDescription, and "Pnum" is a port number on that node.
             port-name: vs1/HCA-1/P1
             port-name: vs3/HCA-1/P1
             port-name: vs3/HCA-2/P2
         end-port-group

         # using partitions defined in the partition policy
         port-group
             name: Group for Partition 1
             use: default settings
             partition: Part1
         end-port-group

         # using node types CA|ROUTER|SWITCH
         port-group
             name: Routers
             use: all routers
             node-type: ROUTER
         end-port-group

     end-port-groups

     qos-setup

         # define all types of VLArb tables. The length of the tables should
         # match the physically supported tables by their target ports
         vlarb-tables
             # scope defines the exact ports the VLArb tables apply to
             vlarb-scope
                 # defining VLArb tables on all the ports that belong to
                 # port group 'Storage', and on all the ports connected
                 # to ports of port group 'Storage'
                 group: Storage
                 # "across" means all the ports that are connected to ports
                 # that belong to the specified port group
                 across: Storage
                 # VLArb table holds VL and weight pairs
                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
                 vl-high-limit: 10
             end-vlarb-scope
             # There can be several scopes
         end-vlarb-tables

         sl2vl-tables
             # Scope defines the exact devices and in/out ports tables apply to.
             # Note: if the same port is matching several rules the *FIRST* one applies.
             sl2vl-scope
                 # SL2VL tables are organized as SL2VL(in-port,out-port)
                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
                 #
                 # The following example specifies that all the SL2VL tables
                 # entries should be defined for all the ports of group Part1:
                 group: Part1
                 from: *
                 to: *
                 # SL2VL table has to have 16 values at max - one for each SL.
                 # If the user specifies less than 16 values, all the missing
                 # VL values will be implicitly set to 0
                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
             end-sl2vl-scope

             sl2vl-scope
                 # "across-to" is a combination of "across" keyword (definition can be found
                 # in VLArb tables section) and "to" keyword.
                 # "across: PortGroupName" refers to all the ports that are connected
                 # to ports that belong to PortGroupName.
                 #
                 # Example of "across-to" usage:
                 #   A user has a set of 'special' nodes (e.g. storage nodes), and all
                 #   the traffic to these nodes has to get a specific VL.
                 #   The solution is to define a port group (e.g. "Storage") that will
                 #   include all the ports of these nodes, and then to configure SL2VL
                 #   tables on all the switch ports that are connected to the Storage
                 #   port group by specifying "across-to: Storage".
                 #
                 across-to: Storage2
                 # Similar to "across-to", "across-from" is a combination of "across"
                 # and "from" keywords
                 across-from: Storage1
                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
             end-sl2vl-scope
         end-sl2vl-tables

     end-qos-setup


     qos-levels

         # the first one is just setting SL
         qos-level
             use: for the lowest priority communication
             sl: 15
             packet-life: 16
         end-qos-level
         # the second sets SL and QoS Class
         qos-level
             use: low latency best bandwidth
             sl: 0
         end-qos-level
         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
         qos-level
             use: just an example
             sl: 0
             mtu-limit: 1
             rate-limit: 1
             packet-life: 12
             # Path Bits can be used e.g. to provide different routes through the
             # subnet to a particular port
             path-bits: 2,4,8-32
         end-qos-level

     end-qos-levels


     # Match rules are scanned in a first-fit manner (like firewall rules table)
     qos-match-rules

         # matching by single criteria: class (list of values and ranges)
         qos-match-rule
             # just a description
             use: low latency by class 7-9 or 11
             qos-class: 7-9,11
             # number of qos-level to apply to the matching PR/MPR
             qos-level-sn: 1
         end-qos-match-rule
         # show matching by destination group AND service-ids
         qos-match-rule
             use: Storage targets connection
             destination: Storage
             service-id: 22,4719-5000
             qos-level-sn: 2
         end-qos-match-rule
         # show matching by source group only
         qos-match-rule
             use: bla bla
             source: Storage
             qos-level-sn: 3
         end-qos-match-rule

     end-qos-match-rules


4. IPoIB
---------

IPoIB already queries the SA for its broadcast group information. The additional
functionality required is for IPoIB to provide the broadcast group SL, MTU,
and RATE in every subsequent PathRecord query performed when a new UDAV is
needed by IPoIB.
We could assign a special Service-ID for IPoIB use, but since all communication
on the same IPoIB interface shares the same QoS-Level without the ability to
differentiate it by target service, we can ignore it for simplicity.

5. CMA features
----------------

The CMA interface supports Service-ID through the notion of port space as a
prefix to the port_num which is part of the sockaddr provided to
rdma_resolve_addr(). What is missing is the explicit request for a QoS-Class that
should allow the ULP (like SDP) to propagate a specific request for a class of
service. A mechanism for providing the QoS-Class is available in the IPv6 address,
so we could use that address field. Another option is to implement a special
connection options API for CMA.

The functionality missing from CMA is the usage of the provided QoS-Class and
Service-ID in the sent PR/MPR. When a response is obtained, it is an existing
requirement for the CMA to use the PR/MPR from the response in setting up the QP
address vector.


6. SDP
-------

SDP uses CMA for building its connections.
The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
holding the remote TCP/IP Port Number to connect to.
SDP might be provided with the SO_PRIORITY socket option. In that case the value
provided should be sent to the CMA as the TClass option of that connection.
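
The Service-ID layout above can be expressed as a fixed prefix OR-ed with
the port number. The following is a small sketch of that arithmetic only;
sdp_service_id() and SDP_SERVICE_ID_PREFIX are illustrative names, and the
value is built in host byte order (any conversion for the wire is outside
the scope of this sketch).

```c
#include <stdint.h>

/* Prefix taken from the text: Service-ID = 0x000000000001PPPP. */
#define SDP_SERVICE_ID_PREFIX 0x0000000000010000ULL

/*
 * Hypothetical helper: builds the SDP Service-ID for the remote
 * TCP/IP port. The 16-bit port fills the four PPPP hex digits.
 */
uint64_t sdp_service_id(uint16_t tcp_port)
{
	return SDP_SERVICE_ID_PREFIX | (uint64_t)tcp_port;
}
```

For example, port 0x1234 yields 0x0000000000011234.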

7. SRP
-------

The current SRP implementation uses its own CM callbacks (not CMA), so SRP should
fill in the Service-ID in the PR/MPR by itself and use that information in
setting up the QP. The T10 SRP standard specifies that the SRP Service-ID is
defined by the SRP target I/O Controller (but it should also comply with IBTA
Service-ID rules). In any case, the Service-ID is reported by the I/O Controller
in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA
reports its ability to handle QoS PR/MPRs.

8. iSER
--------
iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
is TBD.


9. OpenSM features
-------------------
The QoS related functionality to be provided by OpenSM can be split into two
main parts:

9.1. Fabric Setup
During fabric initialization the SM should parse the policy and apply its
settings to the discovered fabric elements. The following actions should be
performed:
* Parsing of policy
* Node Group identification. Warning should be provided for each node not
   specified but found.
* SL2VL settings should be validated:
   + A warning will be provided if there are no matching targets for the SL2VL
     setting statement.
   + An error message will be printed to the log file if an invalid setting is
     found. A setting is invalid if it refers to:
     - Non-existing port numbers of the target devices
     - Unsupported VLs for the target device. In the latter case the mapping to
       non-existing VLs should be replaced with VL15, i.e. packets will be dropped.
* SL2VL setting is to be performed
* VL Arbitration table settings should be validated according to the following
   rules:
   + A warning will be provided if there are no matching targets for the setting
     statement
   + An error will be provided if the port number exceeds the target ports
   + An error will be generated if the table length exceeds device capabilities
   + A warning will be generated if the table quotes a VL that is not supported
     by the target device
* VL Arbitration tables will be set on the appropriate targets
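
The SL2VL handling described above (missing values default to VL 0, as noted
in the policy file example, and unsupported VLs are replaced with VL15) might
be sketched as follows. This is an illustration under stated assumptions, not
OpenSM code: sl2vl_build_table() is a hypothetical name, and the device is
assumed to support data VLs 0 through num_data_vls-1.

```c
#include <stddef.h>
#include <stdint.h>

#define SL2VL_ENTRIES 16	/* one entry per SL */
#define VL15          15	/* mapping an SL to VL15 drops its packets */

/*
 * Hypothetical helper: fills `table` from the `n_given` VLs supplied by
 * the user. Missing entries default to VL 0. Entries naming a data VL
 * the device does not support are replaced with VL15.
 * Returns the number of entries that had to be replaced.
 */
int sl2vl_build_table(const uint8_t *given, size_t n_given,
		      unsigned int num_data_vls,
		      uint8_t table[SL2VL_ENTRIES])
{
	int replaced = 0;
	size_t sl;

	for (sl = 0; sl < SL2VL_ENTRIES; sl++) {
		uint8_t vl = sl < n_given ? given[sl] : 0;

		if (vl != VL15 && vl >= num_data_vls) {
			vl = VL15;
			replaced++;
		}
		table[sl] = vl;
	}
	return replaced;
}
```

A non-zero return value would be the point at which the error message
mentioned above is written to the log.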

9.2. PR/MPR query handling:
OpenSM should be able to enforce the provided policy on client requests.
The overall flow for such requests is: first the request is matched against the
defined match rules such that the target QoS-Level definition is found. Given
the QoS-Level, a search for path(s) is performed with the restrictions imposed
by that level. The following two sections describe these steps.

How Service-ID is carried in the PathRecord and MultiPathRecord attributes is
now standardized by the IBTA.


9.2.1. Matching rule search:
A rule is "matching" a PR/MPR request using the following criteria:
* Matching rules provide values in a list of either single value, or range of
   values. A PR/MPR field is "matching" the rule field if it is explicitly
   noted in the list of values or is one of the values covered by a range
   included in the field values list.
* Only PR/MPR fields that have their component mask bit set should be
   compared.
* For a rule to be "matching" a PR/MPR request, all the rule fields should be
   "matching" their PR/MPR fields. Consequently, a PR/MPR request that does
   not have the component mask bit set for one of the rule-defined fields
   cannot match that rule.
* A PR/MPR request that has a component mask bit set for a field
   that is not defined by the rule can still match the rule.

The algorithm to be used for searching for a rule match might be as simple as a
sequential search through all rules or enhanced for better performance. The
semantics of every rule field and its matching PR/MPR field are described
below:
* Source: the SGID or SLID should be part of this group
* Destination: the DGID or DLID should be part of this group
* Service-ID: check if the requested Service-ID (available in the PR/MPR old
   SM-Key field) matches any of this rule's Service-IDs
* TClass: check if the PR/MPR TClass field matches
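
The matching criteria above can be illustrated with a sequential first-fit
search. This is only a sketch: the structures, the component mask bit values
and the function names are hypothetical, and the Source/Destination port
group checks are left out to keep it short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical component mask bits; the real layout is defined by the SA. */
#define REQ_HAS_SERVICE_ID (1u << 0)
#define REQ_HAS_QOS_CLASS  (1u << 1)

struct value_range {
	uint64_t lo, hi;	/* a single value is stored as lo == hi */
};

struct match_rule {
	const struct value_range *service_ids;
	size_t n_service_ids;	/* 0: rule does not define this field */
	const struct value_range *qos_classes;
	size_t n_qos_classes;	/* 0: rule does not define this field */
	int qos_level;
};

struct pr_request {
	unsigned int comp_mask;
	uint64_t service_id;
	uint64_t qos_class;
};

static bool in_ranges(const struct value_range *r, size_t n, uint64_t v)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (v >= r[i].lo && v <= r[i].hi)
			return true;
	return false;
}

/* A field the rule does not define never prevents a match. */
static bool field_matches(const struct value_range *r, size_t n,
			  bool present, uint64_t v)
{
	if (n == 0)
		return true;
	/* Rule defines the field, so the request must carry it. */
	return present && in_ranges(r, n, v);
}

/* First-fit search; returns default_level when no rule matches. */
int find_qos_level(const struct match_rule *rules, size_t n_rules,
		   const struct pr_request *req, int default_level)
{
	size_t i;

	for (i = 0; i < n_rules; i++) {
		const struct match_rule *m = &rules[i];

		if (field_matches(m->service_ids, m->n_service_ids,
				  req->comp_mask & REQ_HAS_SERVICE_ID,
				  req->service_id) &&
		    field_matches(m->qos_classes, m->n_qos_classes,
				  req->comp_mask & REQ_HAS_QOS_CLASS,
				  req->qos_class))
			return m->qos_level;
	}
	return default_level;
}
```

With the example policy's rules (qos-class 7-9,11 and service-id 22,4719-5000),
a request carrying only QoS-Class 8 would resolve to the first rule, and a
request carrying only Service-ID 4800 to the second.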

9.2.2. PR/MPR response generation:
The QoS-Level pointed to by the first rule that matches the PR/MPR request
should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
and QoS-Class. A default QoS-Level should be used if no rule matches the query.

An efficient algorithm for finding paths that meet the QoS-Level criteria is
beyond the scope of this RFC and is left for the implementer to provide. However,
the criteria by which the paths match the QoS-Level are described below:

* SL: The paths found should all use the given SL. To that end, the PR/MPR
   algorithm should traverse the path from source to destination only through
   ports that carry a valid VL (not VL15) by the SL2VL map (should consider input
   and output ports and SL).
* MTU-Limit: The resulting path's MTU should not exceed the given MTU-Limit
* Rate-Limit: The resulting path's RATE should not exceed the given RATE-Limit
   (rate limit is given in units of link BW = Width*Speed according to IBTA
   Specification Vol-1 table-205 p-901 l-24).
* Path-Bits: define the target LID lowest bits (number of bits defined by the
   target port PortInfo.LMC field). The path should traverse the LFT using the
   target port LID with the path-bits set.
* QoS-Class: should be returned in the result PR/MPR. When routing is going to
   be supported by OpenSM we might use this field in selecting the target
   router too in a TBD way.


From sashak at voltaire.com  Sat Jul 21 15:44:20 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 22 Jul 2007 01:44:20 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts: Handle new and
	old topology file format
In-Reply-To: <46A0D03E.mail35T1S2JP6@systemfabricworks.com>
References: <46A0D03E.mail35T1S2JP6@systemfabricworks.com>
Message-ID: <20070721224419.GP16597@sashak.voltaire.com>

On 10:09 Fri 20 Jul     , davem at systemfabricworks.com wrote:
> 
> 
>    Fix infiniband-diags scripts to handle changed ibnetdiscover topology file
>    format and remain backward compatible with old file format.
> 
> Signed-off-by: David A. McMillen <davem at systemfabricworks.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sat Jul 21 15:44:46 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 22 Jul 2007 01:44:46 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover: Fix DDR
	link speed decode
In-Reply-To: <46A0D0C3.mail37511HD5H@systemfabricworks.com>
References: <46A0D0C3.mail37511HD5H@systemfabricworks.com>
Message-ID: <20070721224446.GQ16597@sashak.voltaire.com>

On 10:12 Fri 20 Jul     , davem at systemfabricworks.com wrote:
> 
> 
>    Fix ibnetdiscover DDR link speed decode by moving string from [3] to [2].
> 
> Signed-off-by: David A. McMillen <davem at systemfabricworks.com>

Applied. Thanks.

Sasha


From pradeeps at linux.vnet.ibm.com  Sat Jul 21 15:46:15 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sat, 21 Jul 2007 15:46:15 -0700
Subject: [ofa-general] NOSRQ misc patch [PATCH V1]
Message-ID: <46A28CB7.1040509@linux.vnet.ibm.com>

This patch is to be applied on top of the IPOIB CM (NOSRQ) [PATCH V8].
This fixes the issues that Roland and Michael pointed out and more.

Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
---

--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-21 17:50:47.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-21 18:20:29.000000000 -0400
@@ -101,7 +101,6 @@ enum {
 #define	IPOIB_CM_OP_RECV (1ul << 30)
 
 #define NOSRQ_INDEX_TABLE_SIZE 128
-#define NOSRQ_INDEX_MASK      (NOSRQ_INDEX_TABLE_SIZE -1)
 #else
 #define	IPOIB_CM_OP_RECV (0)
 #endif
@@ -447,6 +446,7 @@ void ipoib_drain_cq(struct net_device *d
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
 
+extern int max_rc_qp ;
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-21 17:50:47.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-21 18:08:15.000000000 -0400
@@ -49,17 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level,
 
 #include "ipoib.h"
 
-static int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
+int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
 static int max_recv_buf = 1024; /* Default is 1024 MB */
 
 module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
-MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
+MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported; must be a power of 2");
 
 module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
 MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
 
 static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for NOSRQ */
 
+#define NOSRQ_INDEX_MASK      (max_rc_qp -1)
 #define IPOIB_CM_IETF_ID 0x1000000000000000ULL
 
 #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ)
@@ -1024,6 +1025,7 @@ void dev_stop_nosrq(struct ipoib_dev_pri
 	spin_unlock_irq(&priv->lock);
 
 	cancel_delayed_work(&priv->cm.stale_task);
+	kfree(priv->cm.rx_index_table);
 }
 
 void ipoib_cm_dev_stop(struct net_device *dev)
@@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
 	attr.recv_cq = priv->cq;
 	attr.srq = priv->cm.srq;
 	attr.cap.max_send_wr = ipoib_sendq_size;
-	attr.cap.max_recv_wr = 1;
+	attr.cap.max_recv_wr = 0;
 	attr.cap.max_send_sge = 1;
-	attr.cap.max_recv_sge = 1;
+	attr.cap.max_recv_sge = 0;
 	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
 	attr.qp_type = IB_QPT_RC;
 	attr.send_cq = cq;
@@ -1710,11 +1712,11 @@ int ipoib_cm_dev_init(struct net_device 
 		 * passive_ids. For quick and easy access we maintain a table
 		 * of pointers to struct ipoib_cm_rx called the rx_index_table
 		 */
-		priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE *
-					 sizeof *priv->cm.rx_index_table,
-					 GFP_KERNEL);
+		priv->cm.rx_index_table = kcalloc(max_rc_qp,
+						  sizeof *priv->cm.rx_index_table,
+						  GFP_KERNEL);
 		if (!priv->cm.rx_index_table) {
-			printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n");
+			printk(KERN_WARNING "Failed to allocate rx_index_table\n");
 			return -ENOMEM;
 		}
 	}
--- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-21 17:50:47.000000000 -0400
+++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-21 18:09:26.000000000 -0400
@@ -180,11 +180,11 @@ int ipoib_transport_dev_init(struct net_
 	/* We increase the size of the CQ in the NOSRQ case to prevent CQ
 	 * overflow. Every new REQ creates a new RX QP and each QP has an
 	 * RX ring associated with it. Therefore we could have
-	 * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
+	 * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs
 	 * in a CQ.
 	 */
 	if (!priv->cm.srq)
-		size += (NOSRQ_INDEX_TABLE_SIZE - 1) * ipoib_recvq_size;
+		size += (max_rc_qp - 1) * ipoib_recvq_size;
 #endif
 
 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
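
A note on the new NOSRQ_INDEX_MASK: masking with (max_rc_qp - 1) only
behaves like "id modulo table size" when max_rc_qp is a power of two,
which is what the updated parameter description asks for. The following
is a stand-alone userspace sketch, not part of the patch; is_pow2() and
index_from_id() are hypothetical helper names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when n is a non-zero power of two. */
static bool is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: reduce an id to a table index with a mask.
 * The result equals id % table_size only if table_size is a power of two.
 */
static unsigned int index_from_id(uint32_t id, unsigned int table_size)
{
	return id & (table_size - 1);
}
```

With a table size of 128, id 130 maps to slot 2, the same as 130 % 128.
With a table size of 100 the mask is 99, and the same id also maps to
slot 2 rather than 30, so some slots could never be selected. A check
along the lines of is_pow2() at init time is one way to reject such
values.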


From kliteyn at mellanox.co.il  Sat Jul 21 21:40:22 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 22 Jul 2007 07:40:22 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-22:normal completion
Message-ID: <MTLEXCH01diixjnzi95000020a0@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=559  Fail=1
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
13 LidMgr IS3-128.topo

Failures:
1 LidMgr IS3-128.topo


From mst at dev.mellanox.co.il  Sat Jul 21 23:05:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 09:05:57 +0300
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <46A28CB7.1040509@linux.vnet.ibm.com>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
Message-ID: <20070722060557.GB20438@mellanox.co.il>

> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>  	attr.recv_cq = priv->cq;
>  	attr.srq = priv->cm.srq;
>  	attr.cap.max_send_wr = ipoib_sendq_size;
> -	attr.cap.max_recv_wr = 1;
> +	attr.cap.max_recv_wr = 0;
>  	attr.cap.max_send_sge = 1;
> -	attr.cap.max_recv_sge = 1;
> +	attr.cap.max_recv_sge = 0;
>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>  	attr.qp_type = IB_QPT_RC;
>  	attr.send_cq = cq;

I don't see how this fixes things.
This line 
>  	attr.srq = priv->cm.srq;
connected the TX QP to SRQ, making it possible to get packets on this QP.
But if cm.srq is NULL, and a remote sends a packet on this connection,
the connection will get closed. Which is a quality of implementation issue.

-- 
MST


From ogerlitz at voltaire.com  Sat Jul 21 23:17:58 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 22 Jul 2007 09:17:58 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <469DD7BB.6060009@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com>	<adahco943ip.fsf@cisco.com>	<4696D1F3.2040507@ichips.intel.com>	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>	<adaabtuo0n9.fsf@cisco.com>	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>	<20070718050928.GA3103@obsidianresearch.com>	<ada7ioymfsw.fsf@cisco.com>	<20070718072841.GC1115@mellanox.co.il>
	<469DD7BB.6060009@voltaire.com>
Message-ID: <46A2F696.4060007@voltaire.com>

Or Gerlitz wrote:
> Michael S. Tsirkin wrote:

>>> And ARP table aging gives a way to recover
>>> from stale cached data, eventually at least.

>> Does it?

>> $ grep path_list drivers/infiniband/ulp/ipoib/*c
>> drivers/infiniband/ulp/ipoib/ipoib_main.c: list_add_tail(&path->list, &priv->path_list);
>> drivers/infiniband/ulp/ipoib/ipoib_main.c: list_splice(&priv->path_list, &remove_list);
>> drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list);
>> drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list);

>> In other words we add paths to ipoib specific cache, but we never seem
>> to *remove* individual paths from cache - we only know how to do
>> full cache invalidates on events such as port state change.

> this seems like a bug, if the stack decided to delete OR change a 
> neighbour, the path associated with it must not be re-used to create the 
> address handle or to establish the connection, same for multicast 
> neighbours.

Roland,

Can you provide your take here?

Do you agree that using cached IB L2 info where the net stack wants to 
renew its IPoIB L2 (which is IB L3 && L4) info is a bug?

Or.


From krkumar2 at in.ibm.com  Sat Jul 21 23:27:54 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Sun, 22 Jul 2007 11:57:54 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <1185023921.5192.45.camel@localhost>
Message-ID: <OF0F9581FE.472D94A6-ON65257320.001FC787-65257320.00238386@in.ibm.com>

Hi Jamal,

J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/21/2007 06:48:41 PM:

> >    - Use a single qdisc interface to avoid code duplication and improve
> >      maintainability (sch_generic.c size reduces by ~9%).
> >    - Has per device configurable parameter to turn on/off batching.
> >    - qdisc_restart gets slightly modified while looking simple without
> >      any checks for batching vs regular code (in fact only two lines
> >      have changed - 1. instead of dev_dequeue_skb, a new batch-aware
> >      function is called; and 2. an extra call to hard_start_xmit_batch).
>
> >    - No change in __qdisc_run other than a new argument (from DM's
> >      idea).
> >    - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.
>
> All the above are cosmetic differences. To me is the highest priority
> is making sure that batching is useful and what the limitations are.
> At some point, when all looks good - i dont mind adding an ethtool
> interface to turn off/on batching, merge with the new qdisc restart path
> instead of having a parallel path, solicit feedback on naming, where to
> allocate structs etc etc. All that is low prio if batching across a
> variety of hardware and applications doesnt prove useful. At the moment,
> i am unsure theres consistency to justify push batching in.

Batching need not be useful for every hardware. If there is hardware that
can exploit batching (IPoIB is clearly a good candidate, as both the TX and
the TX completion paths can handle multiple skbs; I haven't looked at other
drivers to see if any of them can do something similar), then IMHO it makes
sense to enable batching for that hardware. It is up to the other drivers
to determine whether converting to the batching API makes sense or not.
And as indicated, the total size increase for adding the kernel support is
also insignificant - 0.03%, or 1164 bytes (using the 'size' command).

> Having said that below are the main architectural differences we have
> which is what we really need to discuss and see what proves useful:
>
> >    - Batching algo/processing is different (e.g. if qdisc_restart()
> >      finds one skb in the batch list, it will try to batch more (up to
> >      a limit) instead of sending that out and batching the rest in the
> >      next call).
>
> This sounds a little more aggressive but maybe useful.
> I have experimented with setting upper bound limits (current patches
> have a pktgen interface to set the max to send) and have concluded that
> it is unneeded. Probing by letting the driver tell you what space is
> available has proven to be the best approach. I have been meaning to
> remove the code in pktgen which allows these limits.

I don't quite agree with that approach. E.g., if the blist is empty and the
driver says there is space for one packet, you will add one packet, the
driver sends it out, and the device is stopped (with potentially a lot of
skbs on dev->q). Then no packets are added till the queue is enabled, at
which time a flood of skbs will be processed, increasing latency and
holding the lock for a single longer duration. My approach avoids holding
the lock for longer times and instead sends skbs to the device as long as
we are within the limits.

In fact, in my rev2 patch (being sent today or tomorrow after handling
Patrick's and Stephen's comments), I am even removing the driver-specific
xmit_slots, as I find it adds bloat and requires more cycles than
calculating the value each time xmit is done (of course in your approach it
is required since the stack uses it).

> >    - Jamal's code has a separate hw prep handler called from the stack,
> >      and results are accessed in driver during xmit later.
>
> I have explained the reasoning to this a few times. A recent response to
> Michael Chan is here:
> http://marc.info/?l=linux-netdev&m=118346921316657&w=2

Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find
it), I feel having prep will not help, as no other CPU can execute the
queue/xmit code anyway (E1000 is also an LLTX driver). Other drivers that
hold the TX lock could see an improvement, however.

> And heres a response to you that i havent heard back on:
> http://marc.info/?l=linux-netdev&m=118355539503924&w=2

That is because it answered my query :) It is what I was expecting, but
thanks for the explanation.

> My tests so far indicate this interface is useful. It doesnt apply well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
performance is affected. If it helps, I guess you could send me a patch to
add that and I can also test it to see what the effect is. I didn't add it
since IPoIB wouldn't be able to exploit it (unless someone is kind enough
to show me how to).

> So if i was to sum up this, (it would be useful discussion to have on
> these) the real difference is:
>
> a) you have an extra check on refilling the skb list when you find that
> it has a single skb. I tagged this as being potentially useful.

It is very useful since extra processing is not required for the one-skb
case. Otherwise you remove it from one list, unnecessarily add it to a
different list, and then delete it immediately in the driver, when all that
was required is to pass the skb directly to the driver using its original
API. (Of course the caveat is that I also have a check to add that *single*
skb to the blist in case there are already earlier skbs on the blist; this
helps in batching and, more importantly, sends skbs in order.)

> b) You have a check for some upper bound on the number of skbs to send
> to the driver. I tagged this as unnecessary - the interface is still on
> in my current code, so it shouldnt be hard to show one way or other.

Explained earlier wrt latency.

> c) You dont have prep_xmit()
>
> Add to that list any other architectural differences i may have missed
> and lets discuss and hopefully make some good progress.

I think the code I have is ready and stable, and the issues pointed out so
far are also incorporated, to be sent out today. Please let me know if
you want to add something to it.

Thanks for your review/comments,

- KK


From eitan at mellanox.co.il  Sat Jul 21 23:36:06 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 22 Jul 2007 09:36:06 +0300
Subject: [ofa-general] opensm: a bug in heavy sweep? - no LFT
	re-configuration
Message-ID: <863azhrlm1.fsf@sw053.lab.mtl.com>

Hi Sasha

I am running some tests manually and apparently it looks like 
I found a bug. Here is the sequence of things:
1. SM sweeps the fabric and assigns LFTs
2. I manually modify some LFTs (single entry now marked UNREACHABLE)
3. I force some switch change bit to 1 or issue kill -HUP
4. The SM reports SUBNET UP
5. The modified LFT entry is still UNREACHABLE and the path is broken

It looks to me like some routing optimization does not fully reroute
unless some condition is met - but that condition does not include the
triggers listed in step 3.

Thanks

Eitan


From pradeeps at linux.vnet.ibm.com  Sun Jul 22 00:16:10 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sun, 22 Jul 2007 00:16:10 -0700
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <20070722060557.GB20438@mellanox.co.il>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
Message-ID: <46A3043A.3030200@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>  	attr.recv_cq = priv->cq;
>>  	attr.srq = priv->cm.srq;
>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>> -	attr.cap.max_recv_wr = 1;
>> +	attr.cap.max_recv_wr = 0;
>>  	attr.cap.max_send_sge = 1;
>> -	attr.cap.max_recv_sge = 1;
>> +	attr.cap.max_recv_sge = 0;
>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>  	attr.qp_type = IB_QPT_RC;
>>  	attr.send_cq = cq;
> 
> I don't see how does this fix things.
> This line 
>>  	attr.srq = priv->cm.srq;
> connected the TX QP to SRQ, making it possible to get packets on this QP.
> But if cm.srq is NULL, and a remote sends a packet on this connection,
> the connection will get closed. Which is a quality of implementation issue.
> 
When the QP numbers are exchanged correctly, it should not receive
a packet on this QP in the first place. That is an error case and so should
be a rare event. Assuming that still happens, the connection should be set
up again because it is an RC connection.

Pradeep


From mst at dev.mellanox.co.il  Sun Jul 22 00:20:43 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 10:20:43 +0300
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <46A3043A.3030200@linux.vnet.ibm.com>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
Message-ID: <20070722072043.GB7188@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: NOSRQ misc patch [PATCH V1]
> 
> Michael S. Tsirkin wrote:
> >> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
> >>  	attr.recv_cq = priv->cq;
> >>  	attr.srq = priv->cm.srq;
> >>  	attr.cap.max_send_wr = ipoib_sendq_size;
> >> -	attr.cap.max_recv_wr = 1;
> >> +	attr.cap.max_recv_wr = 0;
> >>  	attr.cap.max_send_sge = 1;
> >> -	attr.cap.max_recv_sge = 1;
> >> +	attr.cap.max_recv_sge = 0;
> >>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> >>  	attr.qp_type = IB_QPT_RC;
> >>  	attr.send_cq = cq;
> > 
> > I don't see how does this fix things.
> > This line 
> >>  	attr.srq = priv->cm.srq;
> > connected the TX QP to SRQ, making it possible to get packets on this QP.
> > But if cm.srq is NULL, and a remote sends a packet on this connection,
> > the connection will get closed. Which is a quality of implementation issue.
> > 
> When the QP numbers are exchanged correctly, then it should not receive
> a packet on this QP in the first place.

Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
packets. We don't do this currently but we might in the future.

> That is an error case and so should
> be a rare event. Assuming that still happens, that should be setup again 
> because it is an RC connection.

Won't it get closed immediately again once the remote tries to use it?

-- 
MST


From vlad at lists.openfabrics.org  Sun Jul 22 01:37:11 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 22 Jul 2007 01:37:11 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070722-0100 daily build status
Message-ID: <20070722083711.537DDE60825@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From krkumar2 at in.ibm.com  Sun Jul 22 02:04:57 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:34:57 +0530
Subject: [ofa-general] [PATCH 00/12 -Rev2] Implement batching skb API
Message-ID: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>

This set of patches implements the batching API, and makes the following
changes resulting from the review of the first set:

Changes :
---------
1.  Changed skb_blist from pointer to static as it saves only 12 bytes
    (i386), but bloats the code.
2.  Removed requirement for driver to set "features & NETIF_F_BATCH_SKBS"
    in register_netdev to enable batching as it is redundant. Changed this
    flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other
    user-changeable calls can modify this bit to enable/disable batching.
3.  Added ethtool support to enable/disable batching (not tested).
4.  Added rtnetlink support to enable/disable batching (not tested).
5.  Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers
    should not have a small queue anyway (adding bloat).
6.  skbs are purged from dev_deactivate instead of from unregister_netdev
    to drop all references to the device.
7.  Removed changelog in source code in sch_generic.c, and unrelated renames
    from sch_generic.c (lockless, comments).
8.  Removed xmit_slots entirely, as it was adding bloat (code and header)
    and not adding value (it is calculated and set twice in internal send
    routine and handle work completion, and referenced once in batch xmit;
    and can instead be calculated once in xmit).

Issues :
--------
1. Remove /sysfs support completely ?
2. Whether rtnetlink support is required as GSO has only ethtool ?

Patches are described as:
	Mail 0/12  : This mail.
	Mail 1/12  : HOWTO documentation.
	Mail 2/12  : Changes to netdevice.h
	Mail 3/12  : dev.c changes.
	Mail 4/12  : Ethtool changes.
	Mail 5/12  : sysfs changes.
	Mail 6/12  : rtnetlink changes.
	Mail 7/12  : Change in qdisc_run & qdisc_restart API, modify callers
		     to use this API.
	Mail 8/12  : IPoIB include file changes.
	Mail 9/12  : IPoIB verbs changes
	Mail 10/12 : IPoIB multicast, CM changes
	Mail 11/12 : IPoIB xmit API addition
	Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)

I have started a 10 run test for various buffer sizes and processes, and
will post the results on Monday.

Please review and provide feedback/ideas; and consider for inclusion.

Thanks,

- KK


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:06 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:06 +0530
Subject: [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching
	SKB.
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090506.7787.69681.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/Documentation/networking/Batching_skb_API.txt rev2/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt	1970-01-01 05:30:00.000000000 +0530
+++ rev2/Documentation/networking/Batching_skb_API.txt	2007-07-20 16:09:45.000000000 +0530
@@ -0,0 +1,91 @@
+		 HOWTO for batching skb API support
+		 -----------------------------------
+
+Section 1: What is batching skb API ?
+Section 2: How batching API works vs the original API ?
+Section 3: How drivers can support this API ?
+Section 4: How users can work with this API ?
+
+
+Introduction: Kernel support for batching skb
+-----------------------------------------------
+
+An extended API is supported in the netdevice layer, which is very similar
+to the existing hard_start_xmit() API. Drivers which wish to take advantage
+of this new API should implement this routine similar to how the
+hard_start_xmit handler is written. The difference between these APIs is
+that while the existing hard_start_xmit processes one skb, the new API can
+process multiple skbs (or even one) in a single call. It is also possible
+for the driver writer to re-use most of the code from the existing API in
+the new API without having code duplication.
+
+
+Section 1: What is batching skb API ?
+-------------------------------------
+
+	This is a new API that is optionally exported by a driver. The pre-
+	requisite for a driver to use this API is that it should have a
+	reasonably sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching API works vs the original API ?
+-------------------------------------------------------
+
+	The networking stack normally gets called from upper layer protocols
+	with a single skb to xmit. This skb is first enqueued and an
+	attempt is next made to transmit it immediately (via qdisc_run).
+	However, events like driver lock contention, queue stopped, etc, can
+	result in the skb not getting sent out, and it remains in the queue.
+	When a new xmit is called or when the queue is re-enabled, qdisc_run
+	could potentially find multiple packets in the queue, and have to
+	send them all out one by one iteratively.
+
+	The batching skb API case was added to exploit this situation where
+	if there are multiple skbs, all of them can be sent to the device in
+	one shot. This reduces driver processing, locking at the driver (or
+	in stack for ~LLTX drivers) gets amortized over multiple skbs, and
+	in case of specific drivers where every xmit results in a completion
+	processing (like IPoIB), optimizations could be made in the driver
+	to get a completion for only the last skb that was sent which will
+	result in saving interrupts for every (but the last) skb that was
+	sent in the same batch.
+
+	This batching can result in significant performance gains for
+	systems that have multiple data stream paths over the same network
+	interface card.
+
+
+Section 3: How drivers can support this API ?
+---------------------------------------------
+
+	The new API - dev->hard_start_xmit_batch(struct net_device *dev),
+	simplistically, can be written almost identically to the regular
+	xmit API (hard_start_xmit), except that all skbs on dev->skb_blist
+	should be processed by the driver instead of just one skb. The new
+	API doesn't get any skb as argument to process, instead it picks up
+	all the skbs from dev->skb_blist, where it was added by the stack,
+	and tries to send them out.
+
+	Batching requires dev->hard_start_xmit_batch to point to the new
+	API implemented for that driver. register_netdevice() then sets
+	the NETIF_F_BATCH_ON bit in dev->features to enable batching.
+
+
+Section 4: How users can work with this API ?
+---------------------------------------------
+
+	Batching could be disabled for a particular device, e.g. on desktop
+	systems if only one stream of network activity for that device is
+	taking place, since performance could be slightly affected due to
+	extra processing that batching adds. Batching can be enabled if
+	more than one stream of network activity per device is being done,
+	e.g. on servers, or even desktop usage with multiple browser, chat,
+	file transfer sessions, etc.
+
+	Per device batching can be enabled/disabled using:
+
+	echo 1 > /sys/class/net/<device-name>/tx_batch_skbs (enable)
+	echo 0 > /sys/class/net/<device-name>/tx_batch_skbs (disable)
+
+	E.g. to enable batching on eth0, run:
+		echo 1 > /sys/class/net/eth0/tx_batch_skbs
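
To make the completion saving described in Section 2 concrete, here is a
minimal userspace model of a batch transmit. It is only a sketch under
stated assumptions: struct tx_ring, struct pkt, RING_SIZE and xmit_batch()
are hypothetical names, a plain array of ids stands in for dev->skb_blist,
and no real driver or verbs call is involved. The point it shows is that
one call posts several packets and asks for a completion only on the last
one posted.

```c
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical hardware TX ring depth */

struct pkt {
	int id;
	int signaled;	/* 1 if a completion was requested for this packet */
};

struct tx_ring {
	struct pkt slots[RING_SIZE];
	size_t used;	/* number of slots currently posted */
};

/*
 * Post up to n packets in one call, stopping when the ring is full.
 * Only the last packet posted by this call requests a completion.
 * Returns the number of packets actually posted.
 */
static size_t xmit_batch(struct tx_ring *ring, const int *ids, size_t n)
{
	size_t posted = 0;

	while (posted < n && ring->used < RING_SIZE) {
		struct pkt *p = &ring->slots[ring->used++];

		p->id = ids[posted++];
		p->signaled = 0;
	}
	if (posted)
		ring->slots[ring->used - 1].signaled = 1;
	return posted;
}

/* Count how many posted packets requested a completion. */
static size_t count_signaled(const struct tx_ring *ring)
{
	size_t i, n = 0;

	for (i = 0; i < ring->used; i++)
		if (ring->slots[i].signaled)
			n++;
	return n;
}
```

In this model, posting five packets into an empty ring yields a single
completion request instead of five. A second batch of five only fits three
packets; the caller would keep the remaining two queued for a later call.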


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:16 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:16 +0530
Subject: [ofa-general] [PATCH 02/12 -Rev2] Changes to netdevice.h
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/include/linux/netdevice.h rev2/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
+++ rev2/include/linux/netdevice.h	2007-07-22 13:20:16.000000000 +0530
@@ -340,6 +340,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BATCH_ON	8192	/* Batching skbs xmit API is enabled */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
@@ -452,6 +453,7 @@ struct net_device
 	struct Qdisc		*qdisc_sleeping;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+	struct sk_buff_head	skb_blist;	/* List of batch skbs */
 
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
@@ -472,6 +474,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -582,6 +587,8 @@ struct net_device
 #define	NETDEV_ALIGN		32
 #define	NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)
 
+#define BATCHING_ON(dev)	(((dev)->features & NETIF_F_BATCH_ON) != 0)
+
 static inline void *netdev_priv(const struct net_device *dev)
 {
 	return dev->priv;
@@ -832,6 +839,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);
 
 extern void		dev_init(void);
 
@@ -1104,6 +1113,8 @@ extern void		dev_set_promiscuity(struct 
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
+extern int		dev_change_tx_batching(struct net_device *dev,
+					       unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern void		dev_load(const char *name);
 extern void		dev_mcast_init(void);
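
The BATCHING_ON() test could also be written as a static inline function,
which type-checks its argument and accepts any pointer expression. This is
a userspace sketch only: struct fake_dev is a hypothetical stand-in for
struct net_device, batching_on() is a hypothetical name, and the flag
values are copied from the hunk above.

```c
#include <stdbool.h>

#define NETIF_F_LLTX		4096	/* LockLess TX */
#define NETIF_F_BATCH_ON	8192	/* Batching skbs xmit API is enabled */
#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */

/* Hypothetical stand-in for the one field of struct net_device we need. */
struct fake_dev {
	unsigned long features;
};

/* Same test as BATCHING_ON(), expressed as a function. */
static inline bool batching_on(const struct fake_dev *dev)
{
	return (dev->features & NETIF_F_BATCH_ON) != 0;
}
```

A call such as batching_on(devs + 1) works as expected here, because the
argument is evaluated as one expression before the member access.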


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:25 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:25 +0530
Subject: [ofa-general] [PATCH 03/12 -Rev2] dev.c changes.
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/dev.c	2007-07-21 23:08:33.000000000 +0530
@@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi
 	}
 }
 
+/*
+ * dev_change_tx_batching - Enable or disable batching for a driver that
+ * supports batching.
+ */
+int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
+{
+	int ret;
+
+	if (!dev->hard_start_xmit_batch) {
+		/* Driver doesn't support skb batching */
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	/* Handle invalid argument */
+	if (new_batch_skb < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = 0;
+
+	/* Check if new value is same as the current */
+	if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
+		goto out;
+
+	spin_lock(&dev->queue_lock);
+	if (new_batch_skb) {
+		dev->features |= NETIF_F_BATCH_ON;
+		dev->tx_queue_len >>= 1;
+	} else {
+		if (!skb_queue_empty(&dev->skb_blist))
+			skb_queue_purge(&dev->skb_blist);
+		dev->features &= ~NETIF_F_BATCH_ON;
+		dev->tx_queue_len <<= 1;
+	}
+	spin_unlock(&dev->queue_lock);
+
+out:
+	return ret;
+}
+
 /**
  *	dev_load 	- load a network module
  *	@name: name of interface
@@ -1414,6 +1456,45 @@ static int dev_gso_segment(struct sk_buf
 	return 0;
 }
 
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(&dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(&dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3397,6 +3483,12 @@ int register_netdevice(struct net_device
 		}
 	}
 
+	if (dev->hard_start_xmit_batch) {
+		dev->features |= NETIF_F_BATCH_ON;
+		skb_queue_head_init(&dev->skb_blist);
+		dev->tx_queue_len >>= 1;
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
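
The enable/disable logic in dev_change_tx_batching() can be modelled in
userspace as below. This is a simplified sketch, assuming a hypothetical
struct fake_dev in place of struct net_device and leaving out the locking,
the skb_blist purge and the hard_start_xmit_batch check; only the flag and
tx_queue_len handling follows the patch.

```c
#define NETIF_F_BATCH_ON	8192UL

/* Hypothetical stand-in for the two net_device fields being modified. */
struct fake_dev {
	unsigned long features;
	unsigned long tx_queue_len;
};

/*
 * Turn batching on or off. A request matching the current state is a
 * no-op, so repeated enables do not keep halving the queue length.
 */
static int change_tx_batching(struct fake_dev *dev, unsigned long on)
{
	if (!!(dev->features & NETIF_F_BATCH_ON) == !!on)
		return 0;	/* already in the requested state */

	if (on) {
		dev->features |= NETIF_F_BATCH_ON;
		dev->tx_queue_len >>= 1;	/* halve the qdisc queue */
	} else {
		dev->features &= ~NETIF_F_BATCH_ON;
		dev->tx_queue_len <<= 1;	/* restore by doubling */
	}
	return 0;
}
```

One detail the model makes visible: since enabling shifts right and
disabling shifts left, an odd tx_queue_len does not round-trip. Starting
from 1001, enabling gives 500 and disabling gives 1000.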


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:35 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:35 +0530
Subject: [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090534.7787.8673.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/include/linux/ethtool.h rev2/include/linux/ethtool.h
--- org/include/linux/ethtool.h	2007-07-21 13:39:50.000000000 +0530
+++ rev2/include/linux/ethtool.h	2007-07-21 13:40:57.000000000 +0530
@@ -414,6 +414,8 @@ struct ethtool_ops {
 #define ETHTOOL_SUFO		0x00000022 /* Set UFO enable (ethtool_value) */
 #define ETHTOOL_GGSO		0x00000023 /* Get GSO enable (ethtool_value) */
 #define ETHTOOL_SGSO		0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBTX		0x00000025 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBTX		0x00000026 /* Set Batching (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff -ruNp org/net/core/ethtool.c rev2/net/core/ethtool.c
--- org/net/core/ethtool.c	2007-07-21 13:37:17.000000000 +0530
+++ rev2/net/core/ethtool.c	2007-07-21 22:55:38.000000000 +0530
@@ -648,6 +648,26 @@ static int ethtool_set_gso(struct net_de
 	return 0;
 }
 
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata = { ETHTOOL_GBTX };
+
+	edata.data = BATCHING_ON(dev);
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		return -EFAULT;
+	return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	return dev_change_tx_batching(dev, edata.data);
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
 	struct ethtool_test test;
@@ -959,6 +979,12 @@ int dev_ethtool(struct ifreq *ifr)
 	case ETHTOOL_SGSO:
 		rc = ethtool_set_gso(dev, useraddr);
 		break;
+	case ETHTOOL_GBTX:
+		rc = ethtool_get_batch(dev, useraddr);
+		break;
+	case ETHTOOL_SBTX:
+		rc = ethtool_set_batch(dev, useraddr);
+		break;
 	default:
 		rc =  -EOPNOTSUPP;
 	}


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:44 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:44 +0530
Subject: [ofa-general] [PATCH 05/12 -Rev2] sysfs changes.
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090544.7787.87947.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/net/core/net-sysfs.c rev2/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/net-sysfs.c	2007-07-21 22:56:32.000000000 +0530
@@ -230,6 +230,21 @@ static ssize_t store_weight(struct devic
 	return netdev_store(dev, attr, buf, len, change_weight);
 }
 
+static ssize_t show_tx_batch_skb(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+
+	return sprintf(buf, fmt_dec, BATCHING_ON(netdev));
+}
+
+static ssize_t store_tx_batch_skb(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, dev_change_tx_batching);
+}
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +261,8 @@ static struct device_attribute net_class
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+	__ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skb,
+	       store_tx_batch_skb),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
 	{}
 };


From krkumar2 at in.ibm.com  Sun Jul 22 02:05:53 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:35:53 +0530
Subject: [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes.
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
--- org/include/linux/if_link.h	2007-07-20 16:33:35.000000000 +0530
+++ rev2/include/linux/if_link.h	2007-07-20 16:35:08.000000000 +0530
@@ -78,6 +78,8 @@ enum
 	IFLA_LINKMODE,
 	IFLA_LINKINFO,
 #define IFLA_LINKINFO IFLA_LINKINFO
+	IFLA_TXBTHSKB,		/* Driver support for Batch'd skbs */
+#define IFLA_TXBTHSKB IFLA_TXBTHSKB
 	__IFLA_MAX
 };
 
diff -ruNp org/net/core/rtnetlink.c rev2/net/core/rtnetlink.c
--- org/net/core/rtnetlink.c	2007-07-20 16:31:59.000000000 +0530
+++ rev2/net/core/rtnetlink.c	2007-07-21 22:27:10.000000000 +0530
@@ -634,6 +634,7 @@ static int rtnl_fill_ifinfo(struct sk_bu
 
 	NLA_PUT_STRING(skb, IFLA_IFNAME, dev->name);
 	NLA_PUT_U32(skb, IFLA_TXQLEN, dev->tx_queue_len);
+	NLA_PUT_U32(skb, IFLA_TXBTHSKB, BATCHING_ON(dev));
 	NLA_PUT_U32(skb, IFLA_WEIGHT, dev->weight);
 	NLA_PUT_U8(skb, IFLA_OPERSTATE,
 		   netif_running(dev) ? dev->operstate : IF_OPER_DOWN);
@@ -833,7 +834,8 @@ static int do_setlink(struct net_device 
 
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
-
+	if (tb[IFLA_TXBTHSKB])
+		dev_change_tx_batching(dev, nla_get_u32(tb[IFLA_TXBTHSKB]));
 	if (tb[IFLA_WEIGHT])
 		dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
 
@@ -1072,6 +1074,9 @@ replay:
 			       nla_len(tb[IFLA_BROADCAST]));
 		if (tb[IFLA_TXQLEN])
 			dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+		if (tb[IFLA_TXBTHSKB])
+			dev_change_tx_batching(dev,
+					       nla_get_u32(tb[IFLA_TXBTHSKB]));
 		if (tb[IFLA_WEIGHT])
 			dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
 		if (tb[IFLA_OPERSTATE])


From krkumar2 at in.ibm.com  Sun Jul 22 02:06:02 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:02 +0530
Subject: [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart
	API, callers
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090602.7787.50560.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/include/net/pkt_sched.h rev2/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-07-20 07:49:28.000000000 +0530
+++ rev2/include/net/pkt_sched.h	2007-07-20 16:09:45.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/net/sched/sch_generic.c rev2/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/sched/sch_generic.c	2007-07-22 12:11:10.000000000 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
@@ -91,18 +93,23 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop skb (or all
+		 * skbs in batching case) when deadloop is detected. Return
+		 * OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(&dev->skb_blist))
+			skb_queue_purge(&dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
 		ret = qdisc_qlen(q);
 	} else {
 		/*
-		 * Another cpu is holding lock, requeue & delay xmits for
-		 * some time.
+		 * Another cpu is holding lock. Requeue skb and delay xmits
+		 * for some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
 		ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * Algorithm to get skb(s) is:
+ *	- Non-batching drivers, or if the batch list is empty and there is
+ *	  at most one skb in the queue: dequeue skb and put it in *skbp to
+ *	  tell the caller to use the single xmit API.
+ *	- Batching drivers where the batch list already contains at least one
+ *	  skb, or if there are multiple skbs in the queue: keep dequeueing
+ *	  skbs up to a limit and set *skbp to NULL to tell the caller to use
+ *	  the multiple xmit API.
+ *
+ * Returns:
+ *	1 - at least one skb is to be sent out, *skbp contains skb or NULL
+ *	    (in case >1 skbs present in blist for batching)
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}
+
+/*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
@@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n
 	int ret;
 
 	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+	if (unlikely(!get_skb(dev, q, blist, &skb)))
 		return 0;
 
 	/*
@@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
-	ret = dev_hard_start_xmit(skb, dev);
+	if (likely(skb))
+		ret = dev_hard_start_xmit(skb, dev);
+	else
+		ret = dev->hard_start_xmit_batch(dev);
 
 	if (!lockless)
 		netif_tx_unlock(dev);
@@ -168,7 +211,7 @@ static inline int qdisc_restart(struct n
 
 	switch (ret) {
 	case NETDEV_TX_OK:
-		/* Driver sent out skb successfully */
+		/* Driver sent out skb (or entire skb_blist) successfully */
 		ret = qdisc_qlen(q);
 		break;
 
@@ -179,8 +222,8 @@ static inline int qdisc_restart(struct n
 
 	default:
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
-		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
-			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
+		if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit())
+			printk(KERN_WARNING " %s: BUG. code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
 		ret = dev_requeue_skb(skb, dev, q);
@@ -190,10 +233,10 @@ static inline int qdisc_restart(struct n
 	return ret;
 }
 
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, blist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
@@ -567,6 +610,13 @@ void dev_deactivate(struct net_device *d
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+
+	if (BATCHING_ON(dev)) {
+		/* Free skbs on batch list */
+		if (!skb_queue_empty(&dev->skb_blist))
+			skb_queue_purge(&dev->skb_blist);
+	}
+
 	spin_unlock_bh(&dev->queue_lock);
 
 	kfree_skb(skb);
diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/dev.c	2007-07-21 23:08:33.000000000 +0530
@@ -1647,7 +1647,7 @@ gso:
 			/* reset queue_mapping to zero */
 			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
-			qdisc_run(dev);
+			qdisc_run(dev, NULL);
 			spin_unlock(&dev->queue_lock);
 
 			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1844,7 +1844,12 @@ static void net_tx_action(struct softirq
 			clear_bit(__LINK_STATE_SCHED, &dev->state);
 
 			if (spin_trylock(&dev->queue_lock)) {
-				qdisc_run(dev);
+				/*
+				 * Try to send out all skbs if batching is
+				 * enabled.
+				 */
+				qdisc_run(dev, BATCHING_ON(dev) ?
+					       &dev->skb_blist : NULL);
 				spin_unlock(&dev->queue_lock);
 			} else {
 				netif_schedule(dev);
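
The choice get_skb() makes between the single-skb and batch paths can be
modelled in user space as below. This is only a sketch under simplifying
assumptions: struct model, enum xmit_path and pick_path() are hypothetical
names, GSO segmentation and dev->gso_skb are ignored, and every dequeued
packet is counted as exactly one batch-list entry.

```c
enum xmit_path { XMIT_NONE, XMIT_SINGLE, XMIT_BATCH };

/* Hypothetical counters standing in for the qdisc and dev->skb_blist. */
struct model {
	int qlen;		/* packets waiting in the qdisc */
	int blist_len;		/* packets already on the batch list */
	int tx_queue_len;	/* upper bound on the batch list length */
	int batching;		/* non-zero when a batch list was passed in */
};

static enum xmit_path pick_path(struct model *m)
{
	int room;

	/* No batch list, or nothing queued up to batch: single xmit API. */
	if (!m->batching || (m->blist_len == 0 && m->qlen <= 1)) {
		if (m->qlen == 0)
			return XMIT_NONE;
		m->qlen--;
		return XMIT_SINGLE;
	}

	/* Otherwise move packets to the batch list until it is full. */
	room = m->tx_queue_len - m->blist_len;
	while (room > 0 && m->qlen > 0) {
		m->qlen--;
		m->blist_len++;
		room--;
	}
	return XMIT_BATCH;
}
```

In this model a queue of five packets with room for three ends up with
three on the batch list and two still queued. A lone packet with an empty
batch list takes the single-skb path even when batching is on.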


From krkumar2 at in.ibm.com  Sun Jul 22 02:06:17 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:17 +0530
Subject: [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes.
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090612.7787.63282.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h rev2/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 16:09:45.000000000 +0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int snum, int tx_index,
+		      struct ipoib_ah *address, u32 qpn);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


From krkumar2 at in.ibm.com  Sun Jul 22 02:06:26 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:26 +0530
Subject: [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090626.7787.25000.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 16:09:45.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
 
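
The initialisation loop above walks the array backwards so that each WR
can be pointed at its successor in a single pass. A user-space sketch of
the same idea follows; struct wr, link_wrs() and chain_len() are
hypothetical names, and the real struct ib_send_wr carries many more
fields.

```c
#include <stddef.h>

/* Hypothetical, stripped-down work request. */
struct wr {
	int id;
	struct wr *next;
};

/* Link wrs[0] -> wrs[1] -> ... -> wrs[n-1] -> NULL by iterating in reverse. */
static void link_wrs(struct wr *wrs, int n)
{
	struct wr *next_wr = NULL;
	int i;

	for (i = n - 1; i >= 0; i--) {
		wrs[i].id = i;
		wrs[i].next = next_wr;	/* successor was set up on the previous pass */
		next_wr = &wrs[i];
	}
}

/* Count the WRs reachable from 'w', as a provider walking the list would. */
static int chain_len(const struct wr *w)
{
	int n = 0;

	for (; w; w = w->next)
		n++;
	return n;
}
```

After link_wrs(w, 4), a walk from w[0] visits four entries and a walk
from w[2] visits two, so a post can start at any slot of the array.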

From krkumar2 at in.ibm.com  Sun Jul 22 02:06:40 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:40 +0530
Subject: [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090640.7787.17578.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 16:09:45.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 16:09:45.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:
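
The CM post_send() in this patch sets tx_wr[0].next to NULL before
ib_post_send() and relinks it afterwards, so that only one WR of the
pre-linked array is posted. Below is a user-space sketch of that
cut-and-restore pattern, generalised to a range of WRs. struct wr,
fake_post() and post_range() are hypothetical; fake_post() merely counts
what a provider would see and stands in for ib_post_send().

```c
#include <stddef.h>

/* Hypothetical, stripped-down work request. */
struct wr {
	struct wr *next;
	int signaled;
};

/* Hypothetical provider: counts the WRs reachable from 'first'. */
static int fake_post(const struct wr *first)
{
	int n = 0;

	for (; first; first = first->next)
		n++;
	return n;
}

/* Post wrs[start .. start+num-1] only, leaving the full chain intact afterwards. */
static int post_range(struct wr *wrs, int start, int num)
{
	struct wr *last = &wrs[start + num - 1];
	struct wr *saved_next = last->next;
	int posted;

	last->signaled = 1;	/* request a completion for the last WR only */
	last->next = NULL;	/* cut the chain so the provider stops here */
	posted = fake_post(&wrs[start]);
	last->signaled = 0;	/* restore the flag and the link for the next post */
	last->next = saved_next;
	return posted;
}
```

Saving last->next instead of hard-coding the successor keeps the helper
correct for the final array slot too, whose next pointer is already NULL.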


From krkumar2 at in.ibm.com  Sun Jul 22 02:06:49 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:49 +0530
Subject: [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-22 00:08:37.000000000 +0530
@@ -242,8 +242,9 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i = 0, num_completions;
+	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 	unsigned int wr_id = wc->wr_id;
-	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	num_completions = wr_id - tx_ring_index + 1;
+	if (num_completions <= 0)
+		num_completions += ipoib_sendq_size;
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+	/*
+	 * Handle skb completions from tx_tail to wr_id. It is possible to
+	 * handle WCs from earlier post_sends (possibly multiple) in this
+	 * iteration as we move from tx_tail to wr_id, since if the last
+	 * WR (which is the one which had a completion request) failed to be
+	 * sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WRs of those earlier
+	 * request(s).
+	 */
+	while (1) {
+		/*
+		 * Could use while (i < num_completions), but it is costly
+		 * since in most cases there is 1 completion, and we end up
+		 * doing an extra "index = (index+1) & (ipoib_sendq_size-1)"
+		 */
+		struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
 
-	dev_kfree_skb_any(tx_req->skb);
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (++i == num_completions)
+			break;
+
+		/* More WC's to handle */
+		tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		netif_wake_queue(dev);
 	}
+
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -340,78 +375,178 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned error and we are trying to send the untried
+ * WR's, in which case start_index will point to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WR's to process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int start_index, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id 	      = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn  = qpn;
-	priv->tx_wr.wr.ud.ah 	      = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	/* Restore send_flags & WR chain */
+	last_wr->send_flags = 0;
+	last_wr->next = next_wr;
+
+	return ret;
 }
 
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr
+ * to pass to the driver.
+ *
+ * Returns :
+ *	- 0 on successful processing of the skb
+ *	- 1 if the skb was freed.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int wr_num,
+		      int tx_ring_index, struct ipoib_ah *address, u32 qpn)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	u64 addr;
+	struct ipoib_tx_buf *tx_req;
 
 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
-		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+		ipoib_warn(priv, "packet len %d (> %d) too long to "
+			   "send, dropping\n",
 			   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
 		++priv->stats.tx_dropped;
 		++priv->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
-		return;
+		return 1;
 	}
 
-	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
+	ipoib_dbg_data(priv, "sending packet, length=%d address=%p "
+		       "qpn=0x%06x\n",
 		       skb->len, address, qpn);
 
 	/*
 	 * We put the skb into the tx_ring _before_ we call post_send()
 	 * because it's entirely possible that the completion handler will
-	 * run before we execute anything after the post_send().  That
+	 * run before we execute anything after the post_send(). That
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
-	tx_req->skb = skb;
-	addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
-				 DMA_TO_DEVICE);
+	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++priv->stats.tx_errors;
 		dev_kfree_skb_any(skb);
-		return;
+		return 1;
 	}
+
+	tx_req = &priv->tx_ring[tx_ring_index];
+	tx_req->skb = skb;
 	tx_req->mapping = addr;
+	priv->tx_sge[wr_num].addr = addr;
+	priv->tx_sge[wr_num].length = skb->len;
+	priv->tx_wr[wr_num].wr_id = tx_ring_index;
+	priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+	priv->tx_wr[wr_num].wr.ud.ah = address->ah;
 
-	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
-			       address->ah, qpn, addr, skb->len))) {
-		ipoib_warn(priv, "post_send failed\n");
-		++priv->stats.tx_errors;
-		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
-		dev_kfree_skb_any(skb);
-	} else {
-		dev->trans_start = jiffies;
+	return 0;
+}
+
+/*
+ * If an skb is passed to this function, it is the single, unprocessed skb
+ * send case. Otherwise if skb is NULL, it means that all skbs are already
+ * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+		struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int start_index = 0;
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+	if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head &
+				     (ipoib_sendq_size - 1), address, qpn))
+		return;
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	/* Send out all the skb's in one post */
+	while (num_skbs) {
+		struct ib_send_wr *bad_wr;
+
+		if (unlikely((post_send(priv, qpn, start_index, num_skbs,
+					&bad_wr)))) {
+			int done;
+
+			/*
+			 * Better error handling can be done here, like free
+			 * all untried skbs if err == -ENOMEM. However at this
+			 * time, we re-try all the skbs, all of which will
+			 * likely fail anyway (unless device finished sending
+			 * some out in the meantime). This is not a regression
+			 * since the earlier code is not doing this either.
+			 */
+			ipoib_warn(priv, "post_send failed\n");
+
+			/* Get #WR's that finished successfully */
+			done = bad_wr - &priv->tx_wr[start_index];
+
+			/* Handle 1 error */
+			priv->stats.tx_errors++;
+			ib_dma_unmap_single(priv->ca,
+				priv->tx_sge[start_index + done].addr,
+				priv->tx_sge[start_index + done].length,
+				DMA_TO_DEVICE);
+
+			/* Handle 'n' successes */
+			if (done) {
+				dev->trans_start = jiffies;
+				address->last_send = priv->tx_head;
+			}
+
+			/* Free failed WR & reset for WC handler to recognize */
+			dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+			priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+			/* Move head to first untried WR */
+			priv->tx_head += (done + 1);
+				/* + 1 for WR that was tried & failed */
+
+			/* Get count of skbs that were not tried */
+			num_skbs -= (done + 1);
+
+			/* Get start index for next iteration */
+			start_index += (done + 1);
+		} else {
+			dev->trans_start = jiffies;
+
+			address->last_send = priv->tx_head;
+			priv->tx_head += num_skbs;
+			num_skbs = 0;
 		}
 	}
+
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+		/*
+		 * Not accurate as some intermediate slots could have been
+		 * freed on error, but no harm - only queue stopped earlier.
+		 */
+		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+		netif_stop_queue(dev);
+		set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	}
 }
 
 static void __ipoib_reap_ah(struct net_device *dev)
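
The wraparound arithmetic used for num_completions in
ipoib_ib_handle_tx_wc() can be checked in isolation. The sketch below is a
user-space model: completions_between() and completions_masked() are
hypothetical names, and ring_size is assumed to be a power of two, as the
"& (ipoib_sendq_size - 1)" masking in the patch implies.

```c
/* Number of ring slots from the tail slot through wr_id, inclusive. */
static int completions_between(unsigned int tx_tail, unsigned int wr_id,
			       unsigned int ring_size)
{
	int tail_index = (int)(tx_tail & (ring_size - 1));
	int n = (int)wr_id - tail_index + 1;

	/* wr_id is behind the tail slot: the range wraps past the ring end. */
	if (n <= 0)
		n += (int)ring_size;
	return n;
}

/* Branch-free form of the same count, relying on unsigned wraparound. */
static int completions_masked(unsigned int tx_tail, unsigned int wr_id,
			      unsigned int ring_size)
{
	return (int)((wr_id - tx_tail) & (ring_size - 1)) + 1;
}
```

For a ring of 8 with the tail at slot 6 and wr_id 1, both forms count the
four slots 6, 7, 0 and 1. With the tail at slot 3 and wr_id 2 they return
8, that is, a completely full ring.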


From krkumar2 at in.ibm.com  Sun Jul 22 02:06:59 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Sun, 22 Jul 2007 14:36:59 +0530
Subject: [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes
	(ipoib_ib.c)
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722090659.7787.47401.sendpatchset@K50wks273871wss.in.ibm.com>

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-22 00:08:28.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, neigh->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 			goto out;
 		}
 
@@ -753,6 +755,175 @@ out:
 	return NETDEV_TX_OK;
 }
 
+#define	XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (num_skbs) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   num_skbs);				\
+			num_skbs = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct sk_buff_head *blist;
+	int max_skbs, num_skbs = 0, tx_ring_index = -1;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+		return NETDEV_TX_LOCKED;
+
+	blist = &dev->skb_blist;
+
+	/*
+	 * Send at most 'max_skbs' skbs. This also prevents the device getting
+	 * full.
+	 */
+	max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+		/*
+		 * From here on, ipoib_send() cannot stop the queue as it
+		 * uses the same initialization as 'max_skbs'. So we can
+		 * optimize to not check for queue stopped for every skb.
+		 */
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_QUEUED_SKBS();
+				ipoib_path_lookup(skb, dev);
+				continue;
+			}
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_QUEUED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+					 */
+					ipoib_put_ah(neigh->ah);
+					list_del(&neigh->list);
+					ipoib_neigh_free(dev, neigh);
+					spin_unlock(&priv->lock);
+					XMIT_QUEUED_SKBS();
+					ipoib_path_lookup(skb, dev);
+					continue;
+				}
+
+				qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				if (neigh != old_neigh || qpn != old_qpn) {
+					/*
+					 * Sending to a different destination
+					 * from earlier skb's - send all
+					 * existing skbs (if any).
+					 */
+					if (tx_ring_index == -1) {
+						/*
+						 * First time, find where to
+						 * store skb.
+						 */
+						tx_ring_index = priv->tx_head &
+							(ipoib_sendq_size - 1);
+					} else {
+						/* Some skbs to send */
+						XMIT_QUEUED_SKBS();
+					}
+					old_neigh = neigh;
+					old_qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				}
+
+				if (ipoib_process_skb(dev, skb, priv, num_skbs,
+						      tx_ring_index, neigh->ah,
+						      qpn))
+					continue;
+
+				num_skbs++;
+
+				/* Queue'd one skb, get index for next skb */
+				if (max_skbs)
+					tx_ring_index = (tx_ring_index + 1) &
+							(ipoib_sendq_size - 1);
+				continue;
+			}
+
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				spin_lock(&priv->lock);
+				__skb_queue_tail(&neigh->queue, skb);
+				spin_unlock(&priv->lock);
+			} else {
+				dev_kfree_skb_any(skb);
+				++priv->stats.tx_dropped;
+				++max_skbs;
+			}
+		} else {
+			struct ipoib_pseudoheader *phdr =
+				(struct ipoib_pseudoheader *) skb->data;
+			skb_pull(skb, sizeof *phdr);
+
+			if (phdr->hwaddr[4] == 0xff) {
+				/* Add in the P_Key for multicast*/
+				phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+				phdr->hwaddr[9] = priv->pkey & 0xff;
+
+				XMIT_QUEUED_SKBS();
+				ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+			} else {
+				/* unicast GID -- should be ARP or RARP reply */
+
+				if ((be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_ARP) &&
+				    (be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_RARP)) {
+					ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+						IPOIB_GID_FMT "\n",
+						skb->dst ? "neigh" : "dst",
+						be16_to_cpup((__be16 *)
+						skb->data),
+						IPOIB_QPN(phdr->hwaddr),
+						IPOIB_GID_RAW_ARG(phdr->hwaddr
+								  + 4));
+					dev_kfree_skb_any(skb);
+					++priv->stats.tx_dropped;
+					++max_skbs;
+					continue;
+				}
+				XMIT_QUEUED_SKBS();
+				unicast_arp_send(skb, dev, phdr);
+			}
+		}
+	}
+
+	/* Send out last packets (if any) */
+	XMIT_QUEUED_SKBS();
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
 static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -898,11 +1069,35 @@ int ipoib_dev_init(struct net_device *de
 
 	/* priv->tx_head & tx_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	/* Allocate tx_sge */
+	priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+			       GFP_KERNEL);
+	if (!priv->tx_sge) {
+		printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
 		goto out_tx_ring_cleanup;
+	}
+
+	/* Allocate tx_wr */
+	priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+			      GFP_KERNEL);
+	if (!priv->tx_wr) {
+		printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
+		goto out_tx_sge_cleanup;
+	}
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_tx_wr_cleanup;
 
 	return 0;
 
+out_tx_wr_cleanup:
+	kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+	kfree(priv->tx_sge);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -930,9 +1125,13 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
+	kfree(priv->tx_sge);
+	kfree(priv->tx_wr);
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+	priv->tx_sge = NULL;
+	priv->tx_wr = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1142,7 @@ static void ipoib_setup(struct net_devic
 	dev->stop 		 = ipoib_stop;
 	dev->change_mtu 	 = ipoib_change_mtu;
 	dev->hard_start_xmit 	 = ipoib_start_xmit;
+	dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
 	dev->get_stats 		 = ipoib_get_stats;
 	dev->tx_timeout 	 = ipoib_timeout;
 	dev->hard_header 	 = ipoib_hard_header;
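
The ring arithmetic that ipoib_start_xmit_frames() relies on (free slots
computed as ipoib_sendq_size - (tx_head - tx_tail), and slot indices
computed by masking with ipoib_sendq_size - 1) can be sketched in
isolation. This is only an illustration, assuming free-running unsigned
counters and a power-of-two ring size; struct tx_ring_state,
tx_ring_free() and tx_ring_slot() are hypothetical names, not part of
the patch.

```c
#include <assert.h>

/* Hypothetical, simplified view of the TX ring counters. */
struct tx_ring_state {
	unsigned int head;	/* WRs posted so far, free-running */
	unsigned int tail;	/* WRs completed so far, free-running */
	unsigned int size;	/* ring size, assumed to be a power of two */
};

/*
 * Free slots in the ring. Unsigned subtraction keeps head - tail
 * correct even after the counters wrap around.
 */
static unsigned int tx_ring_free(const struct tx_ring_state *r)
{
	return r->size - (r->head - r->tail);
}

/* Ring slot for the n-th skb queued after the current head. */
static unsigned int tx_ring_slot(const struct tx_ring_state *r,
				 unsigned int n)
{
	/* Masking only works when size is a power of two. */
	assert(r->size && (r->size & (r->size - 1)) == 0);
	return (r->head + n) & (r->size - 1);
}
```

Bounding the batch by tx_ring_free() up front is what lets the loop
skip a per-skb "queue stopped" check, as the comment in the patch notes.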


From ogerlitz at voltaire.com  Sun Jul 22 02:11:52 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 22 Jul 2007 12:11:52 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A31F58.3010209@voltaire.com>

Yevgeny Kliteynik wrote:

> Please find the attached RFC describing how QoS policy support could be 
> implemented in the OpenFabrics stack.
> Your comments are welcome.

Hi Yevgeny,

Some quick comments from first re-read

1) IPoIB - just to make sure I am on the right page:

the intention is that the QoS params would be per --partition-- and 
hence the IPv4 broadcast multicast group sl, rate etc params would be 
used for each address handle created by this IPoIB device

2) RDMA CM (CMA) based ULPs -

Assuming the rdma cm api would be enhanced for the consumer to
optionally provide the "qos class", why have a dedicated section in the
doc for iSER? There are a bunch of other rdma cm based ULPs (e.g.
Lustre/rNFS/RDS/etc.) which would be able to get QoS through the IB
sys admin's configuration of QoS policy at the SM/SA.

3) RC based ULPs - I was thinking that the SL should be derived from the 
sid AND the pkey, I wonder if the IBTA related annex addresses this.

4) In some cases, the SID to be used is not known in advance:
the canonical example is MPI implementations that
ask the CM to allocate a SID per rank per job, which means that
you want a huge, dynamic bunch of SIDs to be mapped by the SA to the same SL.

In the past my thinking on how to handle this was to change the CM such that
users can ask for a --SID in a range-- and have this range be mapped to
a specific SID in the SM/SA (same here, maybe the IBTA annex says
something regarding that)

Or.


From mst at dev.mellanox.co.il  Sun Jul 22 02:13:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 12:13:26 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default
In-Reply-To: <adaejj24arg.fsf@cisco.com>
References: <20070719112155.GJ24018@mellanox.co.il> <adaejj24arg.fsf@cisco.com>
Message-ID: <20070722091326.GA7800@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IB/mlx4: enable MSI-X by default
> 
>  > -	mlx4_enable_msi_x(dev);
>  > -
>  >  	if (mlx4_cmd_init(dev)) {
>  >  		mlx4_err(dev, "Failed to init command interface, aborting.\n");
>  >  		goto err_free_dev;
>  >  	}
>  >  
>  > +	mlx4_enable_msi_x(dev);
> 
> Why this change?  I don't see anything in mlx4_cmd_init() that seems
> to matter in terms of coming before or after enabling MSI-X.
> 
>  >  	err = mlx4_init_hca(dev);
>  > +	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
>  > +		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
>  > +		dev->flags &= ~MLX4_FLAG_MSI_X;
>  > +		pci_disable_msix(pdev);
>  > +		err = mlx4_init_hca(dev);
>  > +	}
>  > +
>  >  	if (err)
>  >  		goto err_cmd;
>  >  
>  > +	mlx4_enable_msi_x(dev);
>  > +
>  >  	err = mlx4_setup_hca(dev);

You are right. I tried to copy the working mthca code
as closely as possible, but it looks like I made a mistake there.

> Have you actually tested this on a system where MSI-X fails?  Because
> I don't see how it could work-- we don't actually try interrupts until
> mlx4_setup_hca() (in fact we don't even create any EQs until then).
> So I don't see how mlx4_init_hca() could tell if MSI-X is OK...

I only have a box with a buggy PCI-X chipset - I'm not sure there are PCI-Express
chipsets with broken MSI out there.  So while I did test that my patch breaks
nothing, the recovery code was untested.  I will patch in code to simulate
failure before reposting.


-- 
MST


From mst at dev.mellanox.co.il  Sun Jul 22 02:15:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 12:15:44 +0300
Subject: [ofa-general] [TEST] test code to make msi-x fail
Message-ID: <20070722091544.GB7800@mellanox.co.il>

Here's a patch I used to test MSI-X failure recovery code
in mlx4 and mthca. Posted in case it's useful to someone.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-19 09:36:11.000000000 +0300
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-22 12:02:17.000000000 +0300
@@ -436,7 +436,8 @@ static irqreturn_t mthca_tavor_msi_x_int
 	struct mthca_eq  *eq  = eq_ptr;
 	struct mthca_dev *dev = eq->dev;
 
-	mthca_eq_int(dev, eq);
+	if (0)
+		mthca_eq_int(dev, eq);
 	tavor_set_eq_ci(dev, eq, eq->cons_index);
 	tavor_eq_req_not(dev, eq->eqn);
 
Index: linux-2.6/drivers/net/mlx4/eq.c
===================================================================
--- linux-2.6.orig/drivers/net/mlx4/eq.c	2007-07-19 09:30:35.000000000 +0300
+++ linux-2.6/drivers/net/mlx4/eq.c	2007-07-22 12:01:35.000000000 +0300
@@ -273,7 +273,8 @@ static irqreturn_t mlx4_msi_x_interrupt(
 	struct mlx4_eq  *eq  = eq_ptr;
 	struct mlx4_dev *dev = eq->dev;
 
-	mlx4_eq_int(dev, eq);
+	if (0)
+		mlx4_eq_int(dev, eq);
 
 	/* MSI-X vectors always belong to us */
 	return IRQ_HANDLED;

-- 
MST


From mst at dev.mellanox.co.il  Sun Jul 22 02:19:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 12:19:44 +0300
Subject: [ofa-general] [PATCH V2] IB/mlx4: enable MSI-X by default
Message-ID: <20070722091944.GC7800@mellanox.co.il>

Recover from MSI-X errors by automatically falling back to regular interrupts,
instead of asking the user to do this manually.  This makes it possible to
enable MSI-X by default, and will make it possible to get rid of the msi_x
module option in the future.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

While the previous version worked fine in the good case,
it turns out it didn't actually recover from errors as intended.
This version was tested by patching the MSI-X handler routine.

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 4dc9dc1..b01d543 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 
 #ifdef CONFIG_PCI_MSI
 
-static int msi_x;
+static int msi_x = 1;
 module_param(msi_x, int, 0444);
 MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
 
@@ -602,10 +602,7 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev)
 		mlx4_err(dev, "NOP command failed to generate interrupt "
 			 "(IRQ %d), aborting.\n",
 			 priv->eq_table.eq[MLX4_EQ_ASYNC].irq);
-		if (dev->flags & MLX4_FLAG_MSI_X)
-			mlx4_err(dev, "Try again with MSI-X disabled.\n");
-		else
-			mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n");
+		mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n");
 
 		goto err_cmd_poll;
 	}
@@ -803,8 +800,6 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev,
 		goto err_free_dev;
 	}
 
-	mlx4_enable_msi_x(dev);
-
 	if (mlx4_cmd_init(dev)) {
 		mlx4_err(dev, "Failed to init command interface, aborting.\n");
 		goto err_free_dev;
@@ -814,7 +809,16 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev,
 	if (err)
 		goto err_cmd;
 
+	mlx4_enable_msi_x(dev);
+
 	err = mlx4_setup_hca(dev);
+	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
+		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
+		dev->flags &= ~MLX4_FLAG_MSI_X;
+		pci_disable_msix(pdev);
+		err = mlx4_setup_hca(dev);
+	}
+
 	if (err)
 		goto err_close;
 
@@ -838,15 +842,15 @@ err_cleanup:
 	mlx4_cleanup_uar_table(dev);
 
 err_close:
+	if (dev->flags & MLX4_FLAG_MSI_X)
+		pci_disable_msix(pdev);
+
 	mlx4_close_hca(dev);
 
 err_cmd:
 	mlx4_cmd_cleanup(dev);
 
 err_free_dev:
-	if (dev->flags & MLX4_FLAG_MSI_X)
-		pci_disable_msix(pdev);
-
 	kfree(priv);
 
 err_release_bar2:

-- 
MST


From mst at dev.mellanox.co.il  Sun Jul 22 02:41:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 12:41:36 +0300
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722094136.GD7800@mellanox.co.il>

> +	/*
> +	 * Handle skbs completion from tx_tail to wr_id. It is possible to
> +	 * handle WC's from earlier post_sends (possible multiple) in this
> +	 * iteration as we move from tx_tail to wr_id, since if the last
> +	 * WR (which is the one which had a completion request) failed to be
> +	 * sent for any of those earlier request(s), no completion
> +	 * notification is generated for successful WR's of those earlier
> +	 * request(s).
> +	 */

AFAIK a signalled WR will always generate a completion.
What am I missing?

> 
> +			/*
> +			 * Better error handling can be done here, like free
> +			 * all untried skbs if err == -ENOMEM. However at this
> +			 * time, we re-try all the skbs, all of which will
> +			 * likely fail anyway (unless device finished sending
> +			 * some out in the meantime). This is not a regression
> +			 * since the earlier code is not doing this either.
> +			 */

Are you retrying posting skbs? Why is this a good idea?
AFAIK, earlier code did not retry posting WRs at all.
The comment seems to imply that post send fails as a result of SQ overflow -
do you see SQ overflow errors in your testing?
AFAIK, IPoIB should never overflow the SQ.

-- 
MST


From vlad at lists.openfabrics.org  Sun Jul 22 02:44:31 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 22 Jul 2007 02:44:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070722-0200 daily build status
Message-ID: <20070722094431.17323E60825@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.22
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From sashak at voltaire.com  Sun Jul 22 03:22:09 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 22 Jul 2007 13:22:09 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <863azhrlm1.fsf@sw053.lab.mtl.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
Message-ID: <20070722102209.GR16597@sashak.voltaire.com>

Hi Eitan,

On 09:36 Sun 22 Jul     , Eitan Zahavi wrote:
> Hi Sasha
> 
> I am running some tests manually and apparently it looks like 
> I found a bug. Here is the sequence of things:
> 1. SM sweeps the fabric assign LFTs  
> 2. I manually modify some LFTs (single entry now marked UNREACHABLE
> 3. I force some switch change bit to 1 or issue kill -HUP
> 4. The SM reports SUBNET UP
> 5. The modified LFT entry is still UNREACHABLE and the path is broken

Right, in most cases (unless OpenSM has its own changes in the same LFT
block) OpenSM will refer to its own LFT image for the "need to update"
decision, so _manual_ changes will not trigger a new update. Rerunning
OpenSM should help, however.
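
The behaviour described above can be modelled with a small sketch. This
is not OpenSM code: struct sw_model and update_block() are hypothetical,
and the only thing taken from the description is that the "need to
update" decision compares new routing against the SM's own cached image
rather than against what the switch currently holds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE 64	/* entries carried by one LFT block */

/* Hypothetical model of one LFT block of one switch. */
struct sw_model {
	uint8_t sm_image[LFT_BLOCK_SIZE];	/* what the SM believes it wrote */
	uint8_t hw_table[LFT_BLOCK_SIZE];	/* what the switch really holds */
};

/*
 * Push new_block to the switch only if it differs from the SM's cached
 * image. Returns 1 if a write was issued, 0 if the block was skipped.
 */
static int update_block(struct sw_model *sw, const uint8_t *new_block)
{
	assert(sw != NULL && new_block != NULL);

	if (!memcmp(sw->sm_image, new_block, LFT_BLOCK_SIZE))
		return 0;	/* image unchanged: hw_table is never consulted */

	memcpy(sw->hw_table, new_block, LFT_BLOCK_SIZE);
	memcpy(sw->sm_image, new_block, LFT_BLOCK_SIZE);
	return 1;
}
```

In this model a manual change to hw_table survives any sweep that
computes the same routing, and is repaired only when the SM itself
changes an entry in the same block, which matches the exception noted
in parentheses above.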

> It looks to me some optimization of routing does not fully reroute
> unless some condition is met - but that condition does not include the
> above triggers listed in step 3.

Rereading all of the fabric's LFTs by default seems to be too expensive
an operation. If it is a real requirement, this could be
enforced manually, for example when kill -HUP is used. Thoughts?

Sasha


From eitan at mellanox.co.il  Sun Jul 22 04:59:23 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 22 Jul 2007 14:59:23 +0300
Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT
	re-configuration
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>

Hi Sasha

Let's assume someone has reset a switch on the fabric.
What would cause the SM to re-assign the LFT of that switch?
I assumed that there is a mechanism to do that.

Anyway, kill -HUP should flush out the state and restart from scratch.


Eitan

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Sunday, July 22, 2007 1:22 PM
> To: Eitan Zahavi
> Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration
> 
> Hi Eitan,
> 
> On 09:36 Sun 22 Jul     , Eitan Zahavi wrote:
> > Hi Sasha
> > 
> > I am running some tests manually and apparently it looks 
> like I found 
> > a bug. Here is the sequence of things:
> > 1. SM sweeps the fabric assign LFTs
> > 2. I manually modify some LFTs (single entry now marked 
> UNREACHABLE 3. 
> > I force some switch change bit to 1 or issue kill -HUP 4. The SM 
> > reports SUBNET UP 5. The modified LFT entry is still 
> UNREACHABLE and 
> > the path is broken
> 
> Right, in most cases (unless OpenSM has its own changes in 
> the same LFT
> block) OpenSM will refer its own LFT image for  "need to update"
> decision, so _manual_ changes will not trigger new update. 
> Rerunning OpenSM should help however.
> 
> > It looks to me some optimization of routing does not fully reroute 
> > unless some condition is met - but that condition does not 
> include the 
> > above triggers listed in step 3.
> 
> Rereading all fabrics LFTs by default seems to be too 
> expensive operations. At least by default, if it is real 
> requirement this could be enforced manually, for example when 
> kill -HUP is used. Thoughts?
> 
> Sasha
> 


From hadi at cyberus.ca  Sun Jul 22 05:51:09 2007
From: hadi at cyberus.ca (jamal)
Date: Sun, 22 Jul 2007 08:51:09 -0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <OF0F9581FE.472D94A6-ON65257320.001FC787-65257320.00238386@in.ibm.com>
References: <OF0F9581FE.472D94A6-ON65257320.001FC787-65257320.00238386@in.ibm.com>
Message-ID: <1185108670.5192.122.camel@localhost>

KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:

> Batching need not be useful for every hardware. 

My concern is there is no consistency in results. I see improvements on
something where you say you don't. You see improvement in something that
Evgeniy doesn't, etc.
There are many knobs and we need at a minimum to find out how those
play.

> I don't quite agree with that approach, eg, if the blist is empty and the
> driver tells there is space for one packet, you will add one packet and
> the driver sends it out and the device is stopped (with potentially lot of
> skbs on dev->q). Then no packets are added till the queue is enabled, at
> which time a flood of skbs will be processed increasing latency and holding
> lock for a single longer duration. My approach will mitigate holding lock
> for longer times and instead send skbs to the device as long as we are
> within the limits.

Just as a side note, _I do have this feature_ in the pktgen piece.
In fact, you can tell pktgen what that bound is as opposed to hard
coding it (look at the pktgen "batchl" parameter). I have not found it to be
useful experimentally; actually, i should say i could not "zone" in on a
useful value by experimenting and it was better to turn it off.
I never tried adding it to the qdisc path - but this is something i
could try and as i said it may prove useful.

> Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find
> it),
> I feel having prep will not help as no other cpu can execute the queue/xmit
> code anyway (E1000 is also a LLTX driver).

My experiments show it is useful (in a very visible way using pktgen)
for e1000 to have the prep() interface.

>  Other driver that hold tx lock could get improvement however.

So you do see the value then with non LLTX drivers, right? ;->
The value is also there in LLTX drivers even if just in formatting an skb
ready for transmit. If this is not clear i could do a much longer
writeup on my thought evolution towards adding prep().

> I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
> performance is affected. 

Absolutely. And regardless of whether it's beneficial or not for e1000,
there's a clear benefit in tg3, for example.

> If it helps, I guess you could send me a patch to
> add that and I can also test it to see what the effect is. I didn't add it
> since IPoIB wouldn't be able to exploit it (unless someone is kind enough
> to show me how to).

Such core code should not just be focussed on IPOIB.

> I think the code I have is ready and stable, 

I am not sure how to interpret that - are you saying all-is-good and we
should just push your code in? It sounds disingenuous but i may have
misread you.

cheers,
jamal


From pradeeps at linux.vnet.ibm.com  Sun Jul 22 07:13:11 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sun, 22 Jul 2007 07:13:11 -0700
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <20070722072043.GB7188@mellanox.co.il>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
	<20070722072043.GB7188@mellanox.co.il>
Message-ID: <46A365F7.7090001@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>
>> Michael S. Tsirkin wrote:
>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>>>  	attr.recv_cq = priv->cq;
>>>>  	attr.srq = priv->cm.srq;
>>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>>>> -	attr.cap.max_recv_wr = 1;
>>>> +	attr.cap.max_recv_wr = 0;
>>>>  	attr.cap.max_send_sge = 1;
>>>> -	attr.cap.max_recv_sge = 1;
>>>> +	attr.cap.max_recv_sge = 0;
>>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>>>  	attr.qp_type = IB_QPT_RC;
>>>>  	attr.send_cq = cq;
>>> I don't see how does this fix things.
>>> This line 
>>>>  	attr.srq = priv->cm.srq;
>>> connected the TX QP to SRQ, making it possible to get packets on this QP.
>>> But if cm.srq is NULL, and a remote sends a packet on this connection,
>>> the connection will get closed. Which is a quality of implementation issue.
>>>
>> When the QP numbers are exchanged correctly, then it should not receive
>> a packet on this QP in the first place.
> 
> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
> packets. We don't do this currently but we might in the future.

I presume you mean passive side for receiving. Let us revisit the issue when there
is a need. At this point it is not relevant.

Pradeep


From mst at dev.mellanox.co.il  Sun Jul 22 07:25:02 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 17:25:02 +0300
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <46A365F7.7090001@linux.vnet.ibm.com>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
	<20070722072043.GB7188@mellanox.co.il>
	<46A365F7.7090001@linux.vnet.ibm.com>
Message-ID: <20070722142502.GA8102@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: NOSRQ misc patch [PATCH V1]
> 
> Michael S. Tsirkin wrote:
> >> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >> Subject: Re: NOSRQ misc patch [PATCH V1]
> >>
> >> Michael S. Tsirkin wrote:
> >>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
> >>>>  	attr.recv_cq = priv->cq;
> >>>>  	attr.srq = priv->cm.srq;
> >>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
> >>>> -	attr.cap.max_recv_wr = 1;
> >>>> +	attr.cap.max_recv_wr = 0;
> >>>>  	attr.cap.max_send_sge = 1;
> >>>> -	attr.cap.max_recv_sge = 1;
> >>>> +	attr.cap.max_recv_sge = 0;
> >>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> >>>>  	attr.qp_type = IB_QPT_RC;
> >>>>  	attr.send_cq = cq;
> >>> I don't see how does this fix things.
> >>> This line 
> >>>>  	attr.srq = priv->cm.srq;
> >>> connected the TX QP to SRQ, making it possible to get packets on this QP.
> >>> But if cm.srq is NULL, and a remote sends a packet on this connection,
> >>> the connection will get closed. Which is a quality of implementation issue.
> >>>
> >> When the QP numbers are exchanged correctly, then it should not receive
> >> a packet on this QP in the first place.
> > 
> > Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
> > packets. We don't do this currently but we might in the future.
> 
> I presume you mean passive side for receiving.

A passive side is the one that gets a REQ (look in IB spec section 12.9.6).
Under IPoIB passive side can perform post send on the QP created.
To make this work, I connect the QP to the SRQ on the active side:
>         attr.srq = priv->cm.srq;

However, with your patch, priv->cm.srq might be NULL, which
means that the QP won't be attached to SRQ. This is
a quality of implementation issue that your patch is introducing.

-- 
MST


From monisonlists at gmail.com  Sun Jul 22 07:49:27 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Sun, 22 Jul 2007 17:49:27 +0300
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking
 for a P_Key in the table
Message-ID: <46A36E77.5020307@gmail.com>

IPoIB turns on the P_Key membership bit of limited membership P_Keys
when creating a child interface. After that IPoIB looks for the full
membership P_Key in the table to make the interface "RUNNING". This
patch fixes the pkey lookup in order to match full and partial membership
keys that belong to the same partition.

 device.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: infiniband/drivers/infiniband/core/device.c
===================================================================
--- infiniband.orig/drivers/infiniband/core/device.c	2007-07-08 12:45:07.000000000 +0300
+++ infiniband/drivers/infiniband/core/device.c	2007-07-22 17:43:32.440829619 +0300
@@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic
 		if (ret)
 			return ret;
 
-		if (pkey == tmp_pkey) {
+		if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
 			*index = i;
 			return 0;
 		}
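
As an illustration of the comparison this patch introduces, here is a
minimal standalone sketch. It assumes only what the patch shows, namely
that the low 15 bits (mask 0x7fff) identify the partition and the top
bit is the membership bit. pkey_same_partition() and find_pkey_index()
are hypothetical helpers standing in for the loop inside ib_find_pkey().

```c
#include <assert.h>
#include <stdint.h>

#define PKEY_BASE_MASK 0x7fffu	/* low 15 bits: the partition itself */

/* Nonzero when both P_Keys name the same partition, full or limited. */
static int pkey_same_partition(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/*
 * Return the index of the first table entry in the same partition as
 * pkey, or -1 if there is none.
 */
static int find_pkey_index(const uint16_t *table, int len, uint16_t pkey)
{
	int i;

	assert(len >= 0);

	for (i = 0; i < len; ++i)
		if (pkey_same_partition(table[i], pkey))
			return i;
	return -1;
}
```

With this comparison, a lookup for 0x8001 matches a table entry of
0x0001, which the exact comparison (pkey == tmp_pkey) would miss.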


From kaber at trash.net  Sun Jul 22 10:03:01 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:03:01 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <OF206CB1B7.31D1C39B-ON6525731F.00254E4C-6525731F.00261F1A@in.ibm.com>
References: <OF206CB1B7.31D1C39B-ON6525731F.00254E4C-6525731F.00261F1A@in.ibm.com>
Message-ID: <46A38DC5.4040800@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 11:46:36 PM:
> 
>>The check for tx_queue_len is wrong though,
>>its only a default which can be overriden and some qdiscs don't
>>care for it at all.
> 
> 
> I think it should not matter whether qdiscs use this or not, or even if it
> is modified (unless it is made zero in which case this breaks). The
> intention behind this check is to make sure that not more than tx_queue_len
> skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
> the blist can become too large and breaks the idea of tx_queue_len. Is that
> a good justification ?


It's a good justification, but on second thought the entire idea of
a single queue after qdiscs that is refilled independently of
transmission times etc. makes me worry a bit. By changing the
timing you're effectively changing the qdisc's behaviour, at least
in some cases. SFQ is a good example, but I believe it affects most
work-conserving qdiscs. Think of this situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets
alternating between flow 1 and 2, then 50 packets of flow 2.

With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:

100 packets of flow 1
100 packets of flow 2

So SFQ has no effect at all. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as many packets as the driver can take and having the driver
stop the queue afterwards.
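
The two wire orderings described above can be reproduced with a toy
model. This is only a sketch under stated assumptions: ToyFairQueue is a
plain per-flow round-robin stand-in (not the real SFQ algorithm), and
run() with its pull_everything flag is a hypothetical name used to model
the extra FIFO behind the qdisc.

```python
from collections import deque


class ToyFairQueue:
    """Per-flow queues served round-robin (illustrative stand-in for SFQ)."""

    def __init__(self):
        self.flows = {}        # flow id -> deque of queued packets
        self.active = deque()  # round-robin order of backlogged flows

    def enqueue(self, flow, count):
        q = self.flows.setdefault(flow, deque())
        if not q:
            self.active.append(flow)
        q.extend([flow] * count)

    def dequeue(self):
        if not self.active:
            return None
        flow = self.active.popleft()
        q = self.flows[flow]
        pkt = q.popleft()
        if q:  # still backlogged: goes to the back of the round
            self.active.append(flow)
        return pkt


def run(pull_everything):
    """Replay the scenario; return the flow id of each packet on the wire."""
    qdisc, fifo, wire = ToyFairQueue(), deque(), []

    def pull_all():
        # Models the extra queue: drain the qdisc as soon as packets arrive.
        pkt = qdisc.dequeue()
        while pkt is not None:
            fifo.append(pkt)
            pkt = qdisc.dequeue()

    def send(n):
        for _ in range(n):
            pkt = fifo.popleft() if fifo else qdisc.dequeue()
            if pkt is None:
                break
            wire.append(pkt)

    qdisc.enqueue(1, 100)
    if pull_everything:
        pull_all()
    send(50)
    qdisc.enqueue(2, 100)
    if pull_everything:
        pull_all()
    send(150)
    return wire
```

With pull_everything=False the model interleaves the two flows once
flow 2 arrives; with pull_everything=True the order is decided by the
FIFO alone, which is the effect being worried about here.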


From kaber at trash.net  Sun Jul 22 10:06:51 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:06:51 +0200
Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
In-Reply-To: <20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <46A38EAB.6050300@trash.net>

Krishna Kumar wrote:
> @@ -472,6 +474,9 @@ struct net_device
>  	void			*priv;	/* pointer to private data	*/
>  	int			(*hard_start_xmit) (struct sk_buff *skb,
>  						    struct net_device *dev);
> +	int			(*hard_start_xmit_batch) (struct net_device
> +							  *dev);
> +


Is this function really needed? Can't you just call hard_start_xmit with
a NULL skb and have the driver use dev->blist?
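
A rough illustration of that suggestion, as a toy model rather than
kernel code: ToyDevice, its sent list and the packet-count return value
are all hypothetical, and None stands in for a NULL skb.

```python
from collections import deque


class ToyDevice:
    """Hypothetical stand-in for a net_device with a batch list."""

    def __init__(self):
        self.blist = deque()  # packets queued for batched transmit
        self.sent = []        # what the "hardware" has transmitted


def hard_start_xmit(skb, dev):
    """Single entry point: a real skb is sent directly, None drains dev.blist.

    Returns the number of packets handed to the device (an assumption of
    this sketch; the real hook returns a status code).
    """
    if skb is not None:
        dev.sent.append(skb)
        return 1
    count = 0
    while dev.blist:
        dev.sent.append(dev.blist.popleft())
        count += 1
    return count
```

The point of the sketch is only that one hook can cover both cases, so
a separate hard_start_xmit_batch pointer may not be required.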

>  	/* These may be needed for future network-power-down code. */
>  	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
>  
> @@ -582,6 +587,8 @@ struct net_device
>  #define	NETDEV_ALIGN		32
>  #define	NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)
>  
> +#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
> +
>  static inline void *netdev_priv(const struct net_device *dev)
>  {
>  	return dev->priv;
> @@ -832,6 +839,8 @@ extern int		dev_set_mac_address(struct n
>  					    struct sockaddr *);
>  extern int		dev_hard_start_xmit(struct sk_buff *skb,
>  					    struct net_device *dev);
> +extern int		dev_add_skb_to_blist(struct sk_buff *skb,
> +					     struct net_device *dev);


Again, function signatures should be introduced in the same patch
that contains the function. Splitting by file doesn't make sense.


From kaber at trash.net  Sun Jul 22 10:10:37 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:10:37 +0200
Subject: [ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
In-Reply-To: <20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <46A38F8D.6080109@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> --- org/include/linux/if_link.h	2007-07-20 16:33:35.000000000 +0530
> +++ rev2/include/linux/if_link.h	2007-07-20 16:35:08.000000000 +0530
> @@ -78,6 +78,8 @@ enum
>  	IFLA_LINKMODE,
>  	IFLA_LINKINFO,
>  #define IFLA_LINKINFO IFLA_LINKINFO
> +	IFLA_TXBTHSKB,		/* Driver support for Batch'd skbs */
> +#define IFLA_TXBTHSKB IFLA_TXBTHSKB


Ughh what a name :) I prefer pronounceable names since they are
much easier to remember and don't need comments explaining
what they mean.

But I actually think offering just an ethtool interface would
be better, at least for now.


From sashak at voltaire.com  Sun Jul 22 10:40:48 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 22 Jul 2007 20:40:48 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
Message-ID: <20070722174048.GO27878@sashak.voltaire.com>

On 14:59 Sun 22 Jul     , Eitan Zahavi wrote:
> Hi Sasha
> 
> Let's assume someone has reset a switch on the fabric.
> What would cause the SM to re-assign the LFT of that switch?

OpenSM will sweep and drop this switch, and when the switch comes back
it will be initialized again. But if the reset was too fast (relative to
discovery), we can be in trouble (and maybe not only with LFTs).

> I assumed that there is a mechanism to do that.

Not for "fast" switch reboot.

Hmm, I think we could try to detect this case by comparing
SwitchInfo:LinearFDBTop with the current p_sw->max_lid_ho, or even by
seeing that PortInfo:LID is not set. Something like below:


diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 5b2b19e..62c072f 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -112,6 +112,7 @@ typedef struct _osm_switch
 	osm_fwd_tbl_t				fwd_tbl;
 	osm_mcast_tbl_t				mcast_tbl;
 	uint32_t				discovery_count;
+	unsigned				update_ft;
 	void					*priv;
 } osm_switch_t;
 /*
@@ -152,6 +153,10 @@ typedef struct _osm_switch
 *		during the current fabric sweep.  This number is reset
 *		to zero at the start of a sweep.
 *
+*	update_ft
+*		When set fwd tables will be updated regardless to entry
+*		values locally stored in fwd tables images
+*
 * SEE ALSO
 *	Switch object
 *********/
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index adece65..8bbbcac 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port(
       break;
     }
   }
+  else if (port_num == 0 && p_node->sw &&
+           (!p_pi->base_lid || !p_pi->master_sm_base_lid))
+    p_node->sw->update_ft = 1;
 
   /*
     Update the PortInfo attribute.
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index b44a3ba..03516ae 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table(
        osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ;
        block_id_ho++ )
   {
-    if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
+    if (!p_sw->update_ft &&
+        !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
       continue;
 
     if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
@@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table(
     }
   }
 
+  p_sw->update_ft = 0;
   OSM_LOG_EXIT( p_mgr->p_log );
 }
 

BTW what do you think is the best way to detect switch power up? I
didn't really find a strong requirement for power-up initialization of
any suitable component.
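
For readers following the patch above, the effect of the update_ft flag
on the block-skipping loop can be sketched as a toy model. The function
name blocks_to_send and the use of Python byte strings for the LFT
images are assumptions of this sketch; only the 64-entry block size and
the "skip if identical unless update_ft is set" rule come from the
patch.

```python
BLOCK = 64  # LFT entries per block, matching memcmp(..., 64) in the patch


def blocks_to_send(cached_lft, new_lft, update_ft):
    """Return the indices of the blocks that would be written to the switch.

    cached_lft is the locally stored image, new_lft the freshly computed
    table. A block is skipped only when it is identical in both images
    and update_ft is not set.
    """
    if len(cached_lft) != len(new_lft):
        raise ValueError("LFT images must have the same length")
    send = []
    for start in range(0, len(new_lft), BLOCK):
        same = cached_lft[start:start + BLOCK] == new_lft[start:start + BLOCK]
        if same and not update_ft:
            continue
        send.append(start // BLOCK)
    return send
```

With update_ft clear, a switch that silently lost its tables would get
nothing rewritten as long as the local image matches; with it set, every
block is pushed again.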

> Anyway, kill -HUP should flush out the state and restart from scratch.

Thinking more about it, I'm not sure. A similar flush will be required
for other "stored" components like pkey, sl2vl tables etc. So it is more
than just a "regular" heavy sweep; another signal or option could be
used for this, but OTOH it becomes very close to restarting OpenSM.

Sasha

> 
> 
> Eitan
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> > Sent: Sunday, July 22, 2007 1:22 PM
> > To: Eitan Zahavi
> > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration
> > 
> > Hi Eitan,
> > 
> > On 09:36 Sun 22 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha
> > > 
> > > I am running some tests manually and apparently it looks 
> > like I found 
> > > a bug. Here is the sequence of things:
> > > 1. SM sweeps the fabric assign LFTs
> > > 2. I manually modify some LFTs (single entry now marked 
> > UNREACHABLE 3. 
> > > I force some switch change bit to 1 or issue kill -HUP 4. The SM 
> > > reports SUBNET UP 5. The modified LFT entry is still 
> > UNREACHABLE and 
> > > the path is broken
> > 
> > Right, in most cases (unless OpenSM has its own changes in 
> > the same LFT
> > block) OpenSM will refer its own LFT image for  "need to update"
> > decision, so _manual_ changes will not trigger new update. 
> > Rerunning OpenSM should help however.
> > 
> > > It looks to me some optimization of routing does not fully reroute 
> > > unless some condition is met - but that condition does not 
> > include the 
> > > above triggers listed in step 3.
> > 
> > Rereading all fabrics LFTs by default seems to be too 
> > expensive operations. At least by default, if it is real 
> > requirement this could be enforced manually, for example when 
> > kill -HUP is used. Thoughts?
> > 
> > Sasha
> > 


From pradeeps at linux.vnet.ibm.com  Sun Jul 22 11:39:00 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sun, 22 Jul 2007 11:39:00 -0700
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <20070722142502.GA8102@mellanox.co.il>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
	<20070722072043.GB7188@mellanox.co.il>
	<46A365F7.7090001@linux.vnet.ibm.com>
	<20070722142502.GA8102@mellanox.co.il>
Message-ID: <46A3A444.5050802@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>
>> Michael S. Tsirkin wrote:
>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>>>
>>>> Michael S. Tsirkin wrote:
>>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>>>>>  	attr.recv_cq = priv->cq;
>>>>>>  	attr.srq = priv->cm.srq;
>>>>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>>>>>> -	attr.cap.max_recv_wr = 1;
>>>>>> +	attr.cap.max_recv_wr = 0;
>>>>>>  	attr.cap.max_send_sge = 1;
>>>>>> -	attr.cap.max_recv_sge = 1;
>>>>>> +	attr.cap.max_recv_sge = 0;
>>>>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>>>>>  	attr.qp_type = IB_QPT_RC;
>>>>>>  	attr.send_cq = cq;
>>>>> I don't see how does this fix things.
>>>>> This line 
>>>>>>  	attr.srq = priv->cm.srq;
>>>>> connected the TX QP to SRQ, making it possible to get packets on this QP.
>>>>> But if cm.srq is NULL, and a remote sends a packet on this connection,
>>>>> the connection will get closed. Which is a quality of implementation issue.
>>>>>
>>>> When the QP numbers are exchanged correctly, then it should not receive
>>>> a packet on this QP in the first place.
>>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
>>> packets. We don't do this currently but we might in the future.
>> I presume you mean passive side for receiving.
> 
> A passive side is the one that gets a REQ (look in IB spec section 12.9.6).
> Under IPoIB passive side can perform post send on the QP created.
> To make this work, I connect the QP to the SRQ on the active side:
>>         attr.srq = priv->cm.srq;
> 
> However, with your patch, priv->cm.srq might be NULL, which
> means that the QP won't be attached to SRQ. This is
> a quality of implementation issue that your patch is introducing.
> 

I do not understand - for one you mention transmitting packets and, on
the other hand, you mention SRQ. Are you hinting at Shared Send Queues
(which may be in the future, as you state)?

I have already tested the series of NOSRQ patches for interoperability between 
IBM and Mellanox adapters and it works. I do not see the quality of implementation
issues that you keep referring to.

I believe this should not pose issues for merging this immediately into 2.6.23. 
If you have ideas that can be implemented in the future, we can discuss that 
but outside the context of this patch.

Pradeep


From mst at dev.mellanox.co.il  Sun Jul 22 14:02:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jul 2007 00:02:55 +0300
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <46A3A444.5050802@linux.vnet.ibm.com>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
	<20070722072043.GB7188@mellanox.co.il>
	<46A365F7.7090001@linux.vnet.ibm.com>
	<20070722142502.GA8102@mellanox.co.il>
	<46A3A444.5050802@linux.vnet.ibm.com>
Message-ID: <20070722210255.GA25023@mellanox.co.il>

> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: NOSRQ misc patch [PATCH V1]
> 
> Michael S. Tsirkin wrote:
> >> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >> Subject: Re: NOSRQ misc patch [PATCH V1]
> >>
> >> Michael S. Tsirkin wrote:
> >>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>>> Subject: Re: NOSRQ misc patch [PATCH V1]
> >>>>
> >>>> Michael S. Tsirkin wrote:
> >>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
> >>>>>>  	attr.recv_cq = priv->cq;
> >>>>>>  	attr.srq = priv->cm.srq;
> >>>>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
> >>>>>> -	attr.cap.max_recv_wr = 1;
> >>>>>> +	attr.cap.max_recv_wr = 0;
> >>>>>>  	attr.cap.max_send_sge = 1;
> >>>>>> -	attr.cap.max_recv_sge = 1;
> >>>>>> +	attr.cap.max_recv_sge = 0;
> >>>>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> >>>>>>  	attr.qp_type = IB_QPT_RC;
> >>>>>>  	attr.send_cq = cq;
> >>>>> I don't see how does this fix things.
> >>>>> This line 
> >>>>>>  	attr.srq = priv->cm.srq;
> >>>>> connected the TX QP to SRQ, making it possible to get packets on this QP.
> >>>>> But if cm.srq is NULL, and a remote sends a packet on this connection,
> >>>>> the connection will get closed. Which is a quality of implementation issue.
> >>>>>
> >>>> When the QP numbers are exchanged correctly, then it should not receive
> >>>> a packet on this QP in the first place.
> >>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
> >>> packets. We don't do this currently but we might in the future.
> >> I presume you mean passive side for receiving.
> > 
> > A passive side is the one that gets a REQ (look in IB spec section 12.9.6).
> > Under IPoIB passive side can perform post send on the QP created.
> > To make this work, I connect the QP to the SRQ on the active side:
> >>         attr.srq = priv->cm.srq;
> > 
> > However, with your patch, priv->cm.srq might be NULL, which
> > means that the QP won't be attached to SRQ. This is
> > a quality of implementation issue that your patch is introducing.
> > 
> 
> I do not understand -for one you mention transmitting packets and, on the
> other hand you mention SRQ.
> I have already tested the series of NOSRQ patches for interoperability between 
> IBM and Mellanox adapters and it works. I do not see the quality of implementation
> issues that you keep referring to.

Do you understand why this line is there?
         attr.srq = priv->cm.srq;

I'll try to explain.

The snippet above creates a QP (that will then be connected to remote side).
According to IPoIB CM RFC, once the connection is set up, and if the remote
wants to send us some packets, it could send them over this
existing connection.

And with the current code, this would work correctly, because
all RC QPs we create are connected to the common SRQ and a common CQ,
thus such packets get receive WCs and get sent up the stack.

But with your patch, priv->cm.srq == NULL so you create RC QPs that
are not connected to SRQ, and you never post any receive WRs,
so if the remote sends even a single packet on this QP, the QP
will transfer to error state.

This is a regression: QPs are supposed to have receive WRs
preposted. If you consider e.g. TCP, it's easy to imagine that
the packet the remote was sending was an ACK, so it won't
retry - until we destroy the connection, create a new one,
resend the packet - and an ACK will kill the QP again.

This just happens to never occur for you because you have a recent Linux
kernel on both sides of the link, and because Linux is currently not
smart enough to reuse an existing connection - so it creates a new one
and your bug is hidden from view.

-- 
MST


From sashak at voltaire.com  Sun Jul 22 14:48:09 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 00:48:09 +0300
Subject: [ofa-general] [PATCH] opensm: set PortInfo:LinkSpeed in link_mgr
	only
Message-ID: <20070722214809.GP27878@sashak.voltaire.com>


PortInfo:LinkSpeed setup is performed (in accordance with link_speed
option value) in link_mgr and not in lid_mgr.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_lid_mgr.c |   16 ----------------
 1 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 79a0ea8..f1f4707 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1108,22 +1108,6 @@ __osm_lid_mgr_set_physp_pi(
                 sizeof(p_pi->link_width_enabled) ))
       send_set = TRUE;
 
-    if ( p_mgr->p_subn->opt.force_link_speed )
-    {
-      if ( p_mgr->p_subn->opt.force_link_speed == 15 )  /* LinkSpeedSupported */
-      {
-        if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
-          ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
-        else
-          ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
-      }
-      else
-        ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed );
-      if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
-                  sizeof(p_pi->link_speed) ))
-        send_set = TRUE;
-    }
-
     /* M_KeyProtectBits are always zero */
     p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc;
     /* Check to see if the value we are setting is different than
-- 
1.5.3.rc2.29.gc4640f


From sashak at voltaire.com  Sun Jul 22 14:51:54 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 00:51:54 +0300
Subject: [ofa-general] [PATCH] management/gen_chlog.sh: simple ChangeLog
	generator
Message-ID: <20070722215154.GQ27878@sashak.voltaire.com>


This gen_chlog.sh script generates a ChangeLog (from git logs) for the
specified subdirectory, or for the whole tree if "." is used. It
supports ChangeLog and spec file formats.

The script can be used during tarball generation by make.dist or
'make dist'.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 gen_chlog.sh |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 68 insertions(+), 0 deletions(-)
 create mode 100755 gen_chlog.sh

diff --git a/gen_chlog.sh b/gen_chlog.sh
new file mode 100755
index 0000000..9d60081
--- /dev/null
+++ b/gen_chlog.sh
@@ -0,0 +1,68 @@
+#!/bin/sh
+
+usage()
+{
+	echo "Usage: $0 [--spec] <target>"
+	exit 2
+}
+
+test -z "$1" && usage
+
+if [ "$1" = "--spec" ] ; then
+	spec_format=1
+	shift
+	test -z "$1" && usage
+fi
+
+TARGET=$1
+
+GIT_DIR=`git-rev-parse --git-dir 2>/dev/null`
+
+test -z "$GIT_DIR" && usage
+
+
+export GIT_DIR
+export GIT_PAGER=""
+export PAGER=""
+
+
+mkchlog()
+{
+	target=$1
+	format=$2
+
+	prev_tag=""
+
+	for tag in `git-tag -l $target` ; do
+		obj=`git-cat-file tag $tag | awk '/^object /{print $2}'`
+		base=`git-merge-base $obj HEAD`
+		if [ -z "$base" -o "$base" != $obj ] ; then
+			continue
+		fi
+		all_vers="$prev_tag$tag $all_vers"
+		prev_tag=$tag..
+	done
+
+	if [ -z "$prev_tag" ] ; then
+		all_vers=HEAD
+	else
+		all_vers="${prev_tag}HEAD $all_vers"
+	fi
+
+	for ver in $all_vers ; do
+		ver_name=`echo $ver | sed -e 's/^.*\.\.//'`
+		echo "* Version: $ver_name"
+		echo ""
+		git-log --no-merges "${format}" $ver -- $target
+		prev_t=$tag..
+	done
+}
+
+
+if [ -z "$spec_format" ] ; then
+	mkchlog $TARGET --pretty=format:"commit %H%n%ad %an%n%n    %s%n"
+else
+	echo "%changelog"
+	mkchlog $TARGET --pretty=format:"- %ad %an: %s"
+	echo ""
+fi
-- 
1.5.3.rc2.29.gc4640f


From sashak at voltaire.com  Sun Jul 22 15:14:55 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 01:14:55 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer to
	opensm-coding-style.txt
Message-ID: <20070722221455.GR27878@sashak.voltaire.com>


This updates the script according to recent doc/opensm-coding-style.txt
(in short K&R, tabs, etc.).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_indent |   57 +++------------------------------------------
 1 files changed, 4 insertions(+), 53 deletions(-)

diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent
index bed2ba1..621184b 100755
--- a/opensm/opensm/osm_indent
+++ b/opensm/opensm/osm_indent
@@ -1,6 +1,6 @@
 #!/bin/bash
 #
-# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
 # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
 # Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
 #
@@ -40,56 +40,7 @@
 #  Environment:
 #  	Linux User Mode
 # 
-#  $Revision: 1.4 $
-# 
-#
-# This is the indent format used for OpenSM.
-#
-# format the source code according to the ACD standard
-# -bad	Blank line after declarations
-# -bap	Blank line after Procedures
-# -bbb	Blank line before block comments
-# -nbbo	Break after Boolean operator
-# -bl	Break after if line
-# -bli0 Indent for braces is 0
-# -bls	Break after struct declarations
-# -cbi0	Case break indent 0
-# -ci3	Continue indent 3 spaces
-# -cli0	Case label indent 0 spaces
-# -ncs	No space after cast operator
-# -hnl	Honor existing newlines on long lines
-# -i3	Substitute indent with 3 spaces
-# -npcs	No space after procedure calls
-# -prs	Space after parenthesis
-# -nsai	No space after if keyword - removed
-# -nsaw	No space after while keyword - removed
-# -sc	Put * at left of comments in a block comment style
-# -nsob	Don't swallow unnecessary blank lines
-# -ts3	Tab size is 3
-# -psl	Type of procedure return in a separate line
-# -bfda	Function declaration arguments in a separate line.
-# -nut   No tabs as we allow spaces
-#
-#########################################################################
-
-# indent the world
-for sourcefile in $*; do
-    if test -f "$sourcefile"; then
-        # first, string DOS style linefeeds
-        perl -piW -e's/\x0D//' "$sourcefile"
-        echo Processing $sourcefile
-        indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \
-                -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile
-
-        rm ${sourcefile}W
+# This is the indent format used for OpenSM (similar to one used in
+# linux/scripts/Lindent).
 
-        # the -bb also affect the first line in each file - so clean it up
-        if test `head -1 $sourcefile | egrep -v '^$' | wc -l` = 0; then
-            echo Cleaning up first empty line of $sourcefile
-            awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W
-            mv -f ${sourcefile}W $sourcefile
-        fi
-    else
-        echo Could not find file:$sourcefile
-    fi
-done
+indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@"
-- 
1.5.3.rc2.29.gc4640f


From pradeeps at linux.vnet.ibm.com  Sun Jul 22 17:06:10 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Sun, 22 Jul 2007 17:06:10 -0700
Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1]
In-Reply-To: <20070722210255.GA25023@mellanox.co.il>
References: <46A28CB7.1040509@linux.vnet.ibm.com>
	<20070722060557.GB20438@mellanox.co.il>
	<46A3043A.3030200@linux.vnet.ibm.com>
	<20070722072043.GB7188@mellanox.co.il>
	<46A365F7.7090001@linux.vnet.ibm.com>
	<20070722142502.GA8102@mellanox.co.il>
	<46A3A444.5050802@linux.vnet.ibm.com>
	<20070722210255.GA25023@mellanox.co.il>
Message-ID: <46A3F0F2.7080500@linux.vnet.ibm.com>

Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>
>> Michael S. Tsirkin wrote:
>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>>>
>>>> Michael S. Tsirkin wrote:
>>>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>>>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>>>>>
>>>>>> Michael S. Tsirkin wrote:
>>>>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>>>>>>>  	attr.recv_cq = priv->cq;
>>>>>>>>  	attr.srq = priv->cm.srq;
>>>>>>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>>>>>>>> -	attr.cap.max_recv_wr = 1;
>>>>>>>> +	attr.cap.max_recv_wr = 0;
>>>>>>>>  	attr.cap.max_send_sge = 1;
>>>>>>>> -	attr.cap.max_recv_sge = 1;
>>>>>>>> +	attr.cap.max_recv_sge = 0;
>>>>>>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>>>>>>>  	attr.qp_type = IB_QPT_RC;
>>>>>>>>  	attr.send_cq = cq;
>>>>>>> I don't see how does this fix things.
>>>>>>> This line 
>>>>>>>>  	attr.srq = priv->cm.srq;
>>>>>>> connected the TX QP to SRQ, making it possible to get packets on this QP.
>>>>>>> But if cm.srq is NULL, and a remote sends a packet on this connection,
>>>>>>> the connection will get closed. Which is a quality of implementation issue.
>>>>>>>
>>>>>> When the QP numbers are exchanged correctly, then it should not receive
>>>>>> a packet on this QP in the first place.
>>>>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
>>>>> packets. We don't do this currently but we might in the future.
>>>> I presume you mean passive side for receiving.
>>> A passive side is the one that gets a REQ (look in IB spec section 12.9.6).
>>> Under IPoIB passive side can perform post send on the QP created.
>>> To make this work, I connect the QP to the SRQ on the active side:
>>>>         attr.srq = priv->cm.srq;
>>> However, with your patch, priv->cm.srq might be NULL, which
>>> means that the QP won't be attached to SRQ. This is
>>> a quality of implementation issue that your patch is introducing.
>>>
>> I do not understand -for one you mention transmitting packets and, on the
>> other hand you mention SRQ.
>> I have already tested the series of NOSRQ patches for interoperability between 
>> IBM and Mellanox adapters and it works. I do not see the quality of implementation
>> issues that you keep referring to.
> 
> Do you understand why is this line there?
>          attr.srq = priv->cm.srq;
> 
> I'll try to explain.
> 
> The snippet above creates a QP (that will then be connected to remote side).
> According to IPoIB CM RFC, once the connection is set up, and if the remote
> wants to send us some packets, it could send them over this
> existing connection.
> 
> And with current code, this would work correctly, because
> all RC QP we create are connected to the common SRQ and a common CQ,
> thus such packets get receive WCs and get sent up the stack.
> 
> But with your patch, priv->cm.srq == NULL so you create RC QPs that
> are not connected to SRQ, and you never post any receive WRs,
> so if the remote sends even a single packet on this QP, the QP
> will transfer to error state.
> 
I do not post any WRs because I do not expect any packets to be received.
If it does receive any packets, an RNR will be returned (as expected).
The queues in the queue pairs are not being used symmetrically, that is all.
Also, priv->cm.srq is set to NULL only in the non-SRQ case. The SRQ case
is as before.

> This is a regression: QPs are supposed to have receive WRs
> preposed. If you consider e.g. TCP, it's easy to imagine that
> the packet remote was sending was an ACK, so it won't
> retry - until we destroy the connection, create a new one,
> resend the packet - and an ACK will kill the QP again.
> 

There is nothing about asymmetric usage of the Queues. And hence
I see no problems. If in TCP one sends to the wrong port, the 
packet gets dropped. This is similar to that.

> This just happens to never occur for you because you have recent linux kernel
> on both sides of the link, and because linux is currently not smart enough
> to reuse an existing connection - so it creates a new one and
> your bug is hidden from view.
> 

This code has been there since day one. I do not understand the reasoning for raising 
issues on the eve of the acceptance of this patch. Why bring it up now?

Pradeep


From sashak at voltaire.com  Sun Jul 22 17:20:11 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 03:20:11 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <20070723002010.GU27878@sashak.voltaire.com>

Hi Yevgeny,

Some initial comments.

On 01:07 Sun 22 Jul     , Yevgeny Kliteynik wrote:
>  Hi All
> 
>  Please find the attached RFC describing how QoS policy support could be 
>  implemented in the OpenFabrics stack.
>  Your comments are welcome.
> 
>  -- Yevgeny
> 
>                RFC: OpenFabrics Enhancements for QoS Support
>               ===============================================
> 
>  Authors: . Eitan Zahavi <eitan at mellanox.co.il>
>  Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
>  Date: .... Jul 2007.
>  Revision:  0.2
> 
>  Table of contents:
>  1. Overview
>  2. Architecture
>  3. Supported Policy
>  4. CMA functionality
>  5. IPoIB functionality
>  6. SDP functionality
>  7. SRP functionality
>  8. iSER functionality
>  9. OpenSM functionality
> 
>  1. Overview
>  ------------
>  Quality of Service requirements stem from the realization of I/O 
>  consolidation
>  over IB network: As multiple applications and ULPs share the same fabric, 
>  means
>  to control their use of the network resources are becoming a must. The basic
>  need is to differentiate the service levels provided to different traffic 
>  flows,
>  such that a policy could be enforced and control each flow utilization of 
>  the
>  fabric resources.
> 
>  IBTA specification defined several hardware features and management 
>  interfaces
>  to support QoS:
>  * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
>  * Arbitration between traffic of different VLs is performed by a 2 priority
>    levels weighted round robin arbiter. The arbiter is programmable with
>    a sequence of (VL, weight) pairs and maximal number of high priority 
>  credits
>    to be processed before low priority is served
>  * Packets carry class of service marking in the range 0 to 15 in their
>    header SL field
>  * Each switch can map the incoming packet by its SL to a particular output
>    VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
>  * The Subnet Administrator controls each communication flow parameters
>    by providing them as a response to Path Record (PR) or MultiPathRecord 
>  (MPR)
>    queries
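
To make the SL-to-VL bullet concrete, here is a toy lookup in the shape
VL = SL-to-VL-MAP(in-port, out-port, SL). It is only a sketch: ToySwitch,
make_sl2vl and the modulo spreading policy are illustrative assumptions,
not anything defined by the RFC or by OpenSM.

```python
NUM_SLS = 16  # SL values 0..15, as stated in the list above


def make_sl2vl(num_vls):
    """Spread the 16 SLs over num_vls VLs (an arbitrary example policy)."""
    if num_vls < 1:
        raise ValueError("need at least one VL")
    return [sl % num_vls for sl in range(NUM_SLS)]


class ToySwitch:
    """Holds one 16-entry SL-to-VL table per (in_port, out_port) pair."""

    def __init__(self):
        self.sl2vl = {}

    def set_map(self, in_port, out_port, table):
        if len(table) != NUM_SLS:
            raise ValueError("table must have one entry per SL")
        self.sl2vl[(in_port, out_port)] = list(table)

    def output_vl(self, in_port, out_port, sl):
        # Programmable table lookup keyed by port pair, indexed by SL.
        return self.sl2vl[(in_port, out_port)][sl]
```

A policy engine would fill these tables per port pair; the packet path
then only performs the lookup.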
> 
>  The IB QoS features provide the means to implement a DiffServ like 
>  architecture.
>  DiffServ architecture (IETF RFC2474 2475) is widely used today in highly 
>  dynamic
>  fabrics.
> 
>  This proposal provides the detailed functional definition for the various
>  software elements that are required to enable a DiffServ like architecture 
>  over
>  the OpenFabrics software stack.
> 
> 
> 
>  2. Architecture
>  ----------------
>  This proposal split the QoS functionality between the SM/SA, CMA and the 
>  various
>  ULPS. We take the "chronology approach" to describe how the overall system
>  works:
> 
>  2.1. The network manager (human) provides a set of rules (policy) that 
>  defines
>  how the network is being configured and how its resources are split to 
>  different
>  QoS-Levels. The policy also define how to decide which QoS-Level each
>  application or ULP or service use.
> 
>  2.2. The SM analyzes the provided policy to see if it is realizable and 
>  performs
>  the necessary fabric setup. The SM may continuously monitor the policy and 
>  adapt
>  to changes in it. Part of this policy defines the default QoS-Level of each
>  partition. The SA is being enhanced to match the requested Source, 
>  Destination,
>  QoS-Class, Service-ID (and optionally SL and priority) against the policy. 
>  So
>  clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also
>  enhanced to support setting up partitions with appropriate IPoIB broadcast
>  group. This broadcast group carries its QoS attributes: SL, MTU and
>  RATE.
> 
>  2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the
>  multicast group which forms the broadcast group of this partition.
> 
>  2.4. MPI which provides non IB based connection management should be 
>  configured
>  to run using hard coded SLs. It uses these SLs for every QP being opened.
> 
>  2.5. ULPs that use the CM interface (like SRP) should have their own
>  pre-assigned Service-ID and use it while obtaining PR/MPR for establishing
>  connections. The SA receiving the PR/MPR should match it against the policy
>  and return the appropriate PR/MPR including SL, MTU and RATE.
> 
>  2.6. ULPs and programs using CMA to establish an RC connection should
>  provide the CMA with the target IP and Service-ID. Some of the ULPs might
>  also provide a QoS-Class (e.g. for SDP sockets that are provided the TOS
>  socket option). The CMA should then use the provided Service-ID and optional
>  QoS-Class and pass them in the PR/MPR request. The resulting PR/MPR should
>  be used for configuring the connection QP.
> 
>  PathRecord and MultiPathRecord enhancement for QoS:
>  As mentioned above, the PathRecord and MultiPathRecord attributes should be
>  enhanced to carry the Service-ID (a 64-bit value), which has been
>  standardized by the IBTA. A new field, QoS-Class, is also provided.
>  A new capability bit should describe the SM QoS support in the SA class port
>  info. This approach provides an easy migration path for existing access
>  layers and ULPs by not introducing a new set of PR/MPR attributes.
> 
> 
>  3. Supported Policy
>  --------------------
> 
>  The QoS policy supported by this proposal is divided into 4 sub sections:
> 
>  I) Port Group: a set of CAs, Routers or Switches that share the same
>  settings. A port group might be a partition defined by the partition manager
>  policy in terms of GUIDs. Future implementations might provide support for
>  NodeDescription-based definition of port groups.

Isn't it better to have port group definitions in a separate file, so that
groups could be shared with other OpenSM components (as discussed)? Even
if such group sharing is not high-priority functionality, this should
save us from redoing things later.

>  II) Fabric Setup:
>  Defines how the SL2VL and VLArb tables should be set up. This policy
>  definition assumes the computation of overall end-to-end network behavior
>  should be performed outside of OpenSM.
> 
>  III) QoS-Levels Definition:
>  This section defines the possible sets of parameters for QoS that a client
>  might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
>  Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
> 
>  IV) Matching Rules:
>  A list of rules that match an incoming PR/MPR request to a QoS-Level. The
>  rules are processed in order, such that the first match is applied. Each
>  rule is built out of a set of match expressions which should all match for
>  the rule to apply. The matching expressions are defined for the following
>  fields:
>  ** SRC and DST to lists of port groups
>  ** Service-ID to a list of Service-ID or Service-ID ranges
>  ** QoS-Class to a list of QoS-Class values or ranges
> 
>  QoS Policy file syntax
> 
>  * Empty lines are ignored
>  * Leading and trailing blanks, as well as empty lines, are ignored, so the
>    indentation in the example is just for better readability
>  * Comments are started with the pound sign (#) and terminated by EOL
>  * Comments may appear only in a separate line

Why? What is wrong with:

	port-name: vs1/HCA-1/P1   # my best port
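
Accepting trailing comments looks like a small change on the parsing side.
A minimal sketch, assuming '#' never appears inside a keyword or value
(the function names are hypothetical and not taken from the OpenSM
sources):

```python
def strip_comment(line):
    """Drop a '#' comment (whole-line or trailing) and surrounding blanks."""
    # Assumption: '#' is never part of a keyword or a value.
    return line.split("#", 1)[0].strip()


def parse_line(line):
    """Return (keyword, value), or None for empty or comment-only lines."""
    text = strip_comment(line)
    if not text:
        return None
    # Section keywords such as "port-groups" carry no value.
    keyword, _, value = text.partition(":")
    return keyword.strip(), value.strip()


if __name__ == "__main__":
    print(parse_line("\tport-name: vs1/HCA-1/P1   # my best port"))
```

With this, a comment on its own line and a comment after a value are
handled by the same code path.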

>  * Keywords that denote section/subsection start have matching closing
>    keywords
>  * Any keyword should be the first non-blank in the line
> 
>  QoS Policy file example
> 
>      # Port Groups define sets of ports to be used later in the settings
>      port-groups
>          # using port GUIDs
>          port-group
>              name: Storage
>              # "use" is just a description that is used for logging.
>              #  Other than that, it is just a commentary
>              use: our SRP storage targets
>              port-guid: 0x1000000000000001
>              port-guid: 0x1000000000000002
>          end-port-group
> 
>          port-group
>              name: Virtual Servers
>              use: node desc and IB port num
>              # The syntax of the port name is as follows:
>              # "hostname/CA-num/Pnum".
>              # "hostname" and "CA-num" are compared to the first 2 words of
>              # NodeDescription, and "Pnum" is a port number on that node.
>              port-name: vs1/HCA-1/P1
>              port-name: vs3/HCA-1/P1
>              port-name: vs3/HCA-2/P2

What about wildcarding here, like vs1/*/* or just vs1?

>          end-port-group
> 
>          # using partitions defined in the partition policy
>          port-group
>              name: Group for Partition 1
>              use: default settings
>              partition: Part1
>          end-port-group
> 
>          # using node types CA|ROUTER|SWITCH

Probably also ALL (for all ports), SELF (for SM port)?

>          port-group
>              name: Routers
>              use: all routers
>              node-type: ROUTER
>          end-port-group
> 
>      end-port-groups

I agree that the proposed syntax is more human-readable than pure
XML, but wouldn't something like this be more user-friendly?

Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ;

, or

Storage "Free Text description" { 0x10001, 0x10002, 0x10003 };

, or

Storage "Free Text description": ROUTERS, CAS ;


> 
>      qos-setup
> 
>          # define all types of VLArb tables. The length of the tables should
>          # match the physically supported tables by their target ports
>          vlarb-tables
>              # scope defines the exact ports the VLArb tables apply to
>              vlarb-scope
>                  # defining VLArb tables on all the ports that belong to
>                  # port group 'Storage', and on all the ports connected
>                  # to ports of port group 'Storage'
>                  group: Storage

So "group" is only for ports that belong to 'Storage'?

>                  # "across" means all the ports that are connected to ports
>                  # that belong to the specified port group
>                  across: Storage
>                  # VLArb table holds VL and weight pairs
>                  vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                  vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                  vl-high-limit: 10
>              end-vlarb-scope
>              # There can be several scopes
>          end-vlarb-tables
> 
>          sl2vl-tables
>              # Scope defines the exact devices and in/out ports tables apply
>              # to.
>              # Note: if the same port is matching several rules the *FIRST*
>              # one applies.
>              sl2vl-scope
>                  # SL2VL tables are organized as SL2VL(in-port,out-port)
>                  # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                  # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                  #
>                  # The following example specifies that all the SL2VL tables
>                  # entries should be defined for all the ports of group
>                  # Part1:
>                  group: Part1
>                  from: *
>                  to: *
>                  # SL2VL table has to have 16 values at max - one for each
>                  # SL.
>                  # If the user specifies less than 16 values, all the missing
>                  # VL values will be implicitly set to 0
>                  sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>              end-sl2vl-scope
> 
>              sl2vl-scope
>                  # "across-to" is a combination of "across" keyword
>                  # (definition can be found in VLArb tables section) and
>                  # "to" keyword.
>                  # "across: PortGroupName" refers to all the ports that are
>                  # connected to ports that belong to PortGroupName.
>                  #
>                  # Example of "across-to" usage:
>                  #   A user has a set of 'special' nodes (e.g. storage
>                  #   nodes), and all the traffic to these nodes has to get
>                  #   specific VL. The solution is to define port group
>                  #   (e.g. "Storage") that will include all the ports of
>                  #   these nodes, and then to configure SL2VL tables on all
>                  #   the switch ports that are connected to the Storage
>                  #   port group by specifying "across-to: Storage".
>                  #
>                  across-to: Storage2
>                  # Similar to "across-to", "across-from" is a combination of
>                  # "across" and "from" keywords
>                  across-from: Storage1
>                  sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>              end-sl2vl-scope
>          end-sl2vl-tables
> 
>      end-qos-setup
> 
> 
>      qos-levels
> 
>          # the first one is just setting SL
>          qos-level
>              use: for the lowest priority communication
>              sl: 15
>              packet-life: 16
>          end-qos-level
>          # the second sets SL and QoS Class
>          qos-level
>              use: low latency best bandwidth
>              sl: 0
>          end-qos-level
>          # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path
>          # Bits
>          qos-level
>              use: just an example
>              sl: 0
>              mtu-limit: 1
>              rate-limit: 1
>              packet-life: 12
>              # Path Bits can be used e.g. to provide different routes
>              # through the subnet to a particular port
>              path-bits: 2,4,8-32
>          end-qos-level
> 
>      end-qos-levels
> 
> 
>      # Match rules are scanned in a first-fit manner (like firewall rules
>      # table)
>      qos-match-rules
> 
>          # matching by single criteria: class (list of values and ranges)
>          qos-match-rule
>              # just a description
>              use: low latency by class 7-9 or 11
>              qos-class: 7-9,11
>              # number of qos-level to apply to the matching PR/MPR
>              qos-level-sn: 1

Isn't it better and less error prone to match qos_level by name and not
by sequential number?

>          end-qos-match-rule
>          # show matching by destination group AND service-ids
>          qos-match-rule
>              use: Storage targets connection
>              destination: Storage
>              service-id: 22,4719-5000
>              qos-level-sn: 2
>          end-qos-match-rule
>          # show matching by source group only
>          qos-match-rule
>              use: bla bla
>              source: Storage
>              qos-level-sn: 3
>          end-qos-match-rule
> 
>      end-qos-match-rules
> 
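
The example above uses the same "values and ranges" list syntax for
qos-class, service-id and path-bits. A minimal sketch of how such a list
could be parsed, assuming ranges are inclusive and that both decimal and
0x-prefixed values are accepted (both points are assumptions, and the
function names are hypothetical):

```python
def parse_value_list(text):
    """Parse e.g. '7-9,11' into inclusive ranges [(7, 9), (11, 11)]."""
    ranges = []
    for item in text.split(","):
        item = item.strip()
        low, sep, high = item.partition("-")
        # int(x, 0) accepts decimal ("22") as well as hex ("0x16") input.
        low_val = int(low, 0)
        high_val = int(high, 0) if sep else low_val
        if high_val < low_val:
            raise ValueError("descending range: %r" % item)
        ranges.append((low_val, high_val))
    return ranges


def in_value_list(value, ranges):
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


if __name__ == "__main__":
    service_ids = parse_value_list("22,4719-5000")
    print(service_ids, in_value_list(4800, service_ids))
```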
> 
>  4. IPoIB
>  ---------
> 
>  IPoIB already queries the SA for its broadcast group information. The
>  additional functionality required is for IPoIB to provide the broadcast
>  group SL, MTU, and RATE in every following PathRecord query performed when
>  a new UDAV is needed by IPoIB.
>  We could assign a special Service-ID for IPoIB use, but since all
>  communication on the same IPoIB interface shares the same QoS-Level without
>  the ability to differentiate it by target service, we can ignore it for
>  simplicity.
> 
>  5. CMA features
>  ----------------
> 
>  The CMA interface supports Service-ID through the notion of port space as a
>  prefix to the port_num which is part of the sockaddr provided to
>  rdma_resolve_addr(). What is missing is the explicit request for a
>  QoS-Class that should allow the ULP (like SDP) to propagate a specific
>  request for a class of service. A mechanism for providing the QoS-Class is
>  available in the IPv6 address, so we could use that address field. Another
>  option is to implement a special connection options API for CMA.
> 
>  Missing functionality in the CMA is the usage of the provided QoS-Class and
>  Service-ID in the sent PR/MPR. When a response is obtained it is an existing
>  requirement for the CMA to use the PR/MPR from the response in setting up
>  the QP address vector.
> 
> 
>  6. SDP
>  -------
> 
>  SDP uses CMA for building its connections.
>  The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
>  holding the remote TCP/IP Port Number to connect to.
>  SDP might be provided with SO_PRIORITY socket option. In that case the value
>  provided should be sent to the CMA as the TClass option of that connection.
> 
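
The SDP Service-ID layout quoted above (0x000000000001PPPP) amounts to a
fixed prefix OR-ed with the 16-bit port. A small sketch of that arithmetic;
the constant and function names are made up for illustration:

```python
# 0x000000000001PPPP with PPPP = 0000, as given in the RFC text above.
SDP_SERVICE_ID_PREFIX = 0x0000000000010000


def sdp_service_id(tcp_port):
    """Build the 64-bit SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("port out of range: %r" % tcp_port)
    return SDP_SERVICE_ID_PREFIX | tcp_port


if __name__ == "__main__":
    # Port 5001 is 0x1389.
    print("0x%016x" % sdp_service_id(5001))
```

A policy rule that should catch all SDP traffic would then match the
Service-ID range 0x10000-0x1ffff.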
>  7. SRP
>  -------
> 
>  The current SRP implementation uses its own CM callbacks (not CMA). So SRP
>  should fill in the Service-ID in the PR/MPR by itself and use that
>  information in setting up the QP. The T10 SRP standard defines the SRP
>  Service-ID to be defined by the SRP target I/O Controller (but they should
>  also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
>  by the I/O Controller in the ServiceEntries DMA attribute and should be used
>  in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
> 
>  8. iSER
>  --------
>  iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
>  should be TBD.
> 
> 
>  9. OpenSM features
>  -------------------
>  The QoS related functionality to be provided by OpenSM can be split into two
>  main parts:
> 
>  3.1. Fabric Setup
>  During fabric initialization the SM should parse the policy and apply its
>  settings to the discovered fabric elements. The following actions should be
>  performed:
>  * Parsing of policy
>  * Node Group identification. Warning should be provided for each node not
>    specified but found.
>  * SL2VL settings validation should be checked:
>    + A warning will be provided if there are no matching targets for the
>      SL2VL setting statement.
>    + An error message will be printed to the log file if an invalid setting
>      is found. A setting is invalid if it refers to:
>      - Non-existing port numbers of the target devices
>      - Unsupported VLs for the target device. In the latter case the map to
>        non-existing VLs should be replaced with VL15, i.e. packets will be
>        dropped.

I'm not sure this is optimal. We could have a well-documented or even
configurable mapping rule instead; then this would not limit devices with
higher capabilities.

>  * SL2VL setting is to be performed
>  * VL Arbitration table settings should be validated according to the
>    following rules:
>    + A warning will be provided if there are no matching targets for the
>      setting statement
>    + An error will be provided if the port number exceeds the target ports
>    + An error will be generated if the table length exceeds device
>      capabilities

Ditto.

>    + A warning will be generated if the table quote a VL that is not
>      supported by the target device

What is "table quote" here?

>  * VL Arbitration tables will be set on the appropriate targets
> 
>  3.2. PR/MPR query handling:
>  OpenSM should be able to enforce the provided policy on client request.
>  The overall flow for such requests is: first the request is matched against
>  the defined match rules such that the target QoS-Level definition is found.
>  Given the QoS-Level, a path(s) search is performed with the given
>  restrictions imposed by that level. The following two sections describe
>  these steps.
> 
>  How Service-ID is carried in the PathRecord and MultiPathRecord attributes
>  is now standardized by the IBTA.
> 
> 
>  3.2.1. Matching rule search:
>  A rule is "matching" a PR/MPR request using the following criteria:
>  * Matching rules provide values in a list of either single values or ranges
>    of values. A PR/MPR field is "matching" the rule field if it is explicitly
>    noted in the list of values or is one of the values covered by a range
>    included in the field values list.
>  * Only PR/MPR fields that have their component mask bit set should be
>    compared.
>  * For a rule to be "matching" a PR/MPR request, all the rule fields should
>    be "matching" their PR/MPR fields, such that a PR/MPR request that does
>    not have a component mask bit set for one of the rule-defined fields
>    cannot match that rule.
>  * A PR/MPR request that has a component mask bit set for one of the fields
>    that is not defined by the rule can match the rule.

Aren't the last two too restrictive? The SA can just filter out paths in
the response to match the rest of the rule. No?
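
For reference, the matching semantics as currently written in the RFC text
can be sketched as below. This is only an illustration under simplifying
assumptions: a request is modelled as a dict that holds just the fields
whose component mask bit is set, every rule field is a list of inclusive
(low, high) ranges, and port-group membership for Source/Destination is
left out. All names are hypothetical.

```python
def rule_matches(rule, request):
    """Apply the four bullets above to one rule and one PR/MPR request."""
    for field, ranges in rule.items():
        # A rule field whose component mask bit is not set in the request
        # (i.e. the key is absent) prevents the rule from matching.
        if field not in request:
            return False
        value = request[field]
        if not any(low <= value <= high for low, high in ranges):
            return False
    # Request fields that the rule does not define are simply ignored.
    return True


def find_qos_level(rules, request, default_level):
    """First-fit search; fall back to the default QoS-Level."""
    for rule, level in rules:
        if rule_matches(rule, request):
            return level
    return default_level


if __name__ == "__main__":
    rules = [
        ({"qos_class": [(7, 9), (11, 11)]}, 1),
        ({"service_id": [(22, 22), (4719, 5000)]}, 2),
    ]
    print(find_qos_level(rules, {"service_id": 4800, "qos_class": 3}, 0))
```

Relaxing the third bullet would mean dropping the "field not in request"
check and filtering the returned paths instead.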

>  The algorithm to be used for searching for a rule match might be as simple
>  as a sequential search through all rules, or enhanced for better
>  performance. The semantics of every rule field and its matching PR/MPR field
>  are described below:
>  * Source: the SGID or SLID should be part of this group
>  * Destination: the DGID or DLID should be part of this group
>  * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>    SM-Key field) is matching any of this rule Service-IDs
>  * TClass: check if the PR/MPR TClass field is matching
> 
>  3.2.2 PR/MPR response generation:
>  The QoS-Level pointed to by the first rule that matches the PR/MPR request
>  should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
>  Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
>  matching the query.

Where should this default be defined?

Sasha


>  The efficient algorithm for finding paths that meet the QoS-Level criteria
>  is beyond the scope of this RFC and left for the implementer to provide.
>  However, the criteria by which the paths match the QoS-Level are described
>  below:
> 
>  * SL: The paths found should all use the given SL. For that sake the PR/MPR
>    algorithm should traverse the path from source to destination only through
>    ports that carry a valid VL (not VL15) by the SL2VL map (should consider
>    input and output ports and SL).
>  * MTU-Limit: The resulting paths' MTU should not exceed the given MTU-Limit
>  * Rate-Limit: The resulting paths' RATE should not exceed the given
>    RATE-Limit (rate limit is given in units of link BW = Width*Speed
>    according to IBTA Specification Vol-1 table-205 p-901 l-24).
>  * Path-Bits: define the target LID lowest bits (number of bits defined by
>    the target port PortInfo.LMC field). The path should traverse the LFT
>    using the target port LID with the path-bits set.
>  * QoS-Class: should be returned in the result PR/MPR. When routing is going
>    to be supported by OpenSM we might use this field in selecting the target
>    router too in a TBD way.
> 
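
The Path-Bits criterion above can be illustrated with a short sketch. It
assumes the usual LMC convention that a port owns 2^LMC consecutive LIDs
starting at a base LID whose low LMC bits are zero; the function name is
hypothetical:

```python
def lid_with_path_bits(base_lid, lmc, path_bits):
    """Return the destination LID selected by the given path bits."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC out of range: %r" % lmc)
    mask = (1 << lmc) - 1
    if base_lid & mask:
        raise ValueError("base LID is not aligned to 2^LMC")
    if path_bits & ~mask:
        raise ValueError("path bits do not fit in LMC bits")
    return base_lid | path_bits


if __name__ == "__main__":
    # LMC = 2 gives four LIDs per port: 0x10 .. 0x13.
    print(hex(lid_with_path_bits(0x10, 2, 3)))
```

A path-bits value from the QoS-Level that does not fit into the target
port's LMC cannot be honoured, which is why the sketch rejects it.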


From krkumar2 at in.ibm.com  Sun Jul 22 19:53:25 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 08:23:25 +0530
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722094136.GD7800@mellanox.co.il>
Message-ID: <OFDE4B1C58.3892597B-ON65257321.000EF65E-65257321.000FE0C9@in.ibm.com>

Hi Michael,

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 07/22/2007 03:11:36
PM:

> > +   /*
> > +    * Handle skbs completion from tx_tail to wr_id. It is possible to
> > +    * handle WC's from earlier post_sends (possible multiple) in this
> > +    * iteration as we move from tx_tail to wr_id, since if the last
> > +    * WR (which is the one which had a completion request) failed to be
> > +    * sent for any of those earlier request(s), no completion
> > +    * notification is generated for successful WR's of those earlier
> > +    * request(s).
> > +    */
>
> AFAIK a signalled WR will always generate a completion.
> What am I missing?

Yes, a signalled WR will generate a completion. I am trying to catch the
case where, say, I send 64 skbs and set signalling for only the last skb,
and the others are set to NO signalling. Now if the driver finds that the
last WR is bad for some reason, it will synchronously fail the send for that
WR (which happens to be the only one that is signalled). So after skbs 1 to
63 are finished, no completion will be generated. That was my understanding
of how this works, and I coded it that way so that the next post will clean
up the previous one's completion.
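
To make the intent concrete, here is a toy model (not the IPoIB code; the
class and its members are invented for illustration) of reclaiming
everything from tx_tail up to the wr_id of the one signalled completion
that does arrive:

```python
class TxRing:
    """Toy TX ring: slots are reclaimed from tx_tail up to a completed wr_id."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.tx_head = 0  # wr_id that the next posted skb will get
        self.tx_tail = 0  # oldest wr_id not yet reclaimed

    def post(self, skb):
        if self.tx_head - self.tx_tail >= self.size:
            raise RuntimeError("ring full")
        wr_id = self.tx_head
        self.slots[wr_id % self.size] = skb
        self.tx_head += 1
        return wr_id

    def complete_up_to(self, wr_id):
        """Handle one signalled completion; free tx_tail..wr_id inclusive."""
        freed = []
        while self.tx_tail <= wr_id:
            idx = self.tx_tail % self.size
            freed.append(self.slots[idx])
            self.slots[idx] = None
            self.tx_tail += 1
        return freed


if __name__ == "__main__":
    ring = TxRing(8)
    # First batch: three unsignalled skbs went out; the fourth (signalled)
    # WR failed synchronously and never entered the ring.
    for skb in ("a", "b", "c"):
        ring.post(skb)
    # Second batch: only the last WR is signalled.
    ring.post("d")
    last = ring.post("e")
    print(ring.complete_up_to(last))
```

The single completion for the second batch also frees the three skbs left
over from the first one.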

> >
> > +         /*
> > +          * Better error handling can be done here, like free
> > +          * all untried skbs if err == -ENOMEM. However at this
> > +          * time, we re-try all the skbs, all of which will
> > +          * likely fail anyway (unless device finished sending
> > +          * some out in the meantime). This is not a regression
> > +          * since the earlier code is not doing this either.
> > +          */
>
> Are you retrying posting skbs? Why is this a good idea?
> AFAIK, earlier code did not retry posting WRs at all.

Not exactly. If I send 64 skbs to the device and the provider returns a
bad WR at skb# 50, then I will have to try skb# 51-64 again, since the
provider has not attempted to send those out as it bails out at the first
failure. The provider of course has already sent out skb# 1-49 before
returning failure at skb# 50. So it is not strictly a retry, just xmit of
the next skbs, which is what the current code also does. I tested this part
by simulating errors in mthca_post_send and verified that the next
iteration clears up the remaining skbs.
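
A toy simulation of that flow (plain Python, not driver code; the function
names and the is_bad callback are invented for illustration, and whether
the failing skb is dropped or requeued is an assumption of this sketch):

```python
def post_send_batch(wrs, is_bad):
    """Mimic a provider that posts WRs in order and stops at the first bad one.

    Returns (number_posted, index_of_bad_wr or None).
    """
    for i, wr in enumerate(wrs):
        if is_bad(wr):
            return i, i
    return len(wrs), None


def xmit_all(skbs, is_bad):
    """Post in batches; after a failure, continue with the skbs behind it."""
    sent, dropped = [], []
    pending = list(skbs)
    while pending:
        posted, bad = post_send_batch(pending, is_bad)
        sent.extend(pending[:posted])
        if bad is None:
            break
        # Assumption: the failing skb itself is dropped, not retried.
        dropped.append(pending[bad])
        pending = pending[bad + 1:]
    return sent, dropped


if __name__ == "__main__":
    sent, dropped = xmit_all(range(1, 65), lambda skb: skb == 50)
    print(len(sent), dropped)
```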

> The comment seems to imply that post send fails as a result of SQ
> overflow -

Correct.

> do you see SQ overflow errors in your testing?

No.

> AFAIK, IPoIB should never overflow the SQ.

Correct. It should never happen unless IPoIB has a bug :) I guess the
comment should be removed?

Thanks,

- KK


From krkumar2 at in.ibm.com  Sun Jul 22 19:54:58 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 08:24:58 +0530
Subject: [ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
In-Reply-To: <46A38F8D.6080109@trash.net>
Message-ID: <OF5ECADA10.ABB9C3B0-ON65257321.000FEF47-65257321.001004EC@in.ibm.com>

Hi Patrick,

Patrick McHardy <kaber at trash.net> wrote on 07/22/2007 10:40:37 PM:

> Krishna Kumar wrote:
> > diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> > --- org/include/linux/if_link.h   2007-07-20 16:33:35.000000000 +0530
> > +++ rev2/include/linux/if_link.h   2007-07-20 16:35:08.000000000 +0530
> > @@ -78,6 +78,8 @@ enum
> >     IFLA_LINKMODE,
> >     IFLA_LINKINFO,
> >  #define IFLA_LINKINFO IFLA_LINKINFO
> > +   IFLA_TXBTHSKB,      /* Driver support for Batch'd skbs */
> > +#define IFLA_TXBTHSKB IFLA_TXBTHSKB
>
>
> Ughh what a name :) I prefer pronounceable names since they are
> much easier to remember and don't need comments explaining
> what they mean.
>
> But I actually think offering just an ethtool interface would
> be better, at least for now.

Great, I will remove /sys and rtnetlink and keep the Ethtool i/f.

Thanks,

- KK


From krkumar2 at in.ibm.com  Sun Jul 22 19:57:53 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 08:27:53 +0530
Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
In-Reply-To: <46A38EAB.6050300@trash.net>
Message-ID: <OFFAB073E4.487DACF3-ON65257321.00101BE3-65257321.00104971@in.ibm.com>

Hi Patrick,

Patrick McHardy <kaber at trash.net> wrote on 07/22/2007 10:36:51 PM:

> Krishna Kumar wrote:
> > @@ -472,6 +474,9 @@ struct net_device
> >     void         *priv;   /* pointer to private data   */
> >     int         (*hard_start_xmit) (struct sk_buff *skb,
> >                        struct net_device *dev);
> > +   int         (*hard_start_xmit_batch) (struct net_device
> > +                       *dev);
> > +
>
>
> Is this function really needed? Can't you just call hard_start_xmit with
> a NULL skb and have the driver use dev->blist?

Probably not. I will see how to do it this way and get back to you.

> >     /* These may be needed for future network-power-down code. */
> >     unsigned long      trans_start;   /* Time (in jiffies) of last Tx */
> >
> > @@ -582,6 +587,8 @@ struct net_device
> >  #define   NETDEV_ALIGN      32
> >  #define   NETDEV_ALIGN_CONST   (NETDEV_ALIGN - 1)
> >
> > +#define BATCHING_ON(dev)   ((dev->features & NETIF_F_BATCH_ON) != 0)
> > +
> >  static inline void *netdev_priv(const struct net_device *dev)
> >  {
> >     return dev->priv;
> > @@ -832,6 +839,8 @@ extern int      dev_set_mac_address(struct n
> >                     struct sockaddr *);
> >  extern int      dev_hard_start_xmit(struct sk_buff *skb,
> >                     struct net_device *dev);
> > +extern int      dev_add_skb_to_blist(struct sk_buff *skb,
> > +                    struct net_device *dev);
>
>
> Again, function signatures should be introduced in the same patch
> that contains the function. Splitting by file doesn't make sense.

Right. I did it for some but missed this. Sorry, will redo.

thanks,

- KK


From rdreier at cisco.com  Sun Jul 22 20:48:39 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Jul 2007 20:48:39 -0700
Subject: [ofa-general] Merge window for 2.6.23 closed
Message-ID: <adak5sr231k.fsf@cisco.com>

Linus has released 2.6.23-rc1, and so the merge window for new
features has closed.  Of course fixes are always accepted at any time,
and it definitely makes sense to submit new features early -- I will
happily queue things up for 2.6.24 as soon as they are ready.

Several things missed the merge window:

 - Sean's local SA changes.  I guess I scared everyone into going
   slow, so I didn't have to make a hard choice here.  However let's
   try to keep the discussion going so that we can finish this for
   2.6.24.

 - IPoIB CM without SRQ.  Pradeep, I'm sorry this missed the window
   but the patch quality really doesn't look up to par to me, and
   your being in a rush to get this merged I think has actually slowed
   things up.  I think the basic idea is OK, but I have doubts about
   a static array as a data structure, and MST's comments about not
   dealing with remote implementations that send packets on passive
   connections look quite serious as well.  I would like to close
   this for 2.6.24 so (as above) please let's keep working this and
   not wait for the 2.6.24 merge window.

 - MST's "MSI-X by default" patches.  The idea seems fine but I found
   a few minor issues and just ran out of time to review it.  My fault--
   sorry.

And now I'm going on vacation for a week, so talk amongst yourselves...

 - Roland


From rdreier at cisco.com  Sun Jul 22 20:50:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 22 Jul 2007 20:50:06 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A2F696.4060007@voltaire.com> (Or Gerlitz's message of "Sun,
	22 Jul 2007 09:17:58 +0300")
References: <adalkdl43w0.fsf@cisco.com> <adahco943ip.fsf@cisco.com>
	<4696D1F3.2040507@ichips.intel.com>
	<15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
	<adaabtuo0n9.fsf@cisco.com>
	<f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com>
	<20070718050928.GA3103@obsidianresearch.com>
	<ada7ioymfsw.fsf@cisco.com> <20070718072841.GC1115@mellanox.co.il>
	<469DD7BB.6060009@voltaire.com> <46A2F696.4060007@voltaire.com>
Message-ID: <adafy3f22z5.fsf@cisco.com>

 > Do you agree that using cached IB L2 info where the net stack wants to
 > renew its IPoIB L2 (which is IB L3 && L4) info is a bug?

Yes, looks that way.

Also your point that there's no reason for IPoIB to keep the path info
once it has created the AH makes sense to me.  I haven't had a chance
to look at the code but it seems we could kill off a lot of stuff by
just creating AHs immediately and then dumping the path record.

 - R.


From krkumar2 at in.ibm.com  Sun Jul 22 21:23:29 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 09:53:29 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720125423.GB13468@2ka.mipt.ru>
Message-ID: <OF00C1E678.386A9C8E-ON65257321.0017122B-65257321.00181FC6@in.ibm.com>

Hi Evgeniy,

Evgeniy Polyakov <johnpol at 2ka.mipt.ru> wrote on 07/20/2007 06:24:23 PM:

> Hi Krishna.
>
> On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote:
> > After fine-tuning qdisc and other changes, I modified IPoIB to use this
> > API, and now get good gains. Summary for TCP & No Delay: 1 process
> > improves for all cases from 1.4% to 49.5%; 4 process has almost identical
> > improvements from -1.7% to 59.1%; 16 process case also improves in the
> > range of -1.2% to 33.4%; while 64 process doesn't have much improvement
> > (-3.3% to 12.4%). UDP was tested with 1 process netperf with small
> > increase in BW but big improvement in Service Demand. Netperf latency
> > tests show small drop in transaction rate (results in separate
> > attachment).
>
> What about round-robin tcp time and latency test? In theory such batching
> mode should not change that timings, but practice can show new aspects.

The TCP RR results show a slight impact, however the service demand shows
good improvement. The results are (I did TCP RR - 1 process, 1,8,32,128,512
buffer sizes; and UDP RR - 1 process, 1 byte buffer size) :

        Results for TCP RR (1 process) ORG code:
Size        R-R                   CPU%            S.Demand
------------------------------------------------------------
1         521346.02               5.48            1346.145
8         129463.14               6.74            418.370
32        128899.73               7.51            467.106
128       127230.15               5.42            340.876
512       119605.68               6.48            435.650


        Results for TCP RR (1 process) NEW code (and change%):
Size        R-R                   CPU%            S.Demand
--------------------------------------------------------------------
1         516596.62 (-0.91%)      5.74            1423.819 (5.77%)
8         129184.46 (-0.22%)      5.43            336.747 (-19.51%)
32        128238.35 (-0.51%)      5.43            339.213 (-27.38%)
128       126545.79 (-0.54%)      5.36            339.188 (-0.50%)
512       119297.49 (-0.26%)      5.16            346.185 (-20.54%)


              Results for UDP RR 1 process ORG & NEW code:
Code   Size      R-R                CPU%      S.Demand
----------------------------------------------------------------------
ORG     1        539327.86          5.68      1348.985
NEW     1        540669.33 (0.25%)  6.05      1434.180 (6.32%)
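
The change% figures in the tables are relative to the ORG run, i.e.
(NEW - ORG) / ORG * 100. A small sketch that recomputes a few of them from
the numbers above (the helper name is made up):

```python
def change_pct(org, new):
    """Relative change of NEW versus ORG, in percent."""
    return (new - org) / org * 100.0


# (size, ORG R-R, NEW R-R, ORG S.Demand, NEW S.Demand) from the TCP RR tables.
TCP_RR = [
    (1, 521346.02, 516596.62, 1346.145, 1423.819),
    (8, 129463.14, 129184.46, 418.370, 336.747),
    (512, 119605.68, 119297.49, 435.650, 346.185),
]

if __name__ == "__main__":
    for size, rr_org, rr_new, sd_org, sd_new in TCP_RR:
        print("%4d  R-R %+.2f%%  S.Demand %+.2f%%"
              % (size, change_pct(rr_org, rr_new), change_pct(sd_org, sd_new)))
```

A negative S.Demand change is an improvement (less CPU per transaction),
while a negative R-R change is a drop in transaction rate.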


> I will review code later this week (likely tomorrow) and if there will
> be some issues return back.

Thanks! I had just submitted Rev2 on Sunday, please let me know what you
find.

Regards,

- KK


From kliteyn at mellanox.co.il  Sun Jul 22 21:43:44 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 23 Jul 2007 07:43:44 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-23:normal completion
Message-ID: <MTLEXCH01ryWtIIZS2T00000078@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From krkumar2 at in.ibm.com  Sun Jul 22 21:49:53 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 10:19:53 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <1185108670.5192.122.camel@localhost>
Message-ID: <OFFA8C2879.1D54F523-ON65257321.0010F240-65257321.001A8A31@in.ibm.com>

Hi Jamal,

J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/22/2007 06:21:09 PM:

> My concern is there is no consistency in results. I see improvements on
> something which you say dont. You see improvement in something that
> Evgeniy doesnt etc.

Hmmm ? Evgeniy has not even tested my code to find some regression :) And
you may possibly not find much improvement in E1000 when you run iperf
(which is what I do) compared to pktgen. I can re-run and confirm this
since my last E1000 run was quite some time back.

My point is that batching not being viable for E1000 (or tg3) need not be
the sole criterion for inclusion. If IPoIB or other drivers can take
advantage of it and get better results, then batching can be considered.
Maybe E1000 too can get improvements if someone with more expertise tries
to add this API (not judging your driver writing capabilities - just
stating that driver writers will know more knobs to exploit a complex
device like E1000).

> > Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't
> > find it), I feel having prep will not help as no other cpu can execute
> > the queue/xmit code anyway (E1000 is also a LLTX driver).
>
> My experiments show it is useful (in a very visible way using pktgen)
> for e1000 to have the prep() interface.

I meant: have you compared results of batching with prep on vs prep off,
and what is the difference in BW?

> >  Other driver that hold tx lock could get improvement however.
>
> So you do see the value then with non LLTX drivers, right? ;->

No. I see value only in non-LLTX drivers which also take the same TX lock
in the RX path. If TX and RX take different locks, then since you are
holding queue_lock before calling 'prep', this excludes other TX from
running at the same time. In that case, taking the tx_lock earlier, so that
'prep' runs under it, will not cause any degradation (since no other tx can
run anyway, while rx can run as it takes a different lock).

> The value is also there in LLTX drivers even if in just formating a skb
> ready for transmit. If this is not clear i could do a much longer
> writeup on my thought evolution towards adding prep().

In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
any case, so there should be no improvement. Could you send the write-up,
since I really don't see the value in prep unless the driver is non-LLTX
*and* TX/RX hold the same TX lock? I think that is the sole criterion,
right?

> > If it helps, I guess you could send me a patch to
> > add that and I can also test it to see what the effect is. I didn't add
> > it since IPoIB wouldn't be able to exploit it (unless someone is kind
> > enough to show me how to).
>
> Such core code should not just be focussed on IPOIB.

There is *nothing* IPoIB specific in my code. I said adding prep doesn't
work for IPoIB and so it is pointless to add bloat to the code until some
code can actually take advantage of this feature (I am sure you will
agree). Which is why I also asked you to please send me a patch if you find
it useful for any driver, rather than rejecting this idea.

> > I think the code I have is ready and stable,
>
> I am not sure how to intepret that - are you saying all-is-good and we
> should just push your code in?

I am only too well aware that Dave will not accept any code (having
experienced with Mobile IPv6 a long time back when he said to move most
of it to userspace and he was absolutely correct :). What I meant to say
is that there isn't much point in saying that your code is not ready or
you are using old code base, or has multiple restart functions, or is not
tested enough, etc, and then say let's re-do/rethink the whole
implementation when my code is already working and giving good results.
Unless you have some design issues with it, or code is written badly, is
not maintainable, not linux style compliant, is buggy, will not handle
some case/workload, type of issues.

OTOH, if you find some cases that are better handled with :
      1. prep handler
      2. xmit_win (which I don't have now),
then please send me patches and I will also test out and incorporate.

> It sounds disingenuous but i may have misread you.

("lacking in frankness, candor, or sincerity; falsely or hypocritically
ingenuous; insincere") ???? Sorry, no response to personal comments and
have a flame-war :)

Thanks,

- KK


From sri at us.ibm.com  Sun Jul 22 22:59:39 2007
From: sri at us.ibm.com (Sridhar Samudrala)
Date: Sun, 22 Jul 2007 22:59:39 -0700
Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <OF46F8CFAE.90A18F30-ON6525731F.0023182A-6525731F.0023B8DF@in.ibm.com>
References: <OF46F8CFAE.90A18F30-ON6525731F.0023182A-6525731F.0023B8DF@in.ibm.com>
Message-ID: <46A443CB.6060200@us.ibm.com>

Krishna Kumar2 wrote:
> Hi Sridhar,
> 
> Sridhar Samudrala <sri at us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
>>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
>>> --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
>>> +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
>>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
>>>        struct rtattr *tab);
>>>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
>>>
>>> -extern void __qdisc_run(struct net_device *dev);
>>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
>> Why do we need this additional 'blist' argument?
>> Is this different from dev->skb_blist?
> 
> It is the same, but I want to call it mostly with NULL and rarely with the
> batch list pointer (so it is related to your other question). My original
> code didn't have this and was trying batching in all cases. But in most
> xmit's (probably almost all), there will be only one packet in the queue to
> send and batching will never happen. When there is a lock contention or if
> the queue is stopped, then the next iteration will find >1 packets. But I
> still will try no batching for the lock failure case as there be probably
> 2 packets (one from previous time and 1 from this time, or 3 if two
> failures,
> etc), and try batching only when queue was stopped from net_tx_action (this
> was based on Dave Miller's idea).


Is this right to say that the above change is to get this behavior?
   If qdisc_run() is called from dev_queue_xmit() don't use batching.
   If qdisc_run() is called from net_tx_action(), do batching.

Isn't it possible to have multiple skb's in the qdisc queue in the
first case?

If this additional argument is used to indicate if we should do batching
or not, then passing a flag may be much cleaner than passing the blist.

Thanks
Sridhar


From krkumar2 at in.ibm.com  Sun Jul 22 23:27:07 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 11:57:07 +0530
Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <46A443CB.6060200@us.ibm.com>
Message-ID: <OF9DDA4E7D.6ED7F3A1-ON65257321.0022D06C-65257321.0023714C@in.ibm.com>

Hi Sridhar,

Sridhar Samudrala <sri at us.ibm.com> wrote on 07/23/2007 11:29:39 AM:

> Krishna Kumar2 wrote:
> > Hi Sridhar,
> >
> > Sridhar Samudrala <sri at us.ibm.com> wrote on 07/20/2007 10:55:05 PM:
> >>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> >>> --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.000000000 +0530
> >>> +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.000000000 +0530
> >>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
> >>>        struct rtattr *tab);
> >>>  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> >>>
> >>> -extern void __qdisc_run(struct net_device *dev);
> >>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
> >> Why do we need this additional 'blist' argument?
> >> Is this different from dev->skb_blist?
> >
> > It is the same, but I want to call it mostly with NULL and rarely with
> > the batch list pointer (so it is related to your other question). My
> > original code didn't have this and was trying batching in all cases.
> > But in most xmit's (probably almost all), there will be only one packet
> > in the queue to send and batching will never happen. When there is a
> > lock contention or if the queue is stopped, then the next iteration
> > will find >1 packets. But I still will try no batching for the lock
> > failure case as there be probably 2 packets (one from previous time and
> > 1 from this time, or 3 if two failures, etc), and try batching only
> > when queue was stopped from net_tx_action (this was based on Dave
> > Miller's idea).
>
> Is this right to say that the above change is to get this behavior?
>    If qdisc_run() is called from dev_queue_xmit() don't use batching.
>    If qdisc_run() is called from net_tx_action(), do batching.

Correct.

> Isn't it possible to have multiple skb's in the qdisc queue in the
> first case?

It is possible but rarer (so the check is unnecessary most of the time).
From net_tx_action you are guaranteed to have multiple skbs, but from xmit
you will almost always get one skb (since most sends of 1 skb will go out
OK). And also in the xmit path, it is more likely to have a few skbs,
compared to possibly hundreds in the net_tx_action path.

> If this additional argument is used to indicate if we should do batching
> or not, then passing a flag may be much cleaner than passing the
> blist.

OK, I will add this as another action item to check (along with Patrick's
suggestion to use single API) and will get back.
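
To make the flag suggestion concrete, here is a user-space toy model, not the kernel code. All names (toy_queue, toy_qdisc_run and the two path wrappers) are hypothetical stand-ins; the only thing taken from the thread is the policy: no batching when called from the dev_queue_xmit() path, batching when called from net_tx_action().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the qdisc state; none of this is the patch's code. */
struct toy_queue {
	size_t pending; /* packets waiting in the qdisc */
	bool stopped;   /* driver cannot accept packets right now */
};

/*
 * Hypothetical flag-based variant of __qdisc_run(dev, blist): the caller
 * says whether batching is wanted instead of passing a list pointer.
 * Returns the number of driver xmit calls used to drain the queue.
 */
static size_t toy_qdisc_run(struct toy_queue *q, bool do_batch)
{
	size_t calls = 0;

	assert(q != NULL);
	if (q->stopped)
		return 0;
	while (q->pending > 0) {
		/* Batch mode hands all pending packets over in one call;
		 * otherwise packets go out one per call. */
		q->pending -= do_batch ? q->pending : 1;
		calls++;
	}
	return calls;
}

/* dev_queue_xmit() path: usually a single packet, so no batching. */
static size_t toy_xmit_path(struct toy_queue *q)
{
	return toy_qdisc_run(q, false);
}

/* net_tx_action() path: the queue was stopped earlier, so batch. */
static size_t toy_tx_action_path(struct toy_queue *q)
{
	return toy_qdisc_run(q, true);
}
```

With a boolean the two call sites differ only in one argument, which is the readability point raised above; whether a flag or the list pointer is the better interface is left open in the thread.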

- KK


From eli at mellanox.co.il  Sun Jul 22 23:32:19 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 23 Jul 2007 09:32:19 +0300
Subject: [ofa-general] ipoib question
Message-ID: <1185172339.5513.11.camel@mtls03>

Roland,

can you explain why you add 1 to the size of the CQ in
ipoib_transport_dev_init()?

 
	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
	if (IS_ERR(priv->mr)) {
		printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name);
		goto out_free_pd;
	}

	size = ipoib_sendq_size + ipoib_recvq_size + 1;
	ret = ipoib_cm_dev_init(dev);


From monisonlists at gmail.com  Mon Jul 23 00:07:42 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Mon, 23 Jul 2007 10:07:42 +0300
Subject: [ofa-general] [PATCH] IB/core: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <46A36E77.5020307@gmail.com>
References: <46A36E77.5020307@gmail.com>
Message-ID: <46A453BE.3030408@gmail.com>

I am resending the patch with Signed-off-by line. Sorry.
------------------------------------------------------------------

IPoIB turns on the P_Key membership bit of limited membership P_Keys
when creating a child interface. After that IPoIB looks for the full
membership P_Key in the table to make the interface "RUNNING". This
patch fixes the pkey lookup so that it matches full and partial membership
keys that belong to the same partition.

Signed-off-by: Moni Shoua <monis at voltaire.com>
---

 device.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)
Index: infiniband/drivers/infiniband/core/device.c
===================================================================
--- infiniband.orig/drivers/infiniband/core/device.c	2007-07-08 12:45:07.000000000 +0300
+++ infiniband/drivers/infiniband/core/device.c	2007-07-22 17:43:32.440829619 +0300
@@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic
 		if (ret)
 			return ret;
 
-		if (pkey == tmp_pkey) {
+		if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
 			*index = i;
 			return 0;
 		}
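
The comparison introduced by the patch can be tried outside the kernel. The sketch below is stand-alone C and assumes only what the diff shows (the 0x7fff mask) plus the InfiniBand convention that bit 15 of a P_Key is the membership bit; the helper names and the table scan are hypothetical and are not the kernel's ib_find_pkey().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 15 of a 16-bit P_Key is the membership bit (1 = full, 0 = limited);
 * the low 15 bits identify the partition. Same mask as in the patch. */
#define PKEY_BASE_MASK 0x7fff

/* Hypothetical helper: true when both keys name the same partition,
 * regardless of full or limited membership. */
static bool pkey_same_partition(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/* Hypothetical table scan modelled on the loop touched by the patch:
 * returns the first matching index, or -1 if no entry matches. */
static int find_pkey_index(const uint16_t *table, int len, uint16_t pkey)
{
	assert(table != NULL || len == 0);
	for (int i = 0; i < len; i++)
		if (pkey_same_partition(table[i], pkey))
			return i;
	return -1;
}
```

With this comparison a lookup for the full-membership key 0x8001 finds a table entry holding the limited-membership key 0x0001, which is the case the commit message describes.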


From dotanb at dev.mellanox.co.il  Mon Jul 23 00:36:44 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 23 Jul 2007 10:36:44 +0300
Subject: [ofa-general] I think that there is a resource leak in the core file
	mad_rmpp.c
Message-ID: <46A45A8C.2090800@dev.mellanox.co.il>

Hi.

I reviewed the file mad_rmpp.c and it seems that there is a leak of the
Address Handle.
The AH that is created in the function "alloc_response_msg" is
never destroyed.


This causes both a resource (AH) leak and a memory leak.

thanks
Dotan


From eitan at mellanox.co.il  Mon Jul 23 00:31:14 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 23 Jul 2007 10:31:14 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer
	to opensm-coding-style.txt
In-Reply-To: <20070722221455.GR27878@sashak.voltaire.com>
References: <20070722221455.GR27878@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com>

Hi Sasha,

So we will finally have a common enforced coding style!
When do you plan to run it on all the files?
Or should we just make sure every new committed file will first pass
this indent?

Thanks

Eitan

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Sasha Khapyorsky
> Sent: Monday, July 23, 2007 1:15 AM
> To: general at lists.openfabrics.org
> Cc: Yevgeny Kliteynik
> Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go 
> closer to opensm-coding-style.txt
> 
> 
> This updates the script according to recent 
> doc/opensm-coding-style.txt (in short K&R, tabs, etc.).
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_indent |   57 +++------------------------------------------
>  1 files changed, 4 insertions(+), 53 deletions(-)
> 
> diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent
> index bed2ba1..621184b 100755
> --- a/opensm/opensm/osm_indent
> +++ b/opensm/opensm/osm_indent
> @@ -1,6 +1,6 @@
>  #!/bin/bash
>  #
> -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
> +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
>  # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>  # Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>  #
> @@ -40,56 +40,7 @@
>  #  Environment:
>  #  	Linux User Mode
>  #
> -#  $Revision: 1.4 $
> -#
> -#
> -# This is the indent format used for OpenSM.
> -#
> -# format the source code according to the ACD standard
> -# -bad	Blank line after declarations
> -# -bap	Blank line after Procedures
> -# -bbb	Blank line before block comments
> -# -nbbo	Break after Boolean operator
> -# -bl	Break after if line
> -# -bli0 Indent for braces is 0
> -# -bls	Break after struct declarations
> -# -cbi0	Case break indent 0
> -# -ci3	Continue indent 3 spaces
> -# -cli0	Case label indent 0 spaces
> -# -ncs	No space after cast operator
> -# -hnl	Honor existing newlines on long lines
> -# -i3	Substitute indent with 3 spaces
> -# -npcs	No space after procedure calls
> -# -prs	Space after parenthesis
> -# -nsai	No space after if keyword - removed
> -# -nsaw	No space after while keyword - removed
> -# -sc	Put * at left of comments in a block comment style
> -# -nsob	Don't swallow unnecessary blank lines
> -# -ts3	Tab size is 3
> -# -psl	Type of procedure return in a separate line
> -# -bfda	Function declaration arguments in a separate line.
> -# -nut   No tabs as we allow spaces
> -#
> -#########################################################################
> -
> -# indent the world
> -for sourcefile in $*; do
> -    if test -f "$sourcefile"; then
> -        # first, string DOS style linefeeds
> -        perl -piW -e's/\x0D//' "$sourcefile"
> -        echo Processing $sourcefile
> -        indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \
> -                -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile
> -
> -        rm ${sourcefile}W
> +# This is the indent format used for OpenSM (similar to one used in
> +# linux/scripts/Lindent).
>  
> -        # the -bb also affect the first line in each file - so clean it up
> -        if test `head -1 $sourcefile | egrep -v '^$' | wc -l` = 0; then
> -            echo Cleaning up first empty line of $sourcefile
> -            awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W
> -            mv -f ${sourcefile}W $sourcefile
> -        fi
> -    else
> -        echo Could not find file:$sourcefile
> -    fi
> -done
> +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@"
> --
> 1.5.3.rc2.29.gc4640f


From eitan at mellanox.co.il  Mon Jul 23 00:35:25 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 23 Jul 2007 10:35:25 +0300
Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <20070722174048.GO27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com>

Hi Sasha,

> On 14:59 Sun 22 Jul     , Eitan Zahavi wrote:
> > Hi Sasha
> > 
> > Let's assume someone has reset a switch on the fabric.
> > What would cause the SM to re-assign the LFT of that switch?
> 
> OpenSM will sweep and drop this switch, and when the switch comes 
> back it will be initialized again. But if the reset was too 
> fast (relative to discovery), we can be in trouble (and maybe 
> not only with LFTs).
> 
> > I assumed that there is a mechanism to do that.
> 
> Not for "fast" switch reboot.
So we have a problem with these fast resetting devices.
> 
> Hmm, I think we could try to detect this case by comparing 
> SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even 
> by seeing that PortInfo:LID is not set. Something like below:
> 
I think we should have a predicate that will be used to mark a
port/device as needing a full update.
Not just LFT but everything (SL2VL, VLArb, LID, PKey ...): if a device was
reset then it probably lost everything.
Another approach is to mark it for the entire fabric.

The original intention of kill -HUP was to force a new heavy sweep and
setup.
I think another signal is acceptable but not required.

Thanks

Eitan


From mst at dev.mellanox.co.il  Mon Jul 23 00:57:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jul 2007 10:57:54 +0300
Subject: [ofa-general] commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21
Message-ID: <20070723075754.GC20614@mellanox.co.il>

Hi!
commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 includes this snippet:

@@ -468,20 +465,8 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
                req->fmr = NULL;
        }

-       /*
-        * This handling of non-SG commands can be killed when the
-        * SCSI midlayer no longer generates non-SG commands.
-        */
-       if (likely(scmnd->use_sg)) {
-               nents = scmnd->use_sg;
-               scat  = scmnd->request_buffer;
-       } else {
-               nents = 1;
-               scat  = &req->fake_sg;
-       }
-
-       ib_dma_unmap_sg(target->srp_host->dev->dev, scat, nents,
-                       scmnd->sc_data_direction);
+       ib_dma_unmap_sg(target->srp_host->dev->dev, scsi_sglist(scmnd),
+                       scsi_sg_count(scmnd), scmnd->sc_data_direction);

Since scsi_sg_count is simply use_sg, and scsi_sglist is simply request_buffer,
why is this the right thing to do?
Is there a reason to believe that scsi_sg_count is never 0 here?


-- 
MST


From ogerlitz at voltaire.com  Mon Jul 23 01:26:31 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 23 Jul 2007 11:26:31 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <adafy3f22z5.fsf@cisco.com>
References: <adalkdl43w0.fsf@cisco.com>
	<adahco943ip.fsf@cisco.com><4696D1F3.2040507@ichips.intel.com><15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com><adaabtuo0n9.fsf@cisco.com><f0e08f230707172140y36569958xadcba6faae1d65b3@mail.gmail.com><20070718050928.GA3103@obsidianresearch.com><ada7ioymfsw.fsf@cisco.com>
	<20070718072841.GC1115@mellanox.co.il><469DD7BB.6060009@voltaire.com>
	<46A2F696.4060007@voltaire.com> <adafy3f22z5.fsf@cisco.com>
Message-ID: <46A46637.3080104@voltaire.com>

Roland Dreier wrote:
> 
>  > Do you agree that using cached IB L2 info where the net stack wants to
>  > renew its IPoIB L2 (which is IB L3 && L4) info is a bug?
> 
> Yes, looks that way.
> 
> Also your point that there's no reason for IPoIB to keep the path info
> once it has created the AH makes sense to me.  I haven't had a chance
> to look at the code but it seems we could kill off a lot of stuff by
> just creating AHs immediately and then dumping the path record.

Indeed.

It does make sense to keep the path info for admin / debugging purposes,
e.g. printing it through debugfs, but no more.

In the context of the local SA, this seems to be another requirement,
namely: provide the consumer with an API to specify whether it is willing to
get cached IB L2 info (path) from the ib_sa module or not.

As I said above, if the network stack decides to renew its IPoIB L2
info, the IB stack must provide it with non-cached IB L2 info.

Or.


From mst at dev.mellanox.co.il  Mon Jul 23 01:30:20 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jul 2007 11:30:20 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A46637.3080104@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com> <46A2F696.4060007@voltaire.com>
	<adafy3f22z5.fsf@cisco.com> <46A46637.3080104@voltaire.com>
Message-ID: <20070723083020.GD20614@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [ofa-general] Re: IPoIB path caching
> 
> Roland Dreier wrote:
> >
> > > Do you agree that using cached IB L2 info where the net stack wants to
> > > renew its IPoIB L2 (which is IB L3 && L4) info is a bug?
> >
> >Yes, looks that way.
> >
> >Also your point that there's no reason for IPoIB to keep the path info
> >once it has created the AH makes sense to me.  I haven't had a chance
> >to look at the code but it seems we could kill off a lot of stuff by
> >just creating AHs immediately and then dumping the path record.
> 
> Indeed.
> 
> It does make sense to keep the path info for admin / debugging purposes, 
> eg printing them through debugfs etc, but no more.
> 
> In the context of the local sa, this seems to be another requirement 
> namely: provide the consumer with an API to specify if it is willing to 
> get from the ib_sa module a cached IB L2 info (path) or not.
> 
> As I said above, if the network stack decides to renew its IPoIB L2 
> info, the IB stack must provide it with non-cached IB L2 info

If what you have in mind is keeping local sa cache in sync
with IPoIB cache, wouldn't it be better to have an API to
invalidate a cache entry?

-- 
MST


From vlad at lists.openfabrics.org  Mon Jul 23 01:39:39 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 23 Jul 2007 01:39:39 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070723-0100 daily build status
Message-ID: <20070723083940.1743EE603BD@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From ogerlitz at voltaire.com  Mon Jul 23 01:43:09 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 23 Jul 2007 11:43:09 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070723083020.GD20614@mellanox.co.il>
References: <adalkdl43w0.fsf@cisco.com>
	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>
	<46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
Message-ID: <46A46A1D.6040000@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:

>> Roland Dreier wrote:

>>>> Do you agree that using cached IB L2 info where the net stack wants to
>>>> renew its IPoIB L2 (which is IB L3 && L4) info is a bug?

>>> Yes, looks that way.

>>> Also your point that there's no reason for IPoIB to keep the path info
>>> once it has created the AH makes sense to me.  I haven't had a chance
>>> to look at the code but it seems we could kill off a lot of stuff by
>>> just creating AHs immediately and then dumping the path record.
>> Indeed.

>> As I said above, if the network stack decides to renew its IPoIB L2 
>> info, the IB stack must provide it with non-cached IB L2 info

> If what you have in mind is keeping local sa cache in sync
> with IPoIB cache, wouldn't it be better to have an API to
> invalidate a cache entry?

What I have in mind is that IPoIB must not use cached IB path info.

If the IB stack has path caching in the default flow of
requesting a path record, it should provide an API (e.g. a flag to the
function through which one does a path query) to request a non-cached path.

The design I was thinking of suggesting for IPoIB is to almost always use
this API, since this policy makes the implementation consistent with the
decisions made by the network stack's neighbour cache.

Or.


From shemminger at linux-foundation.org  Mon Jul 23 02:44:08 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Mon, 23 Jul 2007 10:44:08 +0100
Subject: [ofa-general] Re: TCP and batching WAS(Re: [PATCH 00/10] Implement
 batching skb API
In-Reply-To: <1185025579.5192.68.camel@localhost>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720081848.7cc652fb@oldman>
	<1185025579.5192.68.camel@localhost>
Message-ID: <20070723104408.169b0724@oldman.hamilton.local>

On Sat, 21 Jul 2007 09:46:19 -0400
jamal <hadi at cyberus.ca> wrote:

> On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:
> 
> > You may see worse performance with batching in the real world when
> > running over WAN's.  Like TSO, batching will generate back to back packet
> > trains that are subject to multi-packet synchronized loss. 
> 
> Has someone done any study on TSO effect? 
Not that I have seen. TCP research tends to turn off NAPI and TSO because they
cause other effects which are too confusing for measurement. The discussion
of TSO usually shows up in discussions of pacing. I have seen arguments both
pro and con for pacing. The most convincing arguments are that pacing doesn't
help in the general case (and therefore TSO would be ok). 

> Doesnt ECN with a RED router
> help on something like this?
Yes, but RED is not deployed on the backbone, and ECN only slightly.
Most common are oversized FIFO queues.

> I find it suprising that a single flow doing TSO would overwhelm a
> routers buffer. I actually think the value of batching as far as TCP is
> concerned is propotional to the number of flows. i.e the more flows you
> have the more batching you will end up doing. And if TCPs fairness is
> the legend talk it has been made to be, then i dont see this as
> problematic.

It is not that TSO would overwhelm the router by itself, just that any
congested link will have periods when there is only a small number of
available slots left. When this happens a TSO burst will get truncated.

The argument against pacing, and for TSO, is that the busy sender with a
large congestion window is the one most likely to send large bursts.
For fairness, the system works better if the busy sender gets penalized more,
and dropping the latter part of the burst does that.  With pacing, the sender
may be able to saturate the router more and not detect that it is monopolizing
the bandwidth.


> BTW, something i noticed regards to GSO when testing batching:
> For TCP packets slightly above MDU (upto 2K), GSO gives worse
> performance than non-GSO. Actually has nothing to do with batching,
> rather it works the same way with or without batching changes.
> 
> Another oddity:
> Looking at the flow rate from a purely packets/second (I know thats a
> router centric view, but i found it strange nevertheless) - you see that
> as packet size goes up, the pps also goes up. I tried mucking around
> with nagle etc, but saw no observable changes. Any insight?
> My expectation was that the pps would stay at least the same or get
> better with smaller packets (assuming theres less data to push around).
> 
> cheers,
> jamal
> 
> 
> 


From krkumar2 at in.ibm.com  Mon Jul 23 02:53:27 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 15:23:27 +0530
Subject: [ofa-general] Re: [PATCH 00/12 -Rev2] Implement batching skb API
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <OFEEE36890.E2C7EE2B-ON65257321.00333027-65257321.0036555D@in.ibm.com>

> I have started a 10 run test for various buffer sizes and processes, and
> will post the results on Monday.

The 10 iteration run results for Rev2 are (average) :

------------------------------------------------------------
Test Case                  Org            New       %Change
------------------------------------------------------------

                    TCP 1 Process
Size:32                   2703           3063         13.31
Size:128                 12948          12217         -5.64
Size:512                 48108          55384         15.12
Size:4096               129089         132586          2.70
Average:                192848         203250          5.39

                    TCP 4 Processes
Size:32                  10389          10768          3.64
Size:128                 39694          42265          6.47
Size:512                159563         156373         -1.99
Size:4096               268094         256008         -4.50
Average:                477740         465414         -2.58

                    TCP No Delay 1 Process
Size:32                   2606           2950         13.20
Size:128                  8115          11864         46.19
Size:512                 39113          42608          8.93
Size:4096               103966         105333          1.31
Average:                153800         162755          5.82

                    TCP No Delay 4 Processes
Size:32                   4213           8727        107.14
Size:128                 17579          35143         99.91
Size:512                 70803         123936         75.04
Size:4096               203541         225259         10.67
Average:                296136         393065         32.73

------------------------------------------------------------
Average:               1120524        1224484          9.28%

There are three cases that degrade a little (up to -5.6%), but there are 13 cases
that improve, and many of those (7 cases) are in the 13% to over 100% range.
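
For anyone re-deriving the %Change column, a minimal sketch of the
arithmetic is below. The function name is made up for illustration, and it
assumes the column is simply (New - Org) / Org * 100; the posted values
match that formula to within a hundredth of a percent:

```c
/* Percentage change from org to new_val, as in the %Change column. */
double pct_change(double org, double new_val)
{
	return (new_val - org) / org * 100.0;
}
```

For example, pct_change(2703, 3063) is about 13.3 and
pct_change(12948, 12217) is about -5.6, in line with the first two rows of
the TCP 1 Process group.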

Thanks,

- KK

Krishna Kumar2/India/IBM at IBMIN wrote on 07/22/2007 02:34:57 PM:

> This set of patches implements the batching API, and makes the following
> changes resulting from the review of the first set:
>
> Changes :
> ---------
> 1.  Changed skb_blist from pointer to static as it saves only 12 bytes
>     (i386), but bloats the code.
> 2.  Removed requirement for driver to set "features & NETIF_F_BATCH_SKBS"
>     in register_netdev to enable batching as it is redundant. Changed this
>     flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other
>     user-changeable calls can modify this bit to enable/disable batching.
> 3.  Added ethtool support to enable/disable batching (not tested).
> 4.  Added rtnetlink support to enable/disable batching (not tested).
> 5.  Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers
>     should not have a small queue anyway (adding bloat).
> 6.  skbs are purged from dev_deactivate instead of from unregister_netdev
>     to drop all references to the device.
> 7.  Removed changelog in source code in sch_generic.c, and unrelated renames
>     from sch_generic.c (lockless, comments).
> 8.  Removed xmit_slots entirely, as it was adding bloat (code and header)
>     and not adding value (it is calculated and set twice in internal send
>     routine and handle work completion, and referenced once in batch xmit;
>     and can instead be calculated once in xmit).
>
> Issues :
> --------
> 1. Remove /sysfs support completely ?
> 2. Whether rtnetlink support is required as GSO has only ethtool ?
>
> Patches are described as:
>    Mail 0/12  : This mail.
>    Mail 1/12  : HOWTO documentation.
>    Mail 2/12  : Changes to netdevice.h
>    Mail 3/12  : dev.c changes.
>    Mail 4/12  : Ethtool changes.
>    Mail 5/12  : sysfs changes.
>    Mail 6/12  : rtnetlink changes.
>    Mail 7/12  : Change in qdisc_run & qdisc_restart API, modify callers
>            to use this API.
>    Mail 8/12  : IPoIB include file changes.
>    Mail 9/12  : IPoIB verbs changes
>    Mail 10/12 : IPoIB multicast, CM changes
>    Mail 11/12 : IPoIB xmit API addition
>    Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)
>
> I have started a 10 run test for various buffer sizes and processes, and
> will post the results on Monday.
>
> Please review and provide feedback/ideas; and consider for inclusion.
>
> Thanks,
>
> - KK


From ogerlitz at voltaire.com  Mon Jul 23 02:56:12 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 23 Jul 2007 12:56:12 +0300 (IDT)
Subject: [ofa-general] 20% latency increase between UD to RC latency 
Message-ID: <Pine.LNX.4.64.0707231241550.4283@zuben>

OK, it's always good to start with facts on the ground... before
committing this test, my original thinking was that for messages
whose size=X is less than the IB Link level MTU it holds that:

	latency(X,UD) <= latency(X,UC) <= latency(X,RC)

Running the latency test provided with the perftest package on my systems (*)
I get the results below. Does anyone have insight into why the --minimal-- and typical
UD latency is 1us ( = 20%) worse than the --minimal-- and typical RC latency???

Or.

(*) the system spec is:
HW    : 4 way Intel Xeon 1.6GHz 4GB RAM
IB  HW: Arbel memfull (25208) DDR running in SDR mode
HCA FW: 4.8.200
IB  SW: OFED 1.2
OS     : RH4 U3 i386 smp

[root at rain5 ~]# /usr/bin/ib_send_lat -c RC -n 100000 172.30.8.61
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
   local address: LID 0x26 QPN 0x330407 PSN 0xf6ba57
  remote address: LID 0x28 QPN 0x40407 PSN 0xc2c9f9
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        100000           4.75          38.53             4.82
------------------------------------------------------------------

[root at rain5 ~]# /usr/bin/ib_send_lat -c UC -n 100000 172.30.8.61
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UC
   local address: LID 0x26 QPN 0x340407 PSN 0xbb4a0e
  remote address: LID 0x28 QPN 0x50407 PSN 0xb916a9
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        100000           4.71          42.03             4.77
------------------------------------------------------------------

[root at rain5 ~]# /usr/bin/ib_send_lat -c UD -n 100000 172.30.8.61
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UD
   local address: LID 0x26 QPN 0x350407 PSN 0xfdc2c0
  remote address: LID 0x28 QPN 0x60407 PSN 0x63c30e
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        100000           5.71          44.51             5.81
------------------------------------------------------------------


From shemminger at linux-foundation.org  Mon Jul 23 02:56:29 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Mon, 23 Jul 2007 10:56:29 +0100
Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
In-Reply-To: <OFFAA4E4C4.F1496A54-ON6525731F.0025194E-6525731F.0025379C@in.ibm.com>
References: <20070720172203.0eaeea86@oldman>
	<OFFAA4E4C4.F1496A54-ON6525731F.0025194E-6525731F.0025379C@in.ibm.com>
Message-ID: <20070723105629.278fcce3@oldman.hamilton.local>

On Sat, 21 Jul 2007 12:16:30 +0530
Krishna Kumar2 <krkumar2 at in.ibm.com> wrote:

> Stephen Hemminger <shemminger at linux-foundation.org> wrote on 07/20/2007
> 09:52:03 PM:
> > Patrick McHardy <kaber at trash.net> wrote:
> >
> > > Krishna Kumar2 wrote:
> > > > Patrick McHardy <kaber at trash.net> wrote on 07/20/2007 03:37:20 PM:
> > > >
> > > >
> > > >
> > > >> rtnetlink support seems more important than sysfs to me.
> > > >>
> > > >
> > > > Thanks, I will add that as a patch. The reason to add to sysfs is that
> > > > it is easier to change for a user (and similar to tx_queue_len).
> > > >
> > >
> >
> > But since batching is so similar to TSO, it really should be part of the
> > flags and controlled by ethtool like other offload flags.
> 
> So should I add all three interfaces (or which ones) :
> 
>       1. /sys (like for tx_queue_len)
>       2. netlink
>       3. ethtool.
> 
> Or only 2 & 3 are enough ?
> 

Yes, please do #3 and maybe #2.
Sysfs APIs are a long-term ABI problem.


From vlad at lists.openfabrics.org  Mon Jul 23 03:06:20 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 23 Jul 2007 03:06:20 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070723-0220 daily build status
Message-ID: <20070723100620.84D6EE60814@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From johnpol at 2ka.mipt.ru  Mon Jul 23 03:44:28 2007
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Mon, 23 Jul 2007 14:44:28 +0400
Subject: [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.
In-Reply-To: <20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070723104428.GC22877@2ka.mipt.ru>

Hi Krishna.

On Sun, Jul 22, 2007 at 02:35:25PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote:
> diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
> --- org/net/core/dev.c	2007-07-20 07:49:28.000000000 +0530
> +++ rev2/net/core/dev.c	2007-07-21 23:08:33.000000000 +0530
> @@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi
>  	}
>  }
>  
> +/*
> + * dev_change_tx_batching - Enable or disable batching for a driver that
> + * supports batching.
> + */
> +int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
> +{
> +	int ret;
> +
> +	if (!dev->hard_start_xmit_batch) {
> +		/* Driver doesn't support skb batching */
> +		ret = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	/* Handle invalid argument */
> +	if (new_batch_skb < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +
> +	/* Check if new value is same as the current */
> +	if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> +		goto out;

o_O

Scratched my head for too long before I understood what it means :)

> +	spin_lock(&dev->queue_lock);
> +	if (new_batch_skb) {
> +		dev->features |= NETIF_F_BATCH_ON;
> +		dev->tx_queue_len >>= 1;
> +	} else {
> +		if (!skb_queue_empty(&dev->skb_blist))
> +			skb_queue_purge(&dev->skb_blist);
> +		dev->features &= ~NETIF_F_BATCH_ON;
> +		dev->tx_queue_len <<= 1;
> +	}
> +	spin_unlock(&dev->queue_lock);

Hmm, should this also stop interrupts?

-- 
	Evgeniy Polyakov


From johnpol at 2ka.mipt.ru  Mon Jul 23 03:48:26 2007
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Mon, 23 Jul 2007 14:48:26 +0400
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070723104826.GD22877@2ka.mipt.ru>

On Sun, Jul 22, 2007 at 02:36:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote:
> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 07:49:28.000000000 +0530
> +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-22 00:08:37.000000000 +0530
> @@ -242,8 +242,9 @@ repost:
>  static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +	int i = 0, num_completions;
> +	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
>  	unsigned int wr_id = wc->wr_id;
> -	struct ipoib_tx_buf *tx_req;
>  	unsigned long flags;
>  
>  	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
> @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
>  		return;
>  	}
>  
> -	tx_req = &priv->tx_ring[wr_id];
> +	num_completions = wr_id - tx_ring_index + 1;
> +	if (num_completions <= 0)
> +		num_completions += ipoib_sendq_size;

Can this still be less than zero?

-- 
	Evgeniy Polyakov


From krkumar2 at in.ibm.com  Mon Jul 23 04:17:45 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 16:47:45 +0530
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070723104826.GD22877@2ka.mipt.ru>
Message-ID: <OF78ACF954.97C6291D-ON65257321.003BD95C-65257321.003E0CCE@in.ibm.com>

Hi Evgeniy,

Evgeniy Polyakov <johnpol at 2ka.mipt.ru> wrote on 07/23/2007 04:18:26 PM:

> >  static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
> >  {
> >     struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +   int i = 0, num_completions;
> > +   int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
> >     unsigned int wr_id = wc->wr_id;
> > -   struct ipoib_tx_buf *tx_req;
> >     unsigned long flags;
> >
> >     ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
> > @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
> >        return;
> >     }
> >
> > -   tx_req = &priv->tx_ring[wr_id];
> > +   num_completions = wr_id - tx_ring_index + 1;
> > +   if (num_completions <= 0)
> > +      num_completions += ipoib_sendq_size;
>
> Can this still be less than zero?

That should never happen; otherwise the TX code would have written to
bad/unallocated memory and crashed first.
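
The wrap-around arithmetic in the quoted hunk can be sketched in user
space as follows. This is an illustration only: the function name is
hypothetical, and it assumes that the ring size is a power of two and that
wr_id is the ring index of the last completed send.

```c
/*
 * Number of sends covered by one work completion, following the
 * arithmetic in the quoted hunk. Assumes ring_size is a power of two
 * and 0 <= wr_id < ring_size.
 */
int num_completions(unsigned int wr_id, unsigned int tx_tail,
		    unsigned int ring_size)
{
	int tail_index = (int)(tx_tail & (ring_size - 1));
	int n = (int)wr_id - tail_index + 1;

	if (n <= 0)
		n += (int)ring_size;	/* wr_id has wrapped past the ring end */
	return n;
}
```

Under those assumptions the result always lies between 1 and ring_size:
with a 64-entry ring, a tail index of 60 and wr_id 2 give 7 completions
(entries 60..63 and 0..2). A value that is still non-positive after the
correction would need a wr_id outside the ring.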

Thanks,

- KK


From krkumar2 at in.ibm.com  Mon Jul 23 04:17:25 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 16:47:25 +0530
Subject: [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.
In-Reply-To: <20070723104428.GC22877@2ka.mipt.ru>
Message-ID: <OFC6146D85.364CF696-ON65257321.003B4D10-65257321.003E0550@in.ibm.com>

Hi Evgeniy,

Evgeniy Polyakov <johnpol at 2ka.mipt.ru> wrote on 07/23/2007 04:14:28 PM:
> > +/*
> > + * dev_change_tx_batching - Enable or disable batching for a driver that
> > + * supports batching.

> > +   /* Check if new value is same as the current */
> > +   if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> > +      goto out;
>
> o_O
>
> Scratched my head for too long before I understood what it means :)

Is there an easy way to do this?
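
One possible spelling, sketched here in user-space C, names the two
booleans instead of comparing two `!!` expressions. The flag value and the
helper name below are hypothetical and are not taken from the patch:

```c
#include <stdbool.h>

#define SKETCH_BATCH_ON 0x2000UL /* hypothetical bit value, illustration only */

/* True when the requested setting already matches the feature bit. */
bool batching_unchanged(unsigned long features, unsigned long new_batch_skb)
{
	bool enabled = (features & SKETCH_BATCH_ON) != 0;
	bool wanted = new_batch_skb != 0;

	return enabled == wanted;
}
```

The behaviour is the same as the `!!` comparison: any non-zero
new_batch_skb counts as "on", and only the one feature bit is examined.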

> > +   spin_lock(&dev->queue_lock);
> > +   if (new_batch_skb) {
> > +      dev->features |= NETIF_F_BATCH_ON;
> > +      dev->tx_queue_len >>= 1;
> > +   } else {
> > +      if (!skb_queue_empty(&dev->skb_blist))
> > +         skb_queue_purge(&dev->skb_blist);
> > +      dev->features &= ~NETIF_F_BATCH_ON;
> > +      dev->tx_queue_len <<= 1;
> > +   }
> > +   spin_unlock(&dev->queue_lock);
>
> Hmm, should this also stop interrupts?

That is a good question, and I am not sure. I thought it
is not required, though adding it doesn't hurt the code
either. Can someone tell me whether disabling bh is required and
why? (I couldn't figure out the purpose of disabling bh in
dev_queue_xmit either; is it to disable preemption?)

Thanks,

- KK


From fujita.tomonori at lab.ntt.co.jp  Mon Jul 23 04:20:55 2007
From: fujita.tomonori at lab.ntt.co.jp (FUJITA Tomonori)
Date: Mon, 23 Jul 2007 20:20:55 +0900
Subject: [ofa-general] Re: commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21
In-Reply-To: <20070723075754.GC20614@mellanox.co.il>
References: <20070723075754.GC20614@mellanox.co.il>
Message-ID: <20070723202055P.fujita.tomonori@lab.ntt.co.jp>

From: "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
Subject: commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21
Date: Mon, 23 Jul 2007 10:57:54 +0300

> Hi!
> commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 includes this snippet:
> 
> @@ -468,20 +465,8 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
>                 req->fmr = NULL;
>         }
> 
> -       /*
> -        * This handling of non-SG commands can be killed when the
> -        * SCSI midlayer no longer generates non-SG commands.
> -        */
> -       if (likely(scmnd->use_sg)) {
> -               nents = scmnd->use_sg;
> -               scat  = scmnd->request_buffer;
> -       } else {
> -               nents = 1;
> -               scat  = &req->fake_sg;
> -       }
> -
> -       ib_dma_unmap_sg(target->srp_host->dev->dev, scat, nents,
> -                       scmnd->sc_data_direction);
> +       ib_dma_unmap_sg(target->srp_host->dev->dev, scsi_sglist(scmnd),
> +                       scsi_sg_count(scmnd), scmnd->sc_data_direction);
> 
> Since scsi_sg_count is simply use_sg, and scsi_sglist is simply request_buffer,
> why is this the right thing to do?

That will change shortly.

http://marc.info/?l=linux-scsi&m=118364319919621&w=2


> Is there a reason to believe that scsi_sg_count is never 0 here?

Yeah, scsi-ml doesn't send non-SG commands now.


From ogerlitz at voltaire.com  Mon Jul 23 04:30:21 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 23 Jul 2007 14:30:21 +0300 (IDT)
Subject: [ofa-general] Re: 20% latency increase between UD to RC latency 
In-Reply-To: <Pine.LNX.4.64.0707231241550.4283@zuben>
References: <Pine.LNX.4.64.0707231241550.4283@zuben>
Message-ID: <Pine.LNX.4.64.0707231422420.4835@zuben>

On Mon, 23 Jul 2007, Or Gerlitz wrote:

> my original thinking was that for messages
> whose size=X is less than the IB Link level MTU it holds that:
>
> 	latency(X,UD) <= latency(X,UC) <= latency(X,RC)
>
> Running the latency test provided with the perftest package on my systems (*)
> I get the results below. Does anyone have insight into why the --minimal-- and typical
> UD latency is 1us ( = 20%) worse than the --minimal-- and typical RC latency???

Running the latency test on a similar system (RH5 smp / four way Xeon 1.9GHz /
4GB RAM) but this time with the memfree --Hermon-- HCA (25418 / FW 2.1.0), the minimal
AND typical UD latency is very much the same as the RC and UC ones, which is ~1.5us

So it both fixes the UD issue on Arbel and improves the latency from 4.5us to 1.5us

nice,

Or.

[root at iris6 ~]# ib_send_lat -c RC 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
   local address: LID 0x06 QPN 0xb004a PSN 0xe6014a
  remote address: LID 0x05 QPN 0xb004a PSN 0xbe437f
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.39           8.08             1.42
------------------------------------------------------------------
[root at iris6 ~]# ib_send_lat -c UC 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UC
   local address: LID 0x06 QPN 0xc004a PSN 0xb4281e
  remote address: LID 0x05 QPN 0xc004a PSN 0xb14013
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.37           8.17             1.42
------------------------------------------------------------------
[root at iris6 ~]# ib_send_lat -c UD 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UD
   local address: LID 0x06 QPN 0xd004a PSN 0xf63264
  remote address: LID 0x05 QPN 0xd004a PSN 0xf7821
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.47           7.66             1.51
------------------------------------------------------------------

[root at iris6 ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                         2.1.000
        node_guid:                      0002:c903:0000:0434
        sys_image_guid:                 0002:c903:0000:0437
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       MT_04A0110002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               6
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


From ogerlitz at voltaire.com  Mon Jul 23 04:44:00 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 23 Jul 2007 14:44:00 +0300 (IDT)
Subject: [ofa-general] Re: 20% latency increase between UD to RC latency 
In-Reply-To: <Pine.LNX.4.64.0707231422420.4835@zuben>
References: <Pine.LNX.4.64.0707231241550.4283@zuben>
	<Pine.LNX.4.64.0707231422420.4835@zuben>
Message-ID: <Pine.LNX.4.64.0707231438310.4835@zuben>

On Mon, 23 Jul 2007, Or Gerlitz wrote:

>> Running the latency test provided with the perftest package on my systems (*)
>> I get the results below. Does anyone have insight into why the --minimal-- and typical
>> UD latency is 1us ( = 20%) worse than the --minimal-- and typical RC latency???

> Running the latency test on a similar system (RH5 smp / four way Xeon 1.9GHz /
> 4GB RAM) but this time with the memfree --Hermon-- HCA (25418 / FW 2.1.0), the minimal
> AND typical UD latency is very much the same as the RC and UC ones, which is ~1.5us
> So it both fixes the UD issue on Arbel and improves the latency from 4.5us to 1.5us

A third run, now over a memfree Sinai HCA (25204 / FW 1.2.0): the UD and RC latencies are
quite the same, around 5.3us, but the result is worse than the Arbel one by about 0.7us ...

Or.

[root at src1 ~]# ib_send_lat -c RC storm7
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
   local address: LID 0x09 QPN 0xd50407 PSN 0x7aaaf1
  remote address: LID 0x0b QPN 0x0405 PSN 0x292565
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           5.27          63.27             5.32
------------------------------------------------------------------
[root at src1 ~]# ib_send_lat -c UD storm7
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UD
   local address: LID 0x09 QPN 0xd60407 PSN 0xcc70ba
  remote address: LID 0x0b QPN 0x10405 PSN 0x6794b5
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           5.38          38.88             5.45
------------------------------------------------------------------


From hadi at cyberus.ca  Mon Jul 23 05:32:01 2007
From: hadi at cyberus.ca (jamal)
Date: Mon, 23 Jul 2007 08:32:01 -0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <OFFA8C2879.1D54F523-ON65257321.0010F240-65257321.001A8A31@in.ibm.com>
References: <OFFA8C2879.1D54F523-ON65257321.0010F240-65257321.001A8A31@in.ibm.com>
Message-ID: <1185193921.26013.37.camel@localhost>

KK,

On Mon, 2007-23-07 at 10:19 +0530, Krishna Kumar2 wrote:

> Hmmm ? Evgeniy has not even tested my code to find some regression :) And
> you may possibly not find much improvement in E1000 when you run iperf
> (which is what I do) compared to pktgen. 

Pktgen is the correct test (or the closest to correct) because it tests
the driver tx path. iperf/netperf test the effect of batching on
tcp/udp. In fact i would start with udp first. What you need to do if
testing end-2-end is see where the effects occur. For example, it is
feasible that batching is a little too aggressive and the receiver cant
keep up (netstat -s before and after will be helpful).
Maybe by such insight we can improve things.

> > My experiments show it is useful (in a very visible way using pktgen)
> > for e1000 to have the prep() interface.
> 
> I meant : have you compared results of batching with prep on vs prep off, and
> what is the difference in BW ?

Yes, and these results were sent to you as well a while back.
When i get the time when i get back i will look em up in my test machine
and resend.

> No. I see value only in non-LLTX drivers which also gets the same TX lock
> in the RX path.

So _which_ non-LLTX driver doesnt do that? ;->

> > The value is also there in LLTX drivers even if in just formatting a skb
> > ready for transmit. If this is not clear i could do a much longer
> > writeup on my thought evolution towards adding prep().
> 
> In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
> any case, so there should be no improvement. Could you send the write-up

I will - please give me sometime; i am overloaded at the moment.

> There is *nothing* IPoIB specific or focus in my code. I said adding prep
> doesn't work for IPoIB and so it is pointless to add bloat to the code until
> some code can
tun driver doesnt use it either - but i doubt that makes it "bloat"

>  What I meant to say
> is that there isn't much point in saying that your code is not ready or
> you are using old code base, or has multiple restart functions, or is not
> tested enough, etc, and then say let's re-do/rethink the whole
> implementation when my code is already working and giving good results.

The suggestive hand gesturing is the kind of thing that bothers me. What
do you think: Would i be submitting patches based on 2.6.22-rc4? Would
it make sense to include parallel qdisc paths? For heavens sake, i have
told you i would be fine with accepting such changes when the qdisc
restart changes went in first.
You waltz in, have the luxury of looking at my code, presentations, many
discussions with me etc ...
When i ask for differences to code you produced, they now seem to sum up
to the two below. You dont think theres some honest issue with this
picture?

> OTOH, if you find some cases that are better handled with :
>       1. prep handler
>       2. xmit_win (which I don't have now),
> then please send me patches and I will also test out and incorporate.
> 

And then of course you will end up adding those because they are both
useful, just calling them some other name. And then you will end up
incorporating all the drivers i invested many hours (as a gratuitous
volunteer) to change and test - maybe you will change variable names or
rearrange some function.
I am a very compromising person; i have no problem coauthoring these
patches if you actually invest useful time like fixing things up and
doing proper tests. But you are not doing that - instead you are being
extremely aggressive and hijacking the whole thing. It is courteous if
you find somebody else has a patch you point out whats wrong preferably
with some proof. 

> > It sounds disingenuous but i may have misread you.
> 
> ("lacking in frankness, candor, or sincerity; falsely or hypocritically
> ingenuous; insincere") ???? Sorry, no response to personal comments and
> have a flame-war :)

Give me a better description. 

cheers,
jamal


From hal.rosenstock at gmail.com  Mon Jul 23 05:54:22 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 05:54:22 -0700
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <46A36E77.5020307@gmail.com>
References: <46A36E77.5020307@gmail.com>
Message-ID: <f0e08f230707230554q2fe826e0iffc8624668702fe3@mail.gmail.com>

On 7/22/07, Moni Shoua <monisonlists at gmail.com> wrote:
>
> IPoIB turns on the P_Key membership bit of limited membership P_Keys
> when creating a child interface. After that IPoIB looks for the full
> membership P_key in the table to make the interface "RUNNING". This
> patch fixes the pkey lookup in order to match full and partial membership
> keys that belong to the same partition.
>
> device.c |    2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> Index: infiniband/drivers/infiniband/core/device.c
> ===================================================================
> --- infiniband.orig/drivers/infiniband/core/device.c    2007-07-08 12:45:07.000000000 +0300
> +++ infiniband/drivers/infiniband/core/device.c 2007-07-22 17:43:32.440829619 +0300
> @@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic
>                if (ret)
>                        return ret;
>
> -               if (pkey == tmp_pkey) {
> +               if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {


Wouldn't this allow 2 limited PKeys to match though ?

-- Hal

>                        *index = i;
>                        return 0;
>                }
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
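
The concern about limited P_Keys can be illustrated with a small
user-space sketch. The macro and function names are made up for this
example. The only fact relied on is that the top bit of a 16-bit P_Key is
the membership bit (set for full membership, clear for limited) and the
low 15 bits identify the partition.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_MEMBERSHIP_BIT 0x8000u /* set for full membership */
#define PKEY_BASE_MASK      0x7fffu /* low 15 bits: partition number */

/* Comparison as in the patched ib_find_pkey(): membership bit ignored. */
bool pkey_same_partition(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/* True when the key carries full membership. */
bool pkey_is_full_member(uint16_t pkey)
{
	return (pkey & PKEY_MEMBERSHIP_BIT) != 0;
}
```

With the masked comparison, 0x8001 and 0x0001 match, which is the intent
of the patch, but 0x0001 also matches 0x0001, i.e. two limited keys.
A caller that must rule that case out could additionally require
pkey_is_full_member() on at least one of the two keys.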

From sashak at voltaire.com  Mon Jul 23 05:59:28 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 15:59:28 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com>
Message-ID: <20070723125928.GU16597@sashak.voltaire.com>

Hi Eitan,

On 10:35 Mon 23 Jul     , Eitan Zahavi wrote:
> Hi Sasha,
> 
> > On 14:59 Sun 22 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha
> > > 
> > > Let's assume someone has reset a switch on the fabric.
> > > What would cause the SM to re-assign the LFT of that switch?
> > 
> > OpenSM will sweep and drop this switch and when switch will 
> > back it will be initialized again. But if the reset was too 
> > fast (relative to discovery), we can be in trouble (and maybe 
> > not only with LFTs).
> > 
> > > I assumed that there is a mechanism to do that.
> > 
> > Not for "fast" switch reboot.
> So we have a problem with these fast resetting devices.
> > 
> > Hmm, I think we could try to detect this case by comparing 
> > SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even
> > by seeing that PortInfo:LID is not set. Something like below:
> > 
> I think we should have a predicate that will be used to mark a
> port/device as needing a full update.

Agreed, but what are the best criteria? LID == 0 will work in many cases,
but LID initialization is not required by the spec. The only strong
requirement I found is Port State. Any other ideas?

> Not just LFT but everything (SL2VL, VLArb, LID, PKey ... If a device was
> reset then it probably lost everything). 

Right, all incrementally updated data should be flushed; osm_physp
and osm_switch are the affected objects.

> Another approach is to mark it for the entire fabric. 

It is too expensive IMO, and not much easier to implement.

Sasha

> 
> The original intention of kill -HUP was to force a new heavy sweep and
> setup.
> I think another signal is acceptable but not required.
> 
> Thanks
> 
> Eitan


From sashak at voltaire.com  Mon Jul 23 06:09:12 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 16:09:12 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer
	toopensm-coding-style.txt
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com>
References: <20070722221455.GR27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com>
Message-ID: <20070723130912.GV16597@sashak.voltaire.com>

On 10:31 Mon 23 Jul     , Eitan Zahavi wrote:
> 
> So we will finally have a common enforced coding style!
> When do you plan to run it on all the files?

In the "spare" time :). I'm thinking about doing this in steps, one
subdirectory at a time, starting with the header files. It would also be
nice not to do huge styling updates during the OFED 1.3 cycle.

> Or should we just make sure every new committed file will first pass
> this indent?

This is a good option; however, it would be nice not to mix style-fixing
patches with functional ones (more or less as described in
opensm/doc/opensm-coding-style.txt).

Sasha

> 
> Thanks
> 
> Eitan
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org 
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> > Sasha Khapyorsky
> > Sent: Monday, July 23, 2007 1:15 AM
> > To: general at lists.openfabrics.org
> > Cc: Yevgeny Kliteynik
> > Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go 
> > closer toopensm-coding-style.txt
> > 
> > 
> > This updates the script according to recent 
> > doc/opensm-coding-style.txt (in short K&R, tabs, etc.).
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > ---
> >  opensm/opensm/osm_indent |   57 +++------------------------------------------
> >  1 files changed, 4 insertions(+), 53 deletions(-)
> > 
> > diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent
> > index bed2ba1..621184b 100755
> > --- a/opensm/opensm/osm_indent
> > +++ b/opensm/opensm/osm_indent
> > @@ -1,6 +1,6 @@
> >  #!/bin/bash
> >  #
> > -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
> > +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> >  # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> >  # Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >  #
> > @@ -40,56 +40,7 @@
> >  #  Environment:
> >  #  	Linux User Mode
> >  #
> > -#  $Revision: 1.4 $
> > -#
> > -#
> > -# This is the indent format used for OpenSM.
> > -#
> > -# format the source code according to the ACD standard
> > -# -bad	Blank line after declarations
> > -# -bap	Blank line after Procedures
> > -# -bbb	Blank line before block comments
> > -# -nbbo	Break after Boolean operator
> > -# -bl	Break after if line
> > -# -bli0 Indent for braces is 0
> > -# -bls	Break after struct declarations
> > -# -cbi0	Case break indent 0
> > -# -ci3	Continue indent 3 spaces
> > -# -cli0	Case label indent 0 spaces
> > -# -ncs	No space after cast operator
> > -# -hnl	Honor existing newlines on long lines
> > -# -i3	Substitute indent with 3 spaces
> > -# -npcs	No space after procedure calls
> > -# -prs	Space after parenthesis
> > -# -nsai	No space after if keyword - removed
> > -# -nsaw	No space after while keyword - removed
> > -# -sc	Put * at left of comments in a block comment style
> > -# -nsob	Don't swallow unnecessary blank lines
> > -# -ts3	Tab size is 3
> > -# -psl	Type of procedure return in a separate line
> > -# -bfda	Function declaration arguments in a separate line.
> > -# -nut   No tabs as we allow spaces
> > -#
> > -#########################################################################
> > -
> > -# indent the world
> > -for sourcefile in $*; do
> > -    if test -f "$sourcefile"; then
> > -        # first, string DOS style linefeeds
> > -        perl -piW -e's/\x0D//' "$sourcefile"
> > -        echo Processing $sourcefile
> > -        indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \
> > -                -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile
> > -
> > -        rm ${sourcefile}W
> > +# This is the indent format used for OpenSM (similar to one used in
> > +# linux/scripts/Lindent).
> >  
> > -        # the -bb also affect the first line in each file - so clean it up
> > -        if test `head -1 $sourcefile | egrep -v '^$' | wc -l` = 0; then
> > -            echo Cleaning up first empty line of $sourcefile
> > -            awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W
> > -            mv -f ${sourcefile}W $sourcefile
> > -        fi
> > -    else
> > -        echo Could not find file:$sourcefile
> > -    fi
> > -done
> > +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@"
> > --
> > 1.5.3.rc2.29.gc4640f
> > 


From mst at dev.mellanox.co.il  Mon Jul 23 07:31:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 23 Jul 2007 17:31:28 +0300
Subject: [ofa-general] open-isci patches updated
Message-ID: <20070723143128.GL20614@mellanox.co.il>

Hi!
I have updated the ofed_kernel tree to 2.6.23-rc1.
I had to update the following backport patches because of conflicts:

kernel_patches/backport/2.6.16_sles10/open-iscsi-tx-hash-fixes.patch
kernel_patches/backport/2.6.16_sles10_sp1/open-iscsi-tx-hash-fixes.patch
kernel_patches/backport/2.6.18_FC6/open-iscsi-tx-hash-fixes.patch
kernel_patches/backport/2.6.18/open-iscsi-tx-hash-fixes.patch

Erez, could you please check that I did the right thing there?

The code is here:

git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel

Thanks,
	MST

-- 
MST


From hal.rosenstock at gmail.com  Mon Jul 23 07:44:50 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 10:44:50 -0400
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <46A4BE3E.4080606@gmail.com>
References: <46A36E77.5020307@gmail.com>
	<f0e08f230707230554q2fe826e0iffc8624668702fe3@mail.gmail.com>
	<46A4BE3E.4080606@gmail.com>
Message-ID: <f0e08f230707230744i7c03acb9he74bbb912fa6d306@mail.gmail.com>

Hi Moni,

On 7/23/07, Moni Shoua <monisonlists at gmail.com> wrote:
>
> Hal Rosenstock wrote:
> >
> >     -               if (pkey == tmp_pkey) {
> >     +               if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
> >
> >
> > Wouldn't this allow 2 limited PKeys to match though ?
> Hi Hal,
> Can you please explain what do you mean? Perhaps by example?


Two P_Keys which have their full membership bit (0x8000) off. Two limited
members are not allowed to talk with each other.

-- Hal
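
To make the concern concrete, here is a minimal C sketch that puts the two comparisons side by side. It is not the actual ib_find_pkey() code: the helper names, the macros, and the plain array standing in for the port's P_Key table are all assumptions made for illustration. Whether the "two limited members" rule should apply to a local table lookup is exactly the open question in this thread; the sketch only shows that a masked compare treats two limited keys of the same partition as equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u /* top bit set: full membership */
#define PKEY_BASE_MASK   0x7fffu /* low 15 bits: partition number */

/* The comparison used in the patch: same partition, membership bit ignored. */
static int pkey_same_partition(uint16_t a, uint16_t b)
{
	return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/* Hypothetical stricter check: same partition AND at least one full member. */
static int pkey_may_communicate(uint16_t a, uint16_t b)
{
	return pkey_same_partition(a, b) &&
	       ((a | b) & PKEY_FULL_MEMBER) != 0;
}

/*
 * Masked table lookup in the spirit of the patched loop (illustrative only).
 * Returns 0 and stores the slot in *index on a match, -1 otherwise.
 */
static int find_pkey_masked(const uint16_t *table, size_t len,
			    uint16_t pkey, size_t *index)
{
	size_t i;

	assert(table != NULL && index != NULL);
	for (i = 0; i < len; i++) {
		if (pkey_same_partition(pkey, table[i])) {
			*index = i;
			return 0;
		}
	}
	return -1;
}
```

With these helpers, 0x1234 and 0x1234 (both limited) compare as the same partition, yet pkey_may_communicate() rejects the pair; 0x1234 against 0x9234 passes both checks.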


>
> > -- Hal
> >
> >                            *index = i;
> >                            return 0;
> >                    }

From davem at systemfabricworks.com  Mon Jul 23 07:52:55 2007
From: davem at systemfabricworks.com (David McMillen)
Date: Mon, 23 Jul 2007 09:52:55 -0500
Subject: [ofa-general] Command specification of ca_name and ca_port
Message-ID: <46A4C0C7.7020107@systemfabricworks.com>


There is a standard set of command line options that allow 
specification of the CA to use for sending the requests.  I'm adding 
these to programs that don't have them, since they are very useful when 
diagnosing a node connected to multiple subnets.  Even if you discount 
multiple subnets on purpose, sometimes this happens when the hardware 
connecting all of the CA ports to the same place gets broken, and that 
is when you need diagnostics that can help figure out what is where.

The standard options are:

       -C <ca_name>    use the specified ca_name.

       -P <ca_port>    use the specified ca_port.

       -t <timeout_ms> override the default timeout for the solicited mads.
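
As a rough illustration of how a tool could accept these three options, here is a minimal getopt() sketch in C. The struct and function names are hypothetical and are not taken from the existing diagnostic programs; only the option letters come from the list above.

```c
#define _POSIX_C_SOURCE 200809L /* make getopt() visible in strict modes */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the three common options. */
struct ca_opts {
	const char *ca_name; /* -C, NULL if not given */
	int ca_port;         /* -P, 0 if not given */
	int timeout_ms;      /* -t, 0 means "use the default" */
};

/* Convert a numeric option argument; returns -1 if it is not a number. */
static int parse_number(const char *s, int *out)
{
	char *end;
	long v = strtol(s, &end, 0);

	if (end == s || *end != '\0')
		return -1;
	*out = (int)v;
	return 0;
}

/* Parse -C/-P/-t. Returns 0 on success, -1 on a bad or unknown option. */
static int parse_ca_opts(int argc, char *argv[], struct ca_opts *o)
{
	int c;

	assert(o != NULL);
	o->ca_name = NULL;
	o->ca_port = 0;
	o->timeout_ms = 0;
	optind = 1; /* restart scanning so the function can be called again */

	while ((c = getopt(argc, argv, "C:P:t:")) != -1) {
		switch (c) {
		case 'C':
			o->ca_name = optarg;
			break;
		case 'P':
			if (parse_number(optarg, &o->ca_port))
				return -1;
			break;
		case 't':
			if (parse_number(optarg, &o->timeout_ms))
				return -1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

A tool whose option letters are already taken would need different letters or long options in the optstring; the sketch does not try to settle that.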

My problem is that saquery already uses -C and -P for other purposes, 
although its -t exists for the expected purpose.  Also, ibcheckerrs 
already uses -t for specifying the threshold file.

Changing the timeout for ibcheckerrs isn't critical, but not being able 
to do it doesn't seem right.  However, the saquery command could be 
really handy for figuring out split fabrics, and is useful to those of 
us that connect to multiple subnets.

Does anybody have a useful suggestion?

Thanks,
   Dave McMillen


From hal.rosenstock at gmail.com  Mon Jul 23 08:30:31 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 11:30:31 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <20070722174048.GO27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
Message-ID: <f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>

Hi Sasha,

On 7/22/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 14:59 Sun 22 Jul     , Eitan Zahavi wrote:
> > Hi Sasha
> >
> > Let's assume someone has reset a switch on the fabric.
> > What would cause the SM to re-assign the LFT of that switch?
>
> OpenSM will sweep and drop this switch and when switch will back it will
> be initialized again. But if the reset was too fast (relative to
> discovery), we can be in trouble (and maybe not only with LFTs).
>
> > I assumed that there is a mechanism to do that.
>
> Not for "fast" switch reboot.
>
> Hmm, I think we could try to detect this by comparing
> SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even by seeing
> that PortInfo:LID is not set.


Not sure about checking PortInfo:LID. Wouldn't that approach need to be
qualified by PortState (armed or active) ? LFTTop seems better to me or
perhaps a combination of the two but I may be missing something.
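
One possible reading of the "combination" idea, as a minimal C sketch: treat LID == 0 as evidence of a reset only when the port is Armed or Active, and separately flag a LinearFDBTop that no longer matches what the SM last programmed. The struct, the function name, and the expected_top parameter (standing in for p_sw->max_lid_ho) are assumptions for illustration, not OpenSM code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Port states numbered as in the PortInfo:PortState field. */
enum port_state {
	PORT_DOWN = 1,
	PORT_INIT = 2,
	PORT_ARMED = 3,
	PORT_ACTIVE = 4
};

/* Hypothetical snapshot of what a sweep just read from switch port 0. */
struct sw_snapshot {
	uint16_t base_lid;       /* PortInfo:LID, host order */
	uint16_t linear_fdb_top; /* SwitchInfo:LinearFDBTop, host order */
	enum port_state state;   /* PortInfo:PortState */
};

/*
 * Returns nonzero if the switch looks like it lost its configuration.
 * expected_top is the LinearFDBTop value the SM believes it last set.
 */
static int sw_needs_full_update(const struct sw_snapshot *s,
				uint16_t expected_top)
{
	int lid_lost, top_lost;

	assert(s != NULL);
	/* LID == 0 only counts when the port claims to be Armed or Active. */
	lid_lost = (s->state == PORT_ARMED || s->state == PORT_ACTIVE) &&
		   s->base_lid == 0;
	top_lost = s->linear_fdb_top != expected_top;
	return lid_lost || top_lost;
}
```

Both checks depend on the switch actually clearing these fields on reset, so the predicate is only as reliable as that behavior.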


> Something like below:
>
>
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index 5b2b19e..62c072f 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -112,6 +112,7 @@ typedef struct _osm_switch
>        osm_fwd_tbl_t                           fwd_tbl;
>        osm_mcast_tbl_t                         mcast_tbl;
>        uint32_t                                discovery_count;
> +       unsigned                                update_ft;
>        void                                    *priv;
> } osm_switch_t;
> /*
> @@ -152,6 +153,10 @@ typedef struct _osm_switch
> *              during the current fabric sweep.  This number is reset
> *              to zero at the start of a sweep.
> *
> +*      update_ft
> +*              When set fwd tables will be updated regardless to entry
> +*              values locally stored in fwd tables images
> +*
> * SEE ALSO
> *      Switch object
> *********/
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
> index adece65..8bbbcac 100644
> --- a/opensm/opensm/osm_port_info_rcv.c
> +++ b/opensm/opensm/osm_port_info_rcv.c
> @@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port(
>       break;
>     }
>   }
> +  else if (port_num == 0 && p_node->sw &&
> +           (!p_pi->base_lid || !p_pi->master_sm_base_lid))
> +    p_node->sw->update_ft = 1;
>
>   /*
>     Update the PortInfo attribute.
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index b44a3ba..03516ae 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table(
>        osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ;
>        block_id_ho++ )
>   {
> -    if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
> +    if (!p_sw->update_ft &&
> +        !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
>       continue;
>
>     if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
> @@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table(
>     }
>   }
>
> +  p_sw->update_ft = 0;
>   OSM_LOG_EXIT( p_mgr->p_log );
> }
>
>
>
> BTW what do you think is the best way to detect switch power up? I
> didn't really find a strong requirement for at powerup initialization of
> any suitable component.


Peer switch link state change is insufficient to differentiate switch reboot
from "normal" link up/down. There is no IB standard indication of this.


> > Anyway, kill -HUP should flush out the state and restart from scratch.
>
> Thinking more about it I'm not sure. Similar flush will be required for
> another "stored" components like pkey, sl2vl tables etc.. So it is more
> than just "regular" heavy sweep, another signal or option could be used
> for this, but OTOH it becomes very close to OpenSM restarting..


Shouldn't this be automatic rather than requiring the admin to issue a
signal somehow ?

-- Hal


> Sasha
>
> >
> >
> > Eitan
> >
> > > -----Original Message-----
> > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > Sent: Sunday, July 22, 2007 1:22 PM
> > > To: Eitan Zahavi
> > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik
> > > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration
> > >
> > > Hi Eitan,
> > >
> > > On 09:36 Sun 22 Jul     , Eitan Zahavi wrote:
> > > > Hi Sasha
> > > >
> > > > I am running some tests manually and apparently it looks
> > > like I found
> > > > a bug. Here is the sequence of things:
> > > > 1. SM sweeps the fabric assign LFTs
> > > > 2. I manually modify some LFTs (single entry now marked
> > > UNREACHABLE 3.
> > > > I force some switch change bit to 1 or issue kill -HUP 4. The SM
> > > > reports SUBNET UP 5. The modified LFT entry is still
> > > UNREACHABLE and
> > > > the path is broken
> > >
> > > Right, in most cases (unless OpenSM has its own changes in
> > > the same LFT
> > > block) OpenSM will refer its own LFT image for  "need to update"
> > > decision, so _manual_ changes will not trigger new update.
> > > Rerunning OpenSM should help however.
> > >
> > > > It looks to me some optimization of routing does not fully reroute
> > > > unless some condition is met - but that condition does not
> > > include the
> > > > above triggers listed in step 3.
> > >
> > > Rereading all fabrics LFTs by default seems to be too
> > > expensive operations. At least by default, if it is real
> > > requirement this could be enforced manually, for example when
> > > kill -HUP is used. Thoughts?
> > >
> > > Sasha
> > >
>

From hal.rosenstock at gmail.com  Mon Jul 23 08:33:50 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 11:33:50 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
Message-ID: <f0e08f230707230833p6080b92h52d32a07b853bbdd@mail.gmail.com>

On 7/23/07, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
> Hi Sasha,
>
> On 7/22/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> > On 14:59 Sun 22 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha
> > >
> > > Let's assume someone has reset a switch on the fabric.
> > > What would cause the SM to re-assign the LFT of that switch?
> >
> > OpenSM will sweep and drop this switch and when switch will back it will
> > be initialized again. But if the reset was too fast (relative to
> > discovery), we can be in trouble (and maybe not only with LFTs).
> >
> > > I assumed that there is a mechanism to do that.
> >
> > Not for "fast" switch reboot.
> >
> > Hmm, I think we could try to detect this by comparing
> > SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even by seeing
> > that PortInfo:LID is not set.
>
>
> Not sure about checking PortInfo:LID. Wouldn't that approach need to be
> qualified by PortState (armed or active) ? LFTTop seems better to me or
> perhaps a combination of the two but I may be missing something.
>

Another thought on this :-(

Not sure that resetting either LID or LFTTop is required by the spec either,
so this is relying on "beyond the spec" behavior and may not hold for all
switch implementations.

-- Hal



From eitan at mellanox.co.il  Mon Jul 23 10:59:21 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 23 Jul 2007 20:59:21 +0300
Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>

Hi Sasha, Hal,
 
I think I have an idea:
 
Since it is a specific switch that reported the ChangeBit or Trap, why
can't we just verify that there was no change in the switch setup?
We could send PortInfo, SwitchInfo, LFT, MFT, SL2VL, VLArb, PKey queries
and make sure nothing changed from the previous state. Or we could simply
enforce the last state by sending it over again ...
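
For the LFT part of this idea, the check could look roughly like the C sketch below: compare the SM's stored image with what the switch reports, block by block, and note which blocks would have to be resent. The function name and the flat byte arrays are assumptions for illustration; the 64-entry block size follows the memcmp(..., 64) in the patch quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE 64 /* entries per linear forwarding table block */

/*
 * Compare the stored LFT image with the one read back from the switch.
 * stale[i] is set to 1 for every block that differs, 0 otherwise.
 * Returns the number of blocks that would need to be resent.
 */
static size_t lft_find_stale_blocks(const uint8_t *stored,
				    const uint8_t *reported,
				    size_t num_blocks, uint8_t *stale)
{
	size_t i, count = 0;

	assert(stored != NULL && reported != NULL && stale != NULL);
	for (i = 0; i < num_blocks; i++) {
		size_t off = i * LFT_BLOCK_SIZE;

		stale[i] = memcmp(stored + off, reported + off,
				  LFT_BLOCK_SIZE) != 0;
		count += stale[i];
	}
	return count;
}
```

The "simply enforce last state" alternative skips the read-back entirely and resends every block for that switch, trading extra writes for one less round of queries.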
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 

From eitan at mellanox.co.il  Mon Jul 23 11:05:23 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 23 Jul 2007 21:05:23 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent:
In-Reply-To: <20070723130912.GV16597@sashak.voltaire.com>
References: <20070722221455.GR27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com>
	<20070723130912.GV16597@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com>

Hi Sasha,

I read the new coding style doc after this last mail.
I thought you only defined new "indentation rules" and I am for doing
this step as it is automatic and safe.
But rewriting the code with shorter names and replacing all variables 
and functions seems a little too risky in my mind.


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Monday, July 23, 2007 4:09 PM
> To: Eitan Zahavi
> Cc: general at lists.openfabrics.org; Yevgeny Kliteynik
> Subject: Re: [ofa-general] [PATCH resend] opensm/osm_indent: go closer to
> opensm-coding-style.txt
> 
> On 10:31 Mon 23 Jul     , Eitan Zahavi wrote:
> > 
> > So we will finally have a common enforced coding style!
> > When do you plan to run it on all the files?
> 
> In the "spare" time :). I'm thinking about doing this in 
> steps by subdirectories starting from header files. Also 
> would be nice to not do huge styling updates during OFED 1.3 cycle.
> 
> > Or should we just make sure every new committed file will first pass
> > this indent?
> 
> This is a good option, however it would be nice not to mix
> style-fixing patches with functional ones (more or less
> as described in opensm/doc/opensm-coding-style.txt).
> 
> Sasha
> 
> > 
> > Thanks
> > 
> > Eitan
> > 
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: general-bounces at lists.openfabrics.org
> > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha 
> > > Khapyorsky
> > > Sent: Monday, July 23, 2007 1:15 AM
> > > To: general at lists.openfabrics.org
> > > Cc: Yevgeny Kliteynik
> > > Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer to
> > > opensm-coding-style.txt
> > > 
> > > 
> > > This updates the script according to recent 
> > > doc/opensm-coding-style.txt (in short K&R, tabs, etc.).
> > > 
> > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > > ---
> > >  opensm/opensm/osm_indent |   57 +++------------------------------------------
> > >  1 files changed, 4 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent 
> > > index bed2ba1..621184b 100755
> > > --- a/opensm/opensm/osm_indent
> > > +++ b/opensm/opensm/osm_indent
> > > @@ -1,6 +1,6 @@
> > >  #!/bin/bash
> > >  #
> > > -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
> > > +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> > >  # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> > >  # Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > >  #
> > > @@ -40,56 +40,7 @@
> > >  #  Environment:
> > >  #  	Linux User Mode
> > >  #
> > > -#  $Revision: 1.4 $
> > > -#
> > > -#
> > > -# This is the indent format used for OpenSM.
> > > -#
> > > -# format the source code according to the ACD standard
> > > -# -bad	Blank line after declarations
> > > -# -bap	Blank line after Procedures
> > > -# -bbb	Blank line before block comments
> > > -# -nbbo	Break after Boolean operator
> > > -# -bl	Break after if line
> > > -# -bli0 Indent for braces is 0
> > > -# -bls	Break after struct declarations
> > > -# -cbi0	Case break indent 0
> > > -# -ci3	Continue indent 3 spaces
> > > -# -cli0	Case label indent 0 spaces
> > > -# -ncs	No space after cast operator
> > > -# -hnl	Honor existing newlines on long lines
> > > -# -i3	Substitute indent with 3 spaces
> > > -# -npcs	No space after procedure calls
> > > -# -prs	Space after parenthesis
> > > -# -nsai	No space after if keyword - removed
> > > -# -nsaw	No space after while keyword - removed
> > > -# -sc	Put * at left of comments in a block comment style
> > > -# -nsob	Don't swallow unnecessary blank lines
> > > -# -ts3	Tab size is 3
> > > -# -psl	Type of procedure return in a separate line
> > > -# -bfda	Function declaration arguments in a separate line.
> > > -# -nut   No tabs as we allow spaces
> > > -#
> > > -#########################################################################
> > > -
> > > -# indent the world
> > > -for sourcefile in $*; do
> > > -    if test -f "$sourcefile"; then
> > > -        # first, string DOS style linefeeds
> > > -        perl -piW -e's/\x0D//' "$sourcefile"
> > > -        echo Processing $sourcefile
> > > -        indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \
> > > -                -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile
> > > -
> > > -        rm ${sourcefile}W
> > > +# This is the indent format used for OpenSM (similar to one used in
> > > +# linux/scripts/Lindent).
> > >  
> > > -        # the -bb also affect the first line in each file - so clean it up
> > > -        if test `head -1 $sourcefile | egrep -v '^$' | wc -l` = 0; then
> > > -            echo Cleaning up first empty line of $sourcefile
> > > -            awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W
> > > -            mv -f ${sourcefile}W $sourcefile
> > > -        fi
> > > -    else
> > > -        echo Could not find file:$sourcefile
> > > -    fi
> > > -done
> > > +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@"
> > > --
> > > 1.5.3.rc2.29.gc4640f
> > > 
> 


From mshefty at ichips.intel.com  Mon Jul 23 11:10:08 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 11:10:08 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A46A1D.6040000@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com>
Message-ID: <46A4EF00.9070305@ichips.intel.com>

> What I have in mind is that IPoIB must not use cached IB path info.
> 
> If the IB stack has path caching which is in the default flow of 
> requesting a path record, it should provide an API (eg flag to the 
> function through which one does path query) to request a non cached path.

Argh!  This was the original design.  I believe the current design is a 
better approach.  The ULP shouldn't care whether the PR is cached or not 
- only that it's usable.

> The design I was thinking to suggest for IPoIB is to almost always use 
> this API since this policy makes the implementation consistent with the 
> decisions made by the network stack neighbour cache

This defeats one of the benefits of caching, which is using a single 
GetTable query, versus literally hundreds or thousands of Get queries. 
Consider that constant all-to-all communication using IPoIB between 1024 
ports, with a 15 minute ARP table timeout would hit the SA with close to 
600 queries per second.
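
The "close to 600" figure can be reproduced with a rough back-of-the-envelope sketch. It assumes one Get query per unordered pair of ports per ARP timeout period; that counting rule and the function name are assumptions made for illustration, not something stated in the thread.

```shell
#!/bin/bash
# Rough estimate of the SA query rate for all-to-all IPoIB traffic.
# Assumption (not from the thread): one Get query per unordered pair of
# ports per ARP timeout period.
sa_queries_per_second() {
	local ports=$1 timeout_s=$2
	# Number of unordered port pairs: n * (n - 1) / 2
	local pairs=$(( ports * (ports - 1) / 2 ))
	# Integer division, so the result is rounded down.
	echo $(( pairs / timeout_s ))
}

# 1024 ports, 15 minute ARP table timeout
sa_queries_per_second 1024 $(( 15 * 60 ))    # prints 581
```

With 1024 ports there are 523776 unordered pairs; spread over 900 seconds that is about 582 queries per second, which is consistent with the estimate above.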

I agree with Michael that it would be better for a ULP to invalidate 
cache entries.

While I agree that there's the potential for a problem, given that IPoIB 
has always cached PRs and no one has reported problems, I think we're 
overstating the likelihood of issues occurring in practice.  Even the SA 
caches the path data -- getting a PR from the SA doesn't provide any 
additional guarantees.

- Sean


From mshefty at ichips.intel.com  Mon Jul 23 11:38:26 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 11:38:26 -0700
Subject: [ofa-general] Re: I think that there is a resource leak in the core
	file mad_rmpp.c
In-Reply-To: <46A45A8C.2090800@dev.mellanox.co.il>
References: <46A45A8C.2090800@dev.mellanox.co.il>
Message-ID: <46A4F5A2.2020508@ichips.intel.com>

> I reviewed the file mad_rmpp.c and it seems that there is a leak of the 
> Address Handle.
> The AH that is being created in the function "alloc_response_msg" is 
> never being destroyed.

The AH is destroyed in ib_rmpp_send_handler().

- Sean


From jgunthorpe at obsidianresearch.com  Mon Jul 23 11:41:05 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 23 Jul 2007 12:41:05 -0600
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A4EF00.9070305@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com> <46A2F696.4060007@voltaire.com>
	<adafy3f22z5.fsf@cisco.com> <46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
Message-ID: <20070723184105.GB19768@obsidianresearch.com>

On Mon, Jul 23, 2007 at 11:10:08AM -0700, Sean Hefty wrote:

> >The design I was thinking to suggest for IPoIB is to almost always use 
> >this API since this policy makes the implementation consistent with the 
> >decisions made by the network stack neighbour cache
> 
> This defeats one of the benefits of caching, which is using a single 
> GetTable query, versus literally hundreds or thousands of Get queries. 
> Consider that constant all-to-all communication using IPoIB between 1024 
> ports, with a 15 minute ARP table timeout would hit the SA with close to 
> 600 queries per second.
> 
> I agree with Michael that it would be better for a ULP to invalidate 
> cache entries.

Well, in my view, this is exactly the sort of thing you should not
do. ULPs have no better idea what is going on. ARP expiry doesn't give
you any special information about the cached PR.

If kernel caching is used it must be viewed as authoritative and kept
current through some kind of external mechanism. Something like
re-doing the big GetTable query prior to starting a job is a fine
interim way to do this. Ideally updating the kernel sa cache would also
push updated data into the neighbor AH structures as appropriate. Then
there is one single source of PR data, one source of IP -> GID
mappings, etc.

These problems are just an unavoidable part of trying to use caching -
build the mechanism to support coherent replication and just deal with
these downsides <shrug>. Sean is basically doing non-coherent
replication today with his big GetTable query and that sounds like
what is speeding things up, not caching individual PRs.

> overstating the likelihood of issues occurring in practice.  Even
> the SA

Well, any time you renumber your network you will get burned and have
to restart ipoib on every node with the way things are
today. Something like a SM upgrade or changing to a new vendor SM, or
increasing LMC could do this to you.

> caches the path data -- getting a PR from the SA doesn't provide any
> additional guarantees.

Erm, any SA that returns a PR that is invalid in the network outside
the time the network is being updated is seriously busted, IMHO.

Jason


From sean.hefty at intel.com  Mon Jul 23 12:39:09 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 12:39:09 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070723184105.GB19768@obsidianresearch.com>
Message-ID: <000001c7cd61$24301080$9c98070a@amr.corp.intel.com>

>If kernel caching is used it must be viewed as authoritative and kept
>current through some kind of external mechanism. Something like
>re-doing the big GetTable query prior to starting a job is a fine
>interim way to do this. Ideally updating the kernel sa cache would also
>push updated data into the neighbor AH structures as appropriate. Then
>there is one single source of PR data, one source of IP -> GID
>mappings, etc.

IB does not define anything to standardize distributing SA data.  The proposed
solution works with any SM, and supports SA events to the degree that they've
been defined.

The local SA does not define vendor-specific SA extensions, though such
extensions are supported at a high level by manual cache refreshes.  Someone
could provide finer control to refresh specific cache entries if needed and
used.

>These problems are just an unavoidable part of trying to use caching -
>build the mechanism to support coherent replication and just deal with
>these downsides <shrug>. Sean is basically doing non-coherent
>replication today with his big GetTable query and that sounds like
>what is speeding things up, not caching individual PRs.

The caching provides the speedup on the client side.  The GetTable provides the
scalability on the SA side.

Whether we cache PR or PR data in the form of AHs, caching is in use and
required by the software today.  No one would suggest that IPoIB issue a PR
query per packet.

I feel that we're trying to come up with the ideal solution at the start.  Let's
start with what we have today and expand.  Currently IPoIB caches PR data, does
not share it, and doesn't update it.  The local SA collects data more
efficiently*, shares the data, and provides ways for updating it.  It is
refreshed in response to specific events, and scripts could be used to refresh
the cache periodically.

* If communication is only to a few nodes in a large cluster, then multiple Get
queries may be more efficient than using GetTable.  The local SA could be
expanded to cache query responses, rather than issuing its own in this case.

>Well, any time you renumber your network you will get burned and have
>to restart ipoib on every node with the way things are
>today. Something like a SM upgrade or changing to a new vendor SM, or
>increasing LMC could do this to you.

The local SA responds to these types of SA changes by refreshing the cache.

>Erm, any SA that returns a PR that is invalid in the network outside
>the time the network is being updated is seriously busted, IMHO.

The SA isn't guaranteed to know all links that are down at the time it returns a
PR.  There's a delay between when a path becomes unusable, and when the SA
detects it.  In fact, an end node could detect that a path is unusable before
the SA does, which could be the reason for it requesting a new path.  The SA
cannot sweep the fabric looking for changes before responding to every PR query.

- Sean


From harms at alcf.anl.gov  Mon Jul 23 13:05:17 2007
From: harms at alcf.anl.gov (Kevin Harms)
Date: Mon, 23 Jul 2007 15:05:17 -0500
Subject: [ofa-general] openibd / srp question
Message-ID: <B87F630C-EE6C-4BA7-95D1-87FD10DFB8A9@alcf.anl.gov>


	is there a reason that starting up the srp_daemon is bound to the  
SRPHA_ENABLE variable? I would like to propose that either the daemon  
is started up if the ib_srp module is loaded on boot or a second  
dependent variable is created that controls the srp_daemon startup.

ofed_1_2/linux-2.6.git/ofed_scripts/openibd : line 844

ib_srp)
	 /sbin/modprobe $mod > /dev/null 2>&1
	if [ "X${SRPHA_ENABLE}" == "Xyes" ]; then
		if [ ! -x /sbin/multipath ]; then
			echo "/sbin/multipath is required to enable SRP HA."
		else
			# Create 91-srp.rules file
			mkdir -p /etc/udev/rules.d
			if [ "$DISTRIB" == "SuSE"  ]; then
				cat > /etc/udev/rules.d/91-srp.rules << EOF
				ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
				EOF
			fi
			/sbin/modprobe dm_multipath > /dev/null 2>&1
			SRPD_ENABLE=yes
		fi
	fi

	if [ "X${SRPD_ENABLE}" = "Xyes" ]; then
		srp_daemon.sh &
		srp_daemon_pid=$!
		echo ${srp_daemon_pid} > ${srp_daemon_pidfile}
	fi
;;


From arthur.jones at qlogic.com  Mon Jul 23 13:06:40 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 23 Jul 2007 13:06:40 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070612084108.GK6470@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
Message-ID: <20070723200640.GA13117@bauxite.pathscale.com>

hi michael, ...

On Tue, Jun 12, 2007 at 11:41:08AM +0300, Michael S. Tsirkin wrote:
> For whom it may concern,
> I have created an ofed git tree updated with kernel bits from 2.6.22-rc4,
> and put that up at git://git.openfabrics.org/~mst/ofed_kernel.git
> [...] 
> In particular, there were a ton of ipath patches that it seems were
> for the most part applied.
> Qlogic maintainers, please help double check that I did not miss something
> of value.

thanks for setting this up, i'm still looking
at the diffs to make sure things got setup
correctly for the ipath stuff...

i have found it difficult to navigate the
source having to run:

./ofed_scripts/configure --kernel-version=2.6.xxx --without-quilt

every time to check against our tree.  so, rather
than spending the better part of the afternoon
running these scripts by hand, i created a shell
script to populate a bunch of branches with the
backports in each branch.

at qlogic we now keep the backports as branches in
our git tree and this, i find, is much easier to
handle.  because:

* viewing and navigating backport source becomes
  _much_ easier.
* merges are easier -- patches are much more fragile
  than branches.
* comparisons are easier -- checking for differences
  between backports and between a backport and the
  canonical source is faster and more convenient...
* changesets are readable.  trying to decipher diffs
  to patches is medically proven to take months, if not
  years, off your life.

anyway, what do you think?  is there any way i could
convince you to dump the backport patches and put
all the backports in branches?  i'm willing to do the
legwork if you see value...

arthur


From sweitzen at cisco.com  Mon Jul 23 14:27:56 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 23 Jul 2007 14:27:56 -0700
Subject: [ofa-general] created version "1.3" in bugzilla
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA303E98636@xmb-sjc-216.amer.cisco.com>

This allows me to REOPEN some RESOLVED LATER bugs from 1.2.
 
Scott

From ardavis at ichips.intel.com  Mon Jul 23 16:17:00 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 23 Jul 2007 16:17:00 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46968448.2000401@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>
	<46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
Message-ID: <46A536EC.4060201@ichips.intel.com>

Maintainers: please review the following proposal regarding new public 
download locations/website links and respond. This request originated 
from xwg.

http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html

Thanks.

> Arlin Davis wrote:
>
>> The proposal was attempting to come up with a method to automatically 
>> link to a package and description file from the download webpage. I 
>> have no problem
>> targeting http://openfabrics.org/downloads as long as we come up with 
>> a way for the webpage to correlate a description with a package 
>> without hand coding the links everytime. We need to come up with a 
>> method for automatic links to keep our download webpage updated and 
>> complete.
>>
>> What if we add a directory for each project under downloads and 
>> provide a README for a description? Other suggestions?
>>
> Here is a stab at what we have today for discussion purposes:
>
> Linux Libraries:
>    - libibverbs - http://www.openfabrics.org/downloads/
>    - librdmacm - http://www.openfabrics.org/~shefty/
>    - dapl - http://www.openfabrics.org/~ardavis/
>    - management - http://www.openfabrics.org/~halr/
> OFED Linux:
>    - OFED 1.2 release -
>      http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz
>    - OFED 1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and
>      RHEL 5.0 -
>      http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/
>    - OFED connectx release -
>      http://www.openfabrics.org/builds/connectx/release/
> OFED Linux Archives:
>    - SLES 10 OFED 1.0 RPMS - http://www.openfabrics.org/downloads/
>    - OFED 1.1 release -
>      https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/releases/
>    - OFED 1.0 release -
>      https://svn.openfabrics.org/svn/openib/gen2/branches/1.0/ofed/releases/
> WinOF for Windows:
>    - WinOF 1.0 release -
>      http://www.oprnfabrics.org/~ardavis/WinOF 1.0/WinOF_1-0.zip
>    - WinOF source - svn://openib.tc.cornell.edu
>    - WinOF FAQ -
>      https://wiki.openfabrics.org/tiki-index.php?page=OpenIB+Windows
>
> I would like to propose adding project directories under 
> http://www.openfabrics.org/downloads/  where appropriate and give 
> maintainers access. For example:
>
> http://www.openfabrics.org/downloads/verbs (rdreier)
> http://www.openfabrics.org/downloads/rdmacm (shefty)
> http://www.openfabrics.org/downloads/dapl (ardavis)
> http://www.openfabrics.org/downloads/management (sashak)
> http://www.openfabrics.org/downloads/OFED (vlad) 
> http://www.openfabrics.org/downloads/WinOF (ardavis)
> http://www.openfabrics.org/downloads/archives (vlad) ??
> etc...
>
> Each of these would contain a README that details the contents of the 
> directory along with WEB_README that provides a short description for 
> the webpage. Jeff could then automatically parse for directories under 
> downloads and if it contains WEB_README add a webpage link to the 
> directory along with the short description.
>
> Jeff, is this possible?
>
> comments?
>
> -arlin


From sean.hefty at intel.com  Mon Jul 23 16:32:31 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 16:32:31 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A536EC.4060201@ichips.intel.com>
Message-ID: <000901c7cd81$be2ffe00$9c98070a@amr.corp.intel.com>

>> http://www.openfabrics.org/downloads/verbs (rdreier)
>> http://www.openfabrics.org/downloads/rdmacm (shefty)
>> http://www.openfabrics.org/downloads/dapl (ardavis)
>> http://www.openfabrics.org/downloads/management (sashak)
>> http://www.openfabrics.org/downloads/OFED (vlad)
>> http://www.openfabrics.org/downloads/WinOF (ardavis)
>> http://www.openfabrics.org/downloads/archives (vlad) ??
>> etc...

These seem fine to me.  We will need a place for the hardware specific
libraries.  (.../downloads/hw/xxx ?)

Having the web page automatically update would be nice.
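
As a rough illustration of the automatic update idea, the WEB_README scan from the proposal could look something like the sketch below. The function name, the HTML layout, and the choice to use only the first line of WEB_README as the short description are all assumptions; nothing here reflects how the OFA site is actually generated, and the output is not HTML-escaped.

```shell
#!/bin/bash
# Hypothetical generator: prints one HTML list item for every project
# directory under a downloads root that contains a WEB_README file.
# Directories without WEB_README are skipped.
gen_download_links() {
	local root=$1 base_url=$2 dir name desc
	for dir in "$root"/*/; do
		[ -f "${dir}WEB_README" ] || continue
		name=$(basename "$dir")
		# Assumption: the first line of WEB_README is the short description.
		desc=$(head -n 1 "${dir}WEB_README")
		printf '<li><a href="%s/%s/">%s</a> - %s</li>\n' \
			"$base_url" "$name" "$name" "$desc"
	done
}
```

It would be called with the local directory and the public URL, for example `gen_download_links /path/to/downloads http://www.openfabrics.org/downloads`, where the local path is a placeholder.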

- Sean


From mshefty at ichips.intel.com  Mon Jul 23 17:22:49 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 17:22:49 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A54659.8010608@ichips.intel.com>

> 2.5. ULPs that use CM interface (like SRP) should have their own 
> pre-assigned Service-ID and use it while obtaining PR/MPR for
> establishing connections. The SA receiving the PR/MPR should match it
> against the policy and return the appropriate PR/MPR including SL,
> MTU and RATE.

We need to ensure that this can work without pre-assigned service IDs, 
or at least service IDs that are assigned within a fairly wide range, 
such as locally assigned IDs.

> 2.6. ULPs and programs using CMA to establish RC connection should 
> provide the CMA the target IP and Service-ID. Some of the ULPs might
> also provide QoS-Class (E.g. for SDP sockets that are provided the
> TOS socket option). The CMA should then use the provided Service-ID
> and optional QoS-Class and pass them in the PR/MPR request. The
> resulting PR/MPR should be used for configuring the connection QP.

The interface to the CMA needs to remain as transport independent as 
possible, and I am unsure of the transport independence of tying QoS to 
the destination port number.  (I'm not disagreeing; I'm just not sure at 
the moment it's the right approach.)

> PathRecord and MultiPathRecord enhancement for QoS: As mentioned
> above the PathRecord and MultiPathRecord attributes should be 
> enhanced to carry the Service-ID which is a 64bit value, which has
> been standardized by the IBTA. A new field QoS-Class is also
> provided. A new capability bit should describe the SM QoS support in
> the SA class port info. This approach provides an easy migration path
> for existing access layer and ULPs by not introducing new set of
> PR/MPR attribute.

Has any thought been given to how to make this scale?

> 5. CMA features
> ----------------
> 
> The CMA interface supports Service-ID through the notion of port
> space as a prefix to the port_num which is part of the sockaddr
> provided to rdma_resolve_addr(). What is missing is the explicit
> request for a QoS-Class that should allow the ULP (like SDP) to
> propagate a specific request for a class of service. A mechanism for
> providing the QoS-Class is available in the IPv6 address, so we could
> use that address field. Another option is to implement a special 
> connection options API for CMA.
> 
> Missing functionality by CMA is the usage of the provided QoS-Class
> and Service-ID in the sent PR/MPR. When a response is obtained it is
> an existing requirement for the CMA to use the PR/MPR from the
> response in setting up the QP address vector.

The most natural function to specify additional QoS parameters would be 
rdma_resolve_route.

- Sean


From sashak at voltaire.com  Mon Jul 23 17:33:02 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 03:33:02 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <f0e08f230707230833p6080b92h52d32a07b853bbdd@mail.gmail.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<f0e08f230707230833p6080b92h52d32a07b853bbdd@mail.gmail.com>
Message-ID: <20070724003302.GC11674@sashak.voltaire.com>

Hi Hal,

On 11:33 Mon 23 Jul     , Hal Rosenstock wrote:
> > >
> > > Not for "fast" switch reboot.
> > >
> > > Hmm, I think we could try to detect this by comparing
> > > SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even by seeing
> > > that PortInfo:LID is not set.
> >
> >
> > Not sure about checking PortInfo:LID. Wouldn't that approach need to be
> > qualified by PortState (armed or active) ? LFTTop seems better to me or
> > perhaps a combination of the two but I may be missing something.
> >
> 
>  Another thought on this :-(
> 
>  Not sure that resetting either LID or LFTTop is required by the spec either
>  so this is relying on "beyond the spec" behavior and may not be true for all
>  switch implementations.

Yes, it is similar to my findings. The PortState check suggested in your
previous email seems to be the only reliable reboot detection criterion
(for ports and switches).

Sasha


From sashak at voltaire.com  Mon Jul 23 17:51:53 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 03:51:53 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
Message-ID: <20070724005153.GD11674@sashak.voltaire.com>

Hi Eitan,

On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> Hi Sasha, Hal,
>  
> I think I have an idea:
>  
> Since this is a specific switch that reported ChangeBit or Trap why
> can't we just qualify that there was no change in the switch setup?

The ChangeBit seems to be a good starting point - OpenSM will then query
PortInfo for all switch ports anyway, and if PortState is <= INIT for all
ports (and = INIT for at least one port), it means that this switch was
rebooted/reinitialized.

And for a single port, a PortState drop to INIT should indicate
reinitialization.

Seems correct?
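
A minimal sketch of the check described above, written as a shell helper purely for illustration (OpenSM itself is C, and the function name is hypothetical). It assumes the numeric PortState encodings 1 = Down, 2 = Init, 3 = Armed, 4 = Active.

```shell
#!/bin/bash
# PortState encodings: 1 = Down, 2 = Init, 3 = Armed, 4 = Active.
IB_LINK_INIT=2

# Hypothetical helper: succeeds (returns 0) when the PortState values
# passed as arguments look like a rebooted/reinitialized switch, i.e.
# every port is <= INIT and at least one port is exactly INIT.
switch_looks_rebooted() {
	local state seen_init=1
	for state in "$@"; do
		if [ "$state" -gt "$IB_LINK_INIT" ]; then
			return 1	# some port is Armed or Active
		fi
		if [ "$state" -eq "$IB_LINK_INIT" ]; then
			seen_init=0
		fi
	done
	return "$seen_init"
}

if switch_looks_rebooted 1 2 2 1; then
	echo "switch was rebooted/reinitialized"
fi
```

A switch with all ports Down is deliberately not reported as rebooted, since no port has reached INIT.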

> We could send PortInfo, SwitchInfo,

SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is
set. Guess we are ok with it even now.

> LFT, MFT, SL2VL, VLArb, PKey queries
> and make sure no change from previous state. Or we could simply enforce
> last state by sending it over again ...

I think we would want to re-read the PKey tables in order to preserve
existing PKey indices, and just flush (overwrite with new settings) the
LFT, MFT, SL2VL and VLArb tables. Reasonable?

Sasha


From sashak at voltaire.com  Mon Jul 23 18:08:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 04:08:38 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent:
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com>
References: <20070722221455.GR27878@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com>
	<20070723130912.GV16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com>
Message-ID: <20070724010838.GE11674@sashak.voltaire.com>

On 21:05 Mon 23 Jul     , Eitan Zahavi wrote:
> Hi Sasha,
> 
> I read the new coding style doc after this last mail.

It was posted under an RFC subject on the list a couple of months ago...

> I thought you only defined new "indentation rules" and I am for doing
> this step as it is automatic and safe.
> But rewriting the code with shorter names and replacing all variables 
> and functions seems a little too risky in my mind.

Yes, the script enforces "indentation rules" only. I didn't think we would
be able to deal with the rest shortly (<= OFED 1.3). So currently it
is (1) an OpenSM style definition and (2) recommendations for the style of
new code/files.

Sasha


From sashak at voltaire.com  Mon Jul 23 18:16:31 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 04:16:31 +0300
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A536EC.4060201@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	<adamyy27cxk.fsf@cisco.com> <46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
	<46A536EC.4060201@ichips.intel.com>
Message-ID: <20070724011631.GG11674@sashak.voltaire.com>

On 16:17 Mon 23 Jul     , Arlin Davis wrote:
> >
> > I would like to propose adding project directories under 
> > http://www.openfabrics.org/downloads/  where appropriate and give 
> > maintainers access. For example:
> >
> > http://www.openfabrics.org/downloads/verbs (rdreier)
> > http://www.openfabrics.org/downloads/rdmacm (shefty)
> > http://www.openfabrics.org/downloads/dapl (ardavis)
> > http://www.openfabrics.org/downloads/management (sashak)
> > http://www.openfabrics.org/downloads/OFED (vlad) 
> > http://www.openfabrics.org/downloads/WinOF (ardavis)
> > http://www.openfabrics.org/downloads/archives (vlad) ??
> > etc...
> >
> > Each of these would contain a README that details the contents of the 
> > directory along with WEB_README that provides a short description for the 
> > webpage. Jeff could then automatically parse for directories under 
> > downloads and if it contains WEB_README add a webpage link to the directory 
> > along with the short description.

Looks fine to me.

Sasha


From sashak at voltaire.com  Mon Jul 23 18:33:06 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 04:33:06 +0300
Subject: [ofa-general] Command specification of ca_name and ca_port
In-Reply-To: <46A4C0C7.7020107@systemfabricworks.com>
References: <46A4C0C7.7020107@systemfabricworks.com>
Message-ID: <20070724013306.GH11674@sashak.voltaire.com>

Hi David,

On 09:52 Mon 23 Jul     , David McMillen wrote:
> 
>  There are a standard set of command line options that allow specification of 
>  the CA to use for sending the requests.  I'm adding these to programs that 
>  don't have them, since they are very useful when diagnosing a node connected 
>  to multiple subnets.  Even if you discount multiple subnets on purpose, 
>  sometimes this happens when the hardware connecting all of the CA ports to 
>  the same place gets broken, and that is when you need diagnostics that can 
>  help figure out what is where.
> 
>  The standard options are:
> 
>        -C <ca_name>    use the specified ca_name.
> 
>        -P <ca_port>    use the specified ca_port.
> 
>        -t <timeout_ms> override the default timeout for the solicited mads.
> 
>  My problem is that saquery already uses -C and -P, although the -t exists 
>  for the expected purpose.  Also, ibcheckerrs already uses -t for specifying 
>  the threshold file.

I think unified command line options across the diags are a good thing, so I
guess reasonable renaming should be acceptable.

> 
>  Changing the timeout for ibcheckerrs isn't critical, but not being able to 
>  do it doesn't seem right.  However, the saquery command could be really 
>  handy for figuring out split fabrics, and is useful to those of us that 
>  connect to multiple subnets.
> 
>  Does anybody have a useful suggestion?

'-T' for the threshold file? But that is the easy part - saquery renames are
less intuitive :(. Probably just lower case? Or a special query option
(-q or -Q), so queries could be specified as -qP, -qC?

Sasha


From mst at dev.mellanox.co.il  Mon Jul 23 20:03:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 06:03:41 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070723200640.GA13117@bauxite.pathscale.com>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
Message-ID: <20070724030318.GA7589@mellanox.co.il>

>Quoting Arthur Jones <arthur.jones at qlogic.com>:
>Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
>
>hi michael, ...
>
>On Tue, Jun 12, 2007 at 11:41:08AM +0300, Michael S. Tsirkin wrote:
>> For whom it may concern,
>> I have created an ofed git tree updated with kernel bits from 2.6.22-rc4,
>> and put that up at git://git.openfabrics.org/~mst/ofed_kernel.git
>> [...] 
>> In particular, there were a ton of ipath patches that it seems were
>> for the most part applied.
>> Qlogic maintainers, please help double check that I did not miss something
>> of value.
>
>thanks for setting this up, i'm still looking
>at the diffs to make sure things got setup
>correctly for the ipath stuff...
>
>i have found it difficult to navigate the
>source having to run:
>
>./ofed_scripts/configure --kernel-version=2.6.xxx --without-quilt
>
>everytime to check against our tree.  so, rather
>than spending the better part of the afternoon
>running these scripts by hand, i created a shell
>script to populate a bunch of branches with the
>backports in each branch.
>
>at qlogic we now keep the backports as branches in
>our git tree and this, i find, is much easier to
>handle.  because:
>
>* viewing and navigating backport source becomes
>  _much_ easier.
>* merges are easier -- patches are much more fragile
>  than branches.
>* comparisons are easier -- checking for differences
>  between backports and between a backport and the
>  canonical source is faster and more convenient...
>* changesets are readable.  trying to decipher diffs
>  to patches is medically proven to take months, if not
>  years, off your life.

Sigh. I wish it were possible to do everything through
addons tricks.

I see the advantages of the "bush of branches" -
for example it's possible
to add a backport patch to a recent kernel, and then
merge this into other kernel branches.

But I also see a serious problem with addressing: basically
git tracks content. It's not designed to track a bush
of branches taken together.  For example, take tagging:
tag namespace is global, so you can not have the same
tag point at multiple branches at the same time.

>anyway, what do you think?  is there anyway i could
>convince you to dump the backport patches and put
>all the backports in branches?  i'm willing to do the
>legwork if you see value...

Can you publish the scripts and/or the tree?
I think we can start by just running the scripts nightly,
making it possible for people to view backport history
with gitview.

-- 
MST


From krkumar2 at in.ibm.com  Mon Jul 23 20:44:46 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Tue, 24 Jul 2007 09:14:46 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <1185193921.26013.37.camel@localhost>
Message-ID: <OF782207DD.01F7022B-ON65257322.000F7C16-65257322.0014942E@in.ibm.com>

Hi Jamal,

J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/23/2007 06:02:01 PM:

> Yes, and these results were sent to you as well a while back.
> When i get the time when i get back i will look em up in my test machine
> and resend.

Actually you have not sent netperf results with prep and without prep.

> > No. I see value only in non-LLTX drivers which also gets the same TX lock
> > in the RX path.
>
> So _which_ non-LLTX driver doesnt do that? ;->

I have no idea since I haven't looked at all drivers. Can you tell me which
non-LLTX drivers do that ? I stated this as the sole criterion.

> tun driver doesnt use it either - but i doubt that makes it "bloat"

Adding extra code that is currently not usable (esp from a submission point)
is bloat.

> You waltz in, have the luxury of looking at my code, presentations, many
> discussions with me etc ...

"luxury" ? I had implemented the entire thing even before knowing that you
were working on something similar! And I had sent the first proposal to
netdev, *after* which you told me that you have your own code and
presentations (which I had never seen earlier - I joined netdev a few months
back, earlier I was working on RDMA, Infiniband as you know). And it didn't
give me any great ideas either, remember I had posted results for E1000 at
the time of sending the proposals. However I do give credit in my proposal
to you for the ideas that you provided (without actual code), and I did the
same for other people who did the same, like Dave, Sridhar. BTW, you too had
discussions with me, and I sent some patches to improve your code too, so it
looks like a two way street to me (and that is how open source works and
should).

> When i ask for differences to code you produced, they now seem to sum up
> to the two below. You dont think theres some honest issue with this
> picture?

Two changes ? That's it ? I gave a big list of changes between our
implementations but you twist my words to conclude there are just two (by
conveniently labelling everything else "cosmetic", or "potentially
useful"!)! Even my restart routine used a single API from the first day,
I would never imagine using multiple API's. Our code probably doesn't
have even one line that looks remotely similar!

To clarify : I suggested that you could send patches for the two *missing*
items if you can show they add value (and not the rest, as I consider
those will not improve the code/logic/algo).

> > ("lacking in frankness, candor, or sincerity; falsely or hypocritically
> > ingenuous; insincere") ???? Sorry, no response to personal comments and
> > have a flame-war :)
>
> Give me a better description.

Sorry, no personal comments. In fact I will avoid responding to baits and
innuendoes from now on.

Thanks,

- KK


From donour at cs.unm.edu  Mon Jul 23 21:20:17 2007
From: donour at cs.unm.edu (Donour Sizemore)
Date: Mon, 23 Jul 2007 22:20:17 -0600
Subject: [ofa-general] correct buffer init for multiple receives
Message-ID: <46A57E01.6080109@cs.unm.edu>

Hi everybody.

I'm having a bit of trouble setting up multiple receive buffers for 
verbs. I'm using the ud pingpong example in ofed1.2 as an outline, but 
that example posts the same buffer for all receives.

I'm trying to do something like:

--
  for(i=0; i < IB_RXDEPTH; i++){
     posix_memalign((void**)&(conn->bufs[i]),1024, (IB_MTU + 40));
     memset(conn->bufs, 0, (IB_MTU+40));
   }

  conn->pd = ibv_alloc_pd(conn->context);
  for(i=0; i < nbufs; i++)
     conn->mr = ibv_reg_mr(conn->pd, (conn->bufs[i]), (IB_MTU+40), 
IBV_ACCESS_LOCAL_WRITE);
--

Then I'm trying to do a bunch of ibv_post_recv()'s with each buf[i] as 
the address in the ibv_sge.

Is this what I should be doing? It seems to be causing a big mess, 
corrupting memory, and giving unrepeatable results.

thanks,

Donour Sizemore
University of New Mexico


From eitan at mellanox.co.il  Mon Jul 23 21:56:31 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 07:56:31 +0300
Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <20070724005153.GD11674@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>

> On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > Hi Sasha, Hal,
> >  
> > I think I have an idea:
> >  
> > Since this is a specific switch that reported ChangeBit or Trap why 
> > can't we just qualify that there was no change in the switch setup?
> 
> The ChangeBit seems to be good start point - then OpenSM will 
> query all switch ports PortInfo anyway and if for all ports 
> PortState is <= INIT (and at least for one port it is = 
> INIT), it means that this switch was rebooted/reinitialized.
> 
> And for single port PortState drop to = INIT should indicate 
> reinitialization.
> 
> Seems correct?
Yes.
> 
> > We could send PortInfo, SwitchInfo,
> 
> SwitchInfo is queried at each light sweep, PortInfo's if 
> ChangeBit is set. Guess we are ok with it even now.
I will double check that...
Well - even setting one port state to INIT did not cause the switch to
be reconfigured.
Seems the code does not enforce this condition yet.
> 
> > LFT, MFT, SL2VL, VLArb, PKey queries
> > and make sure no change from previous state. Or we could simply 
> > enforce last state by sending it over again ...
> 
> I think we could want to re-read PKey tables in order to 
> preserve existing PKey indices and just to flush (overwrite 
> with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable?
Correct.


> 
> Sasha
> 


From dotanb at dev.mellanox.co.il  Mon Jul 23 23:35:46 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 24 Jul 2007 09:35:46 +0300
Subject: [ofa-general] correct buffer init for multiple receives
In-Reply-To: <46A57E01.6080109@cs.unm.edu>
References: <46A57E01.6080109@cs.unm.edu>
Message-ID: <46A59DC2.10608@dev.mellanox.co.il>

Hi.

Donour Sizemore wrote:
> Hi everybody.
>
> I'm having a bit of trouble setting up multiple receive buffers for 
> verbs. I'm using the ud pingpong example in ofed1.2 as an outline, but 
> that example posts the same buffer for all receives.
>
> I'm trying to do something like:
>
> -- 
>  for(i=0; i < IB_RXDEPTH; i++){
>     posix_memalign((void**)&(conn->bufs[i]),1024, (IB_MTU + 40));
>     memset(conn->bufs, 0, (IB_MTU+40));
>   }
Which value are you using for IB_MTU?
(the maximum supported IB MTU value is 4K)
>
>  conn->pd = ibv_alloc_pd(conn->context);
>  for(i=0; i < nbufs; i++)
>     conn->mr = ibv_reg_mr(conn->pd, (conn->bufs[i]), (IB_MTU+40), 
> IBV_ACCESS_LOCAL_WRITE);
EVERY memory registration gives you a separate Memory Region handle (each 
with its own lkey+rkey).
In this example you should have an array which will be filled with nbufs 
MRs....

or

You can create one big buffer and handle the memory alignment when 
posting the WR yourself (if you wish).
> -- 
>
> Then I'm trying to do a bunch of ibv_post_recv()'s with each buf[i] as 
> the address in the ibv_sge.
No, because you will use the lkey of the last MR that you created with 
memory addresses of different buffers.
>
> Is this what I should be doing? It seems to be causing a big mess, 
> corrupting memory, and giving unrepeatable results.
When you post a send/recv request for a memory buffer that you registered, 
you need to use the lkey of the region that contains this buffer.
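
A minimal sketch of the "one big buffer" option, assuming a fixed slot size
and using only standard C/POSIX calls. The struct, the helper names and the
constants are hypothetical. Registering the whole area once with
ibv_reg_mr() would give a single lkey that is valid for every slot, so each
ibv_sge would differ only in its addr. Separately, the memset() in the
posted loop appears to target conn->bufs (the array of pointers) rather
than conn->bufs[i], which on its own could explain corrupted memory.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RX_DEPTH   8      /* assumed value, stands in for IB_RXDEPTH */
#define SLOT_SIZE  2088   /* assumed: 2048-byte payload + 40 bytes for the GRH */

/* Hypothetical container for one contiguous receive area. */
struct rx_area {
    char  *base;
    size_t slot_size;
    size_t nslots;
};

/* Allocates and zeroes one page-aligned area that holds all receive slots. */
static int rx_area_init(struct rx_area *a, size_t nslots, size_t slot_size)
{
    void *p = NULL;

    if (posix_memalign(&p, 4096, nslots * slot_size) != 0)
        return -1;
    memset(p, 0, nslots * slot_size);   /* zero the data, not a pointer array */
    a->base = p;
    a->slot_size = slot_size;
    a->nslots = nslots;
    return 0;
}

/* Address of slot i; this is what would go into ibv_sge.addr. */
static char *rx_slot(const struct rx_area *a, size_t i)
{
    return a->base + i * a->slot_size;
}
```

With per-buffer registration instead, the same indexing idea applies to an
array of MR handles: slot i is posted with the lkey of MR i.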


I hope that this helped you...
Dotan


From ogerlitz at voltaire.com  Tue Jul 24 00:50:47 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 24 Jul 2007 10:50:47 +0300
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <f0e08f230707230744i7c03acb9he74bbb912fa6d306@mail.gmail.com>
References: <46A36E77.5020307@gmail.com>	<f0e08f230707230554q2fe826e0iffc8624668702fe3@mail.gmail.com>	<46A4BE3E.4080606@gmail.com>
	<f0e08f230707230744i7c03acb9he74bbb912fa6d306@mail.gmail.com>
Message-ID: <46A5AF57.6040702@voltaire.com>

Hal Rosenstock wrote:
> On 7/23/07, *Moni Shoua* <monisonlists at gmail.com 
>     Hal Rosenstock wrote:
>      >
>      >     -               if (pkey == tmp_pkey) {
>      >     +               if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
>      >
>      >

>      > Wouldn't this allow 2 limited PKeys to match though ?

>     Hi Hal,
>     Can you please explain what do you mean? Perhaps by example?

> Two Pkeys which have their full membership bit off (0x8000). Two 
> limited members are not allowed to talk with each other.

Hal,

ib_find_pkey() is the buddy of ib_find_cached_pkey() which is in the 
stack from day one. Now, ib_find_cached_pkey does some abstraction where 
it masks out the membership bit, so pkeys are matched in 15 bit fashion.
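
To make the two notions of matching concrete, here is a minimal sketch. The
first helper is the 15-bit comparison from the patch; the second adds the
rule Hal cites, that two limited members must not talk to each other. Both
function names and the macro names are hypothetical, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_MEMBERSHIP_BIT 0x8000u   /* set = full member, clear = limited */
#define PKEY_BASE_MASK      0x7fffu   /* low 15 bits identify the partition */

/* Same partition, ignoring the membership bit (the comparison in the patch). */
static bool pkey_base_equal(uint16_t a, uint16_t b)
{
    return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

/* Same partition AND at least one side is a full member. */
static bool pkey_may_communicate(uint16_t a, uint16_t b)
{
    return pkey_base_equal(a, b) && ((a | b) & PKEY_MEMBERSHIP_BIT);
}
```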

Indeed, the overall design of the IB stack wrt partial membership in 
a partition is neither perfect nor final. I don't see why this masking off 
makes things worse than they could have been without it.

As you know, as some changes need to be done in the IB spec and the 
IPoIB RFC, I am personally holding off with suggesting changes/fixes 
till the spec is done, this is per the approach expressed by you and Sean.

Or.


From vlad at lists.openfabrics.org  Tue Jul 24 02:02:34 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 24 Jul 2007 02:02:34 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070724-0100 daily build status
Message-ID: <20070724090234.335D0E60821@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From ogerlitz at voltaire.com  Tue Jul 24 02:39:50 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 24 Jul 2007 12:39:50 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A4EF00.9070305@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
Message-ID: <46A5C8E6.5020906@voltaire.com>

Sean Hefty wrote:
>> What I have in mind is that IPoIB must not use cached IB path info.

>> If the IB stack has path caching which is in the default flow of 
>> requesting a path record, it should provide an API (eg flag to the 
>> function through which one does path query) to request a non cached path.

> Argh!  This was the original design.  I believe the current design is a 
> better approach.  The ULP shouldn't care whether the PR is cached or not 
> - only that it's usable.

Linux has a quite sophisticated mechanism to maintain / cache / probe / 
invalidate / update the network stack L2 neighbour info.

Stating that it is simply correct by design for IPoIB to use cached IB L2 
info even though the neighbour cache state machine decided to update/delete 
a neighbour is moving too fast, I think; some discussion is needed here.

My basic thought is that for IPoIB it's better to never use a cached path 
than to always use a cached path. But! Maybe there's a middle way 
here, let's think. This is what I was referring to when saying "almost 
always".

For example, in the Voltaire gen1 stack we had an ib arp module which 
was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). This 
module managed some sort of path cache, where IPoIB was always asking for 
a non-cached path and other ULPs were willing to get a cached path.

>> The design I was thinking to suggest for IPoIB is to almost always use 
>> this API since this policy makes the implementation consistent with 
>> the decisions made by the network stack neighbour cache

> This defeats one of the benefit of caching, which is using a single 
> GetTable query, versus literally hundreds or thousands of Get queries. 
> Consider that constant all-to-all communication using IPoIB between 1024 
> ports, with a 15 minute ARP table timeout would hit the SA with close to 
> 600 queries per second.
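
One way to arrive at that figure, assuming a single SA Get query per
unordered pair of ports within each ARP timeout period (the function name
is hypothetical): 1024 * 1023 / 2 = 523776 pairs, spread over 900 seconds.

```c
#include <stdint.h>

/*
 * Hypothetical estimate: every unordered pair of ports triggers one
 * path record query per ARP timeout interval.
 */
static double sa_queries_per_second(uint64_t ports, double timeout_s)
{
    double pairs = (double)ports * (double)(ports - 1) / 2.0;

    return pairs / timeout_s;
}
```

For 1024 ports and a 900 second timeout this gives roughly 582 queries per
second, consistent with "close to 600".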

If the cache is meant to serve all-to-all MPI jobs: in practice, with IB, 
people would --not-- be using IPoIB for their MPI jobs, since to get MPI 
performance (specifically latency) they want kernel AND net-stack bypass. 
So it does make sense to use a non-cached path in IPoIB if we agree 
that design-wise it's the correct approach.

> While I agree that there's the potential for a problem, given that IPoIB 
> has always cached PRs and no one has reported problems, I think we're 
> overstating the likelihood of issues occurring in practice.  Even the SA 
> caches the path data -- getting a PR from the SA doesn't provide any 
> additional guarantees.

I am not with you... I would expect an SA implementation to get a trap for 
each change in the fabric and to invalidate / recompute the relevant data 
structures associated with that change.

Or.


From vlad at lists.openfabrics.org  Tue Jul 24 02:43:32 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 24 Jul 2007 02:43:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070724-0200 daily build status
Message-ID: <20070724094332.4E8D3E60857@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From hal.rosenstock at gmail.com  Tue Jul 24 04:08:28 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 07:08:28 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <20070724005153.GD11674@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
Message-ID: <f0e08f230707240408j2073368ckaadc2b8da0559692@mail.gmail.com>

On 7/23/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> Hi Eitan,
>
> On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > Hi Sasha, Hal,
> >
> > I think I have an idea:
> >
> > Since this is a specific switch that reported ChangeBit or Trap why
> > can't we just qualify that there was no change in the switch setup?
>
> The ChangeBit seems to be good start point - then OpenSM will query all
> switch ports PortInfo anyway and if for all ports PortState is <= INIT
> (and at least for one port it is = INIT), it means that this switch was
> rebooted/reinitialized.
>
> And for single port PortState drop to = INIT should indicate
> reinitialization.
>
> Seems correct?


Wouldn't all ports being in INIT indicate a reset of the switch ?

-- Hal

> > We could send PortInfo, SwitchInfo,
>
> SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is
> set. Guess we are ok with it even now.
>
> > LFT, MFT, SL2VL, VLArb, PKey queries
> > and make sure no change from previous state. Or we could simply enforce
> > last state by sending it over again ...
>
> I think we could want to re-read PKey tables in order to preserve
> existing PKey indices and just to flush (overwrite with new settings)
> LFT, MFT, SL2VL, VLArb tables. Reasonable?
>
> Sasha
>

From dotanb at dev.mellanox.co.il  Tue Jul 24 04:32:14 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 24 Jul 2007 14:32:14 +0300
Subject: [ofa-general] [PATCH] libibmad: Fixed a name of a field in
	SwitchInfo to the right name
Message-ID: <200707241432.14848.dotanb@dev.mellanox.co.il>

Fix the name of the FilterRawOutbound field in SwitchInfo (it was dumped as
"FilterRawInbound").

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

Index: connectx_user/src/userspace/management/libibmad/src/fields.c
===================================================================
--- connectx_user.orig/src/userspace/management/libibmad/src/fields.c	2007-07-22 16:34:02.000000000 +0300
+++ connectx_user/src/userspace/management/libibmad/src/fields.c	2007-07-24 13:58:41.000000000 +0300
@@ -193,7 +193,7 @@ ib_field_t ib_mad_f [] = {
 	[IB_SW_PARTITION_ENF_INB_F]	{BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint},
 	[IB_SW_PARTITION_ENF_OUTB_F]	{BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint},
 	[IB_SW_FILTER_RAW_INB_F]	{BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint},
-	[IB_SW_FILTER_RAW_OUTB_F]	{BITSOFFS(131, 1), "FilterRawInbound", mad_dump_uint},
+	[IB_SW_FILTER_RAW_OUTB_F]	{BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint},
 	[IB_SW_ENHANCED_PORT0_F]	{BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint},
 
 	/*


From glebn at voltaire.com  Tue Jul 24 05:14:40 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Tue, 24 Jul 2007 15:14:40 +0300
Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4
Message-ID: <20070724121440.GA2775@minantech.com>

Hi,

 There is a bug in mlx4_post_send(). Data that is sent inline and
consists of multiple small sges isn't copied properly into the wqe.
The following patch fixes it for me.

Signed-off-by: Gleb Natapov <glebn at voltaire.com>

diff --git a/src/qp.c b/src/qp.c
index 66ee309..83a4fd4 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -288,6 +288,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
 				memcpy(wqe, addr, len);
 				wqe += len;
 				seg_len += len;
+				off += len;
 			}
 
 			if (seg_len) {
--
			Gleb.
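
A simplified model of why the offset has to advance, assuming the inline
data is laid out in fixed-size chunks and the copy loop must notice each
chunk boundary. This is not the libmlx4 code: the struct, the function and
the chunk size are hypothetical, and the real routine also writes a segment
header at each boundary.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 64   /* assumed chunk size for the illustration */

/* Simplified stand-in for a scatter/gather element. */
struct sge {
    const void *addr;
    size_t      len;
};

/*
 * Gather-copies all sges into dst (which must hold the total length) and
 * returns how many chunk boundaries were reached.
 */
static size_t gather_inline(unsigned char *dst, const struct sge *sg, size_t n)
{
    size_t off = 0;         /* bytes already used in the current chunk */
    size_t crossings = 0;

    for (size_t i = 0; i < n; i++) {
        const unsigned char *src = sg[i].addr;
        size_t len = sg[i].len;

        /* Fill and close chunks while this sge reaches the boundary. */
        while (len >= CHUNK - off) {
            size_t to_copy = CHUNK - off;

            memcpy(dst, src, to_copy);
            dst += to_copy;
            src += to_copy;
            len -= to_copy;
            off = 0;
            crossings++;
        }

        /* Tail that stays inside the current chunk. */
        memcpy(dst, src, len);
        dst += len;
        off += len;   /* the step the patch adds: remember the partial fill */
    }
    return crossings;
}
```

With three 30-byte sges the boundary at byte 64 falls inside the third sge.
Without the final "off += len" each sge would look as if it started a fresh
chunk, and that boundary would never be detected.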


From hal.rosenstock at gmail.com  Tue Jul 24 06:22:19 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 09:22:19 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <f0e08f230707240408j2073368ckaadc2b8da0559692@mail.gmail.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<f0e08f230707240408j2073368ckaadc2b8da0559692@mail.gmail.com>
Message-ID: <f0e08f230707240622k697f32a7n6ecbbae442c12293@mail.gmail.com>

On 7/24/07, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>
>
> On 7/23/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >
> > Hi Eitan,
> >
> > On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha, Hal,
> > >
> > > I think I have an idea:
> > >
> > > Since this is a specific switch that reported ChangeBit or Trap why
> > > can't we just qualify that there was no change in the switch setup?
> >
> > The ChangeBit seems to be good start point - then OpenSM will query all
> > switch ports PortInfo anyway and if for all ports PortState is <= INIT
> > (and at least for one port it is = INIT), it means that this switch was
> > rebooted/reinitialized.
> >
> > And for single port PortState drop to = INIT should indicate
> > reinitialization.
> >
> > Seems correct?
>
>
> Wouldn't this be all ports in INIT indicate reset of switch ?
>

for ports which are LinkUp. This is pretty dicey :-( I don't see a good way
to determine this.

-- Hal


> -- Hal
>
> > > We could send PortInfo, SwitchInfo,
> >
> > SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is
> > set. Guess we are ok with it even now.
> >
> > > LFT, MFT, SL2VL, VLArb, PKey queries
> > > and make sure no change from previous state. Or we could simply
> > enforce
> > > last state by sending it over again ...
> >
> > I think we could want to re-read PKey tables in order to preserve
> > existing PKey indices and just to flush (overwrite with new settings)
> > LFT, MFT, SL2VL, VLArb tables. Reasonable?
> >
> > Sasha
> >
>
>

From vlad at mellanox.co.il  Tue Jul 24 06:58:42 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 24 Jul 2007 16:58:42 +0300
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A536EC.4060201@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>
	<46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
	<46A536EC.4060201@ichips.intel.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65B2@mtlexch01.mtl.com>

> >
> > I would like to propose adding project directories under
> > http://www.openfabrics.org/downloads/  where appropriate and give
> > maintainers access. For example:
> >
> > http://www.openfabrics.org/downloads/verbs (rdreier)
> > http://www.openfabrics.org/downloads/rdmacm (shefty)
> > http://www.openfabrics.org/downloads/dapl (ardavis)
> > http://www.openfabrics.org/downloads/management (sashak)
> > http://www.openfabrics.org/downloads/OFED (vlad)
> > http://www.openfabrics.org/downloads/WinOF (ardavis)
> > http://www.openfabrics.org/downloads/archives (vlad) ??
> > etc...
> >
> > Each of these would contain a README that details the contents of the
> > directory along with WEB_README that provides a short description for
> > the webpage. Jeff could then automatically parse for directories under
> > downloads and if it contains WEB_README add a webpage link to the
> > directory along with the short description.
> >

Looks good to me.

Regards,
Vladimir


From sashak at voltaire.com  Tue Jul 24 07:04:50 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 17:04:50 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <f0e08f230707240408j2073368ckaadc2b8da0559692@mail.gmail.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<f0e08f230707240408j2073368ckaadc2b8da0559692@mail.gmail.com>
Message-ID: <20070724140450.GV27878@sashak.voltaire.com>

On 07:08 Tue 24 Jul     , Hal Rosenstock wrote:
>  On 7/23/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >
> > Hi Eitan,
> >
> > On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha, Hal,
> > >
> > > I think I have an idea:
> > >
> > > Since this is a specific switch that reported ChangeBit or Trap why
> > > can't we just qualify that there was no change in the switch setup?
> >
> > The ChangeBit seems to be good start point - then OpenSM will query all
> > switch ports PortInfo anyway and if for all ports PortState is <= INIT
> > (and at least for one port it is = INIT), it means that this switch was
> > rebooted/reinitialized.
> >
> > And for single port PortState drop to = INIT should indicate
> > reinitialization.
> >
> > Seems correct?
> 
> 
>  Wouldn't this be all ports in INIT indicate reset of switch ?

It includes not connected ports too, so I guess it should be <= INIT .

Sasha
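
The check discussed above can be sketched roughly as follows. This is
only an illustration of the "all ports <= INIT, at least one = INIT"
rule, not OpenSM code: the helper name and the plain array of states
are assumptions, and the caller is assumed to pass only the external
ports (1..N) of one switch. As noted elsewhere in this thread, such a
test cannot tell a rebooted switch apart from one whose links all
bounced at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PortState values as encoded in the IBA PortInfo attribute. */
enum port_state {
    PORT_STATE_DOWN   = 1,
    PORT_STATE_INIT   = 2,
    PORT_STATE_ARMED  = 3,
    PORT_STATE_ACTIVE = 4
};

/*
 * Hypothetical helper (not an OpenSM function).
 * Returns true when every port is at or below INIT and at least one
 * port is exactly INIT, which is the heuristic proposed in the thread
 * for "this switch was rebooted/reinitialized".
 */
static bool switch_looks_reinitialized(const enum port_state *states,
                                       size_t num_ports)
{
    bool any_init = false;

    assert(states != NULL || num_ports == 0);

    for (size_t i = 0; i < num_ports; i++) {
        if (states[i] > PORT_STATE_INIT)
            return false;          /* an ARMED/ACTIVE port survived */
        if (states[i] == PORT_STATE_INIT)
            any_init = true;       /* a connected port came back up */
    }
    return any_init;
}
```

Unconnected ports report DOWN, which is why the comparison is <= INIT
rather than == INIT.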


From jackm at dev.mellanox.co.il  Tue Jul 24 07:04:31 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 24 Jul 2007 17:04:31 +0300
Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4
In-Reply-To: <20070724121440.GA2775@minantech.com>
References: <20070724121440.GA2775@minantech.com>
Message-ID: <200707241704.31831.jackm@dev.mellanox.co.il>

On Tuesday 24 July 2007 15:14, Gleb Natapov wrote:
> Hi,
> 
>  There is a bug in mlx4_post_send(). Data that is sent inline and
> consists of multiple small sges isn't copied properly into the wqe.
> The following patch fixes it for me.
> 
> Signed-off-by: Gleb Natapov <glebn at voltaire.com>
> 
> diff --git a/src/qp.c b/src/qp.c
> index 66ee309..83a4fd4 100644
> --- a/src/qp.c
> +++ b/src/qp.c
> @@ -288,6 +288,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>  				memcpy(wqe, addr, len);
>  				wqe += len;
>  				seg_len += len;
> +				off += len;
>  			}
>  
>  			if (seg_len) {

Good catch! This patch is correct.
Roland?

- Jack
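
To show why the missing "off += len" matters, here is a simplified,
self-contained model of the copy loop. It is not the libmlx4 code:
CHUNK and HDR stand in for the real alignment and inline segment
header size, struct sge and copy_inline() are made-up names, and the
byte count is stored in host order rather than with htonl(). The point
is only that the offset inside the current chunk must advance after
the tail copy, otherwise the next small sge computes the remaining
room from a stale offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16  /* hypothetical boundary a segment must not cross */
#define HDR   4   /* hypothetical per-segment header (byte count)   */

struct sge { const void *addr; size_t len; };  /* illustrative only */

/*
 * Copy n gather entries into dst as [header][payload] segments, none
 * of which crosses a CHUNK boundary. dst is assumed to start on a
 * CHUNK boundary. Returns the number of segments, or -1 if dst is
 * too small.
 */
static int copy_inline(uint8_t *dst, size_t dst_size,
                       const struct sge *sg, size_t n)
{
    size_t seg_pos = 0;   /* position of the current segment header */
    size_t pos = HDR;     /* next payload byte to write             */
    size_t off = HDR;     /* offset of pos inside the current chunk */
    uint32_t seg_len = 0; /* payload bytes in the current segment   */
    int num_seg = 0;

    assert(dst != NULL && (sg != NULL || n == 0));
    if (dst_size < HDR)
        return -1;

    for (size_t i = 0; i < n; i++) {
        const uint8_t *addr = sg[i].addr;
        size_t len = sg[i].len;

        /* Fill the rest of the chunk, close the segment, open a new one. */
        while (len >= CHUNK - off) {
            size_t to_copy = CHUNK - off;

            if (pos + to_copy + HDR > dst_size)
                return -1;
            memcpy(dst + pos, addr, to_copy);
            len -= to_copy;
            pos += to_copy;
            addr += to_copy;
            seg_len += (uint32_t)to_copy;
            memcpy(dst + seg_pos, &seg_len, HDR);
            seg_len = 0;
            seg_pos = pos;
            pos += HDR;
            off = HDR;
            num_seg++;
        }

        if (pos + len > dst_size)
            return -1;
        if (len)
            memcpy(dst + pos, addr, len);
        pos += len;
        seg_len += (uint32_t)len;
        off += len;  /* the step the patch adds */
    }

    if (seg_len) {
        memcpy(dst + seg_pos, &seg_len, HDR);
        num_seg++;
    }
    return num_seg;
}
```

With three 5-byte entries, the third one has only 2 bytes of room left
in the first chunk, so it is split across two segments. Without the
final "off += len" the loop would believe 12 bytes were still free and
write past the boundary.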


From sashak at voltaire.com  Tue Jul 24 07:12:20 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 17:12:20 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Fixed a name of a field in
	SwitchInfo to the right name
In-Reply-To: <200707241432.14848.dotanb@dev.mellanox.co.il>
References: <200707241432.14848.dotanb@dev.mellanox.co.il>
Message-ID: <20070724141220.GW27878@sashak.voltaire.com>

On 14:32 Tue 24 Jul     , Dotan Barak wrote:
> Fixed a name of a field in SwitchInfo to the right name.
> 
> Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Tue Jul 24 07:30:41 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 10:30:41 -0400
Subject: [ofa-general] OpenSM detection of duplicated GUIDs on loopback
Message-ID: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>

Hi,

This starts off as a "minor" issue, and I know it has been discussed
somewhat in the past:

Putting a loopback connector on a (switch) link causes OpenSM to indicate
duplicated GUID error 0D18 as follows:

__osm_ni_rcv_set_links
{
...
          /*
             When there are only two nodes with exact same guids (connected back
             to back) - the previous check for duplicated guid will not catch
             them. But the link will be from the port to itself...
             Enhanced Port 0 is an exception to this
          */
          if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
              (port_num == p_ni_context->port_num) &&
              (port_num != 0))
          {
            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                     "__osm_ni_rcv_set_links: ERR 0D18: "
                     "Duplicate GUID found by link from a port to itself:"
                     "node 0x%" PRIx64 ", port number 0x%X\n",
                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
                     port_num );
...

So this occurs over and over and over and fills the log with the same spew.
This should be improved IMO.

Is this really a fatal condition ? Doesn't seem like it should be to me.

Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe
for this condition ?

Seems like something like an extra loopback bit should be added to some port
structure which should cause these links to be ignored. This bit would then
be reset when the peer is no longer itself.

Also, is there a relationship of this with the 12x/duplicated GUID code ?

Thanks.

-- Hal
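
A rough sketch of the loopback bit idea, to make the proposal
concrete. Everything here is hypothetical: struct port_rec is not an
OpenSM structure, check_loopback() is not an OpenSM function, and the
message goes to a stdio stream instead of osm_log(). It only shows a
per-port flag that is set when the peer is the port itself, cleared
when the peer changes, and used to log once per transition instead of
on every sweep. It does not address whether a loopback plug can be
told apart from two devices carrying the same GUID.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-port record, for illustration only. */
struct port_rec {
    uint64_t node_guid;
    uint8_t  port_num;
    bool     is_loopback;  /* the proposed extra bit */
};

/*
 * Compare a port against the peer reported for it. Returns true when
 * the link goes from the port to itself and should be ignored. The
 * message is written only when the port first enters the loopback
 * state, so repeated sweeps do not refill the log. Port 0 is
 * excluded, as in the existing check.
 */
static bool check_loopback(struct port_rec *p, uint64_t peer_guid,
                           uint8_t peer_port, FILE *log)
{
    bool self;

    assert(p != NULL);

    self = p->node_guid == peer_guid &&
           p->port_num == peer_port &&
           p->port_num != 0;

    if (self && !p->is_loopback && log != NULL)
        fprintf(log,
                "link from a port to itself: node 0x%" PRIx64
                ", port number 0x%X (ignoring link)\n",
                p->node_guid, (unsigned)p->port_num);

    p->is_loopback = self;  /* reset once the peer is no longer itself */
    return self;
}
```

A caller would skip link setup whenever this returns true and carry on
with the rest of the sweep.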

From eitan at mellanox.co.il  Tue Jul 24 07:44:22 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 17:44:22 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>

Hi Hal,
 
What is this "loopback" connector used for?
Does not seem to me like a very useful thing to do.
Anyway, if it is not a production environment we could add a "debug
mode" (-d flag option) to ignore this check.
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 

From hal.rosenstock at gmail.com  Tue Jul 24 07:53:00 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 10:53:00 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
Message-ID: <f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>

Hi Eitan,

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Hal,
>
> What is this "loopback" connector used for?
> Does not seem to me like a very useful thing to do.
>

Perhaps not but no reason OpenSM can't handle this more gracefully.

> Anyway, if it is not a production environment we could add a "debug mode"
> (-d flag option) to ignore this check.
>

Why would a separate flag be needed ?

-- Hal



From arthur.jones at qlogic.com  Tue Jul 24 07:53:35 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 07:53:35 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724030318.GA7589@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
Message-ID: <20070724145335.GF16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote:
> [...]
> But I also see a serious problem with addressing: basically
> git tracks content. It's not designed to track a bush
> of branches taken together.  For example, take tagging:
> tag namespace is global, so you can not have the same
> tag point at multiple branches at the same time.

agreed.  however, the way we use git, with the
location of the git DB as the "tag", it's not
really a problem in practice.  but tagging each
branch separately is indeed a PITA...

> >anyway, what do you think?  is there anyway i could
> >convince you to dump the backport patches and put
> >all the backports in branches?  i'm willing to do the
> >legwork if you see value...
> 
> Can you publish the scripts and/or the tree?
> I think we can start by just running the scripts nightly,
> making it possible for people to view backport history
> with gitview.

i've attached the script that i'm using to compare
the trees, but it's a total hack.  it doesn't keep
the patch history.  that would not be too hard to
do i guess -- if there's interest...

to run the script:

<cp attached files here...>
$ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel
$ cd ofed_kernel
$ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done

now you'll have a bunch of backport-2.6.xxx branches...

arthur
-------------- next part --------------
2.6.5_sles9_sp3
2.6.9_U2
2.6.9_U3
2.6.9_U4
2.6.9_U5
2.6.11_FC4
2.6.11
2.6.12
2.6.13_suse10_0_u
2.6.13
2.6.14
2.6.15_ubuntu606
2.6.15
2.6.16_sles10
2.6.16_sles10_sp1
2.6.16
2.6.17
2.6.18_FC6
2.6.18
2.6.19
2.6.20
2.6.21
2.6.22
-------------- next part --------------
A non-text attachment was scrubbed...
Name: create-backport.sh
Type: application/x-sh
Size: 265 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070724/f6992a90/attachment.sh>

From eitan at mellanox.co.il  Tue Jul 24 07:52:33 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 17:52:33 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>

From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: Tuesday, July 24, 2007 5:53 PM
To: Eitan Zahavi
Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
Subject: Re: OpenSM detection of duplicated GUIDs on loopback


	Hi Eitan,
	
	
	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 

		Hi Hal,
		 
		What is this "loopback" connector used for?
		Does not seem to me like a very useful thing to do.

	 
	Perhaps not but no reason OpenSM can't handle this more
gracefully.


		Anyway, if it is not a production environment we could
add a "debug mode" (-d flag option) to ignore this check.

	 
	Why would a separate flag be needed ?
	[EZ] Since I do not see any other solution for the SM to know
it is really a loopback plug rather than two devices with same GUID
connected back to back ...
	 
	-- Hal


		Eitan Zahavi 
		Senior Engineering Director, Software Architect 
		Mellanox Technologies LTD 
		Tel:+972-4-9097208
		Fax:+972-4-9593245 
		P.O. Box 586 Yokneam 20692 ISRAEL 

		 

From hal.rosenstock at gmail.com  Tue Jul 24 08:03:30 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 11:03:30 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
Message-ID: <f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Tuesday, July 24, 2007 5:53 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> Subject: Re: OpenSM detection of duplicated GUIDs on loopback
>
> Hi Eitan,
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> > Hi Hal,
> >
> > What is this "loopback" connector used for?
> > Does not seem to me like a very useful thing to do.
> >
>
> Perhaps not but no reason OpenSM can't handle this more gracefully.
>
> > Anyway, if it is not a production environment we could add a "debug
> > mode" (-d flag option) to ignore this check.
> >
>
> Why would a separate flag be needed ?
> [EZ] Since I do not see any other solution for the SM to know it is
> really a loopback plug rather than two devices with same GUID connected
> back to back ...
>
>
"Technically", this should only occur when looped back and not two devices
with same GUID as GUID == globally unique and a duplication indicates a
"manufacturing" issue.

Anyhow, can't these be treated the same (and handled more gracefully)
without an additional option/flag ?

-- Hal



From mst at dev.mellanox.co.il  Tue Jul 24 08:09:09 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 18:09:09 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724145335.GF16727@bauxite.pathscale.com>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
Message-ID: <20070724150909.GL4359@mellanox.co.il>

> Quoting Arthur Jones <arthur.jones at qlogic.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> hi michael, ...
> 
> On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote:
> > [...]
> > But I also see a serious problem with addressing: basically
> > git tracks content. It's not designed to track a bush
> > of branches taken together.  For example, take tagging:
> > tag namespace is global, so you can not have the same
> > tag point at multiple branches at the same time.
> 
> agreed.  however, the way we use git, with the
> location of the git DB as the "tag", it's not
> really a problem in practice.

who uses git this way?

> but tagging each
> branch separately is indeed a PITA...

This is just one problem.
For example, git pull can only merge one branch at a time.

> > >anyway, what do you think?  is there anyway i could
> > >convince you to dump the backport patches and put
> > >all the backports in branches?  i'm willing to do the
> > >legwork if you see value...
> > 
> > can you publish the scripts and/or the tree?
> > i think we can start by just running the scripts nightly,
> > making it possible for people to view backport history
> > with gitview.
> 
> i've attached the script that i'm using to compare
> the trees, but it's a total hack.  it doesn't keep
> the patch history.  that would not be too hard to
> do i guess -- if there's interest...
> 
> to run the script:
> 
> <cp attached files here...>
> $ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel
> $ cd ofed_kernel
> $ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done
> 
> now you'll have a bunch of backport-2.6.xxx branches...

So, would you like to have this script run nightly on ofed trees?

-- 
MST


From arthur.jones at qlogic.com  Tue Jul 24 08:23:05 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 08:23:05 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724150909.GL4359@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
Message-ID: <20070724152305.GG16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 06:09:09PM +0300, Michael S. Tsirkin wrote:
> > Quoting Arthur Jones <arthur.jones at qlogic.com>:
> > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> > 
> > hi michael, ...
> > 
> > On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote:
> > > [...]
> > > But I also see a serious problem with addressing: basically
> > > git tracks content. It's not designed to track a bush
> > > of branches taken together.  For example, take tagging:
> > > tag namespace is global, so you can not have the same
> > > tag point at multiple branches at the same time.
> > 
> > agreed.  however, the way we use git, with the
> > location of the git DB as the "tag", it's not
> > really a problem in practice.
> 
> who uses git this way?

i do.

> > but tagging each
> > branch separately is indeed a PITA...
> 
> This is just one problem.
> For example, git pull can only merge one branch at a time.

how is this a problem?  the way i use git,
i use a script to "reflow" the changes into
the dependent branches.  over the last few
months, anyway, it has worked fine...

> > > >anyway, what do you think?  is there anyway i could
> > > >convince you to dump the backport patches and put
> > > >all the backports in branches?  i'm willing to do the
> > > >legwork if you see value...
> > > 
> > > can you publish the scripts and/or the tree?
> > > i think we can start by just running the scripts nightly,
> > > making it possible for people to view backport history
> > > with gitview.
> > 
> > i've attached the script that i'm using to compare
> > the trees, but it's a total hack.  it doesn't keep
> > the patch history.  that would not be too hard to
> > do i guess -- if there's interest...
> > 
> > to run the script:
> > 
> > <cp attached files here...>
> > $ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel
> > $ cd ofed_kernel
> > $ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done
> > 
> > now you'll have a bunch of backport-2.6.xxx branches...
> 
> So, would you like to have this script run nightly on ofed trees?

if someone finds that useful.  my main motivation is
getting rid of all the patches in ofed, if running this
script nightly helps us to get there, then i'm all for
it.  if it's just for me, it's easy enough to run the
scripts by hand...

arthur


From mst at dev.mellanox.co.il  Tue Jul 24 08:32:28 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 18:32:28 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724152305.GG16727@bauxite.pathscale.com>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
Message-ID: <20070724152833.GN4359@mellanox.co.il>

> Quoting Arthur Jones <arthur.jones at qlogic.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> hi michael, ...
> 
> On Tue, Jul 24, 2007 at 06:09:09PM +0300, Michael S. Tsirkin wrote:
> > > Quoting Arthur Jones <arthur.jones at qlogic.com>:
> > > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> > > 
> > > hi michael, ...
> > > 
> > > On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote:
> > > > [...]
> > > > But I also see a serious problem with addressing: basically
> > > > git tracks content. It's not designed to track a bush
> > > > of branches taken together.  For example, take tagging:
> > > > tag namespace is global, so you can not have the same
> > > > tag point at multiple branches at the same time.
> > > 
> > > agreed.  however, the way we use git, with the
> > > location of the git DB as the "tag", it's not
> > > really a problem in practice.
> > 
> > who uses git this way?
> 
> i do.
> 
> > > but tagging each
> > > branch separately is indeed a PITA...
> > 
> > This is just one problem.
> > For example, git pull can only merge one branch at a time.
> 
> how is this a problem?  the way i use git,
> i use a script to "reflow" the changes into
> the dependent branches.  over the last few
> months, anyway, it has worked fine...

Precisely because no one developed on these branches,
so you are re-generating them from patches - not a problem,
but as you point out not too useful either.

If people start developing on these branches, then
eventually you will need to merge them - and git only merges
them one at a time.

-- 
MST


From arthur.jones at qlogic.com  Tue Jul 24 08:41:51 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 08:41:51 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724152833.GN4359@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
Message-ID: <20070724154151.GH16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 06:32:28PM +0300, Michael S. Tsirkin wrote:
> [...]
> > > For example, git pull can only merge one branch at a time.
> > 
> > how is this a problem?  the way i use git,
> > i use a script to "reflow" the changes into
> > the dependent branches.  over the last few
> > months, anyway, it has worked fine...
> 
> Precisely because no one developed on these branches,
> so you are re-generating them from patches - not a problem,
> but as you point out not too useful either.

well, no, i _have_ been doing development on the
local branches in our internal repo.  i also
merge in changes that you make to the ofed repo
to our internal backport branches.  the script
i posted is just so that i can more easily compare
our internal branches to the ofed backport "branches".

> If people start developing on these branches, then
> eventually you will need to merge them - and git only merges
> them one at a time.

yes, i have to merge them one at a time.  i
still don't see how this is a problem.  backport
changes can be pulled in and the changes from
upstream can be merged in as well.  i haven't
had a problem with this so far.  can you be more
specific about what you expect will fail?

arthur


From mst at dev.mellanox.co.il  Tue Jul 24 08:53:48 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 18:53:48 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724154151.GH16727@bauxite.pathscale.com>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
Message-ID: <20070724155348.GP4359@mellanox.co.il>

> Quoting Arthur Jones <arthur.jones at qlogic.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> hi michael, ...
> 
> On Tue, Jul 24, 2007 at 06:32:28PM +0300, Michael S. Tsirkin wrote:
> > [...]
> > > > For example, git pull can only merge one branch at a time.
> > > 
> > > how is this a problem?  the way i use git,
> > > i use a script to "reflow" the changes into
> > > the dependent branches.  over the last few
> > > months, anyway, it has worked fine...
> > 
> > Precisely because no one developed on these branches,
> > so you are re-generating them from patches - not a problem,
> > but as you point out not too useful either.
> 
> well, no, i _have_ been doing development on the
> local branches in our internal repo.  i also
> merge in changes that you make to the ofed repo
> to our internal backport branches.  the script
> i posted is just so that i can more easily compare
> our internal branches to the ofed backport "branches".

How do you do the merging?

> > If people start developing on these branches, then
> > eventually you will need to merge them - and git only merges
> > them one at a time.
> 
> yes, i have to merge them one at a time.  i
> still don't see how this is a problem.  backport
> changes can be pulled in and the changes from
> upstream can be merged in as well.  i haven't
> had a problem with this so far.  can you be more
> specific about what you expect will fail?

Well, as distro maintainers we need to merge a lot, from different
people. We'll have to write all kinds of scripts to do it instead of
a plain git pull.

And, I expect almost all git operations will have to be wrapped
in a script in some way, to operate on a bush of branches.

-- 
MST


From weiny2 at llnl.gov  Tue Jul 24 09:05:11 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 24 Jul 2007 09:05:11 -0700
Subject: [ofa-general] Command specification of ca_name and ca_port
In-Reply-To: <20070724013306.GH11674@sashak.voltaire.com>
References: <46A4C0C7.7020107@systemfabricworks.com>
	<20070724013306.GH11674@sashak.voltaire.com>
Message-ID: <20070724090511.636bbccb.weiny2@llnl.gov>

On Tue, 24 Jul 2007 04:33:06 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi David,
> 
> On 09:52 Mon 23 Jul     , David McMillen wrote:
> > 
> >  There are a standard set of command line options that allow specification of 
> >  the CA to use for sending the requests.  I'm adding these to programs that 
> >  don't have them, since they are very useful when diagnosing a node connected 
> >  to multiple subnets.  Even if you discount multiple subnets on purpose, 
> >  sometimes this happens when the hardware connecting all of the CA ports to 
> >  the same place gets broken, and that is when you need diagnostics that can 
> >  help figure out what is where.
> > 
> >  The standard options are:
> > 
> >        -C <ca_name>    use the specified ca_name.
> > 
> >        -P <ca_port>    use the specified ca_port.
> > 
> >        -t <timeout_ms> override the default timeout for the solicited mads.
> > 
> >  My problem is that saquery already uses -C and -P, although the -t exists 
> >  for the expected purpose.  Also, ibcheckerrs already uses -t for specifying 
> >  the threshold file.
> 
> I think unified command line options over diags are good thing, so I
> guess reasonable renaming should be acceptable.

I agree; however, right now saquery does not support specifying the ca_name or
ca_port, so you would have to add that support.

> 
> > 
> >  Changing the timeout for ibcheckerrs isn't critical, but not being able to 
> >  do it doesn't seem right.  However, the saquery command could be really 
> >  handy for figuring out split fabrics, and is useful to those of us that 
> >  connect to multiple subnets.
> > 
> >  Does anybody have a useful suggestion?
> 
> '-T' for the threshold file?

That sounds good.

>
> But it is easy part - saquery renames are
> less intuitive :(. Probably just lower case? Or special query option
> (-q or -Q), so queries could be specified as -qP, -qC?
> 

I disagree with this because ~50% of the options are queries, its primary
purpose is to query, and most of the other options change the format of the
output of the query.  Therefore, I don't think a -q should be required for a
query.  That seems redundant.

Perhaps just changing the current options to -c and -p, and adding -C and -P,
would be best.  I know this might break some scripts out there, particularly mine,
but I think it is the right thing to do if you really want consistency.
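
To make the proposed common option set concrete, here is a minimal shell
sketch using getopts. It is only an illustration of the naming discussed
above, not code from saquery or ibcheckerrs; the function name, the variable
names and the example CA name are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: parse the common diag options discussed in this thread.
# parse_diag_opts and the variable names are illustrative, not from the diags.
parse_diag_opts() {
    ca_name="" ca_port="" timeout_ms="" threshold_file=""
    OPTIND=1
    while getopts "C:P:t:T:" opt; do
        case "$opt" in
            C) ca_name=$OPTARG ;;         # -C <ca_name>
            P) ca_port=$OPTARG ;;         # -P <ca_port>
            t) timeout_ms=$OPTARG ;;      # -t <timeout_ms>
            T) threshold_file=$OPTARG ;;  # -T <threshold file> (proposed rename)
            *) return 1 ;;                # unknown option
        esac
    done
}

# Example call; "mthca0" is just a sample CA name.
parse_diag_opts -C mthca0 -P 2 -t 500
echo "ca=$ca_name port=$ca_port timeout=$timeout_ms"
```

With this layout the lower-case -c and -p would stay free for the existing
saquery queries.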

Thoughts?
Ira


From sean.hefty at intel.com  Tue Jul 24 09:09:35 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 09:09:35 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070723200640.GA13117@bauxite.pathscale.com>
Message-ID: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>

>at qlogic we now keep the backports as branches in
>our git tree and this, i find, is much easier to
>handle.  because:
>
>* viewing and navigating backport source becomes
>  _much_ easier.
>* merges are easier -- patches are much more fragile
>  than branches.
>* comparisons are easier -- checking for differences
>  between backports and between a backport and the
>  canonical source is faster and more convenient...
>* changesets are readable.  trying to decipher diffs
>  to patches is medically proven to take months, if not
>  years, off your life.

Let's add that you don't need patches to patches, and the order patches are
applied isn't determined alphabetically.

>anyway, what do you think?  is there anyway i could
>convince you to dump the backport patches and put
>all the backports in branches?  i'm willing to do the
>legwork if you see value...

I would love OFED to dump the patch directory concept.

- Sean


From arthur.jones at qlogic.com  Tue Jul 24 09:13:51 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 09:13:51 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724155348.GP4359@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
Message-ID: <20070724161351.GI16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 06:53:48PM +0300, Michael S. Tsirkin wrote:
> [...]
> > well, no, i _have_ been doing development on the
> > local branches in our internal repo.  i also
> > merge in changes that you make to the ofed repo
> > to our internal backport branches.  the script
> > i posted is just so that i can more easily compare
> > our internal branches to the ofed backport "branches".
> 
> How do you do the merging?

for just the backport branches, i merge different ways
from different sources:
   * from upstream, it's a pull into master and a git merge master
     into local backport branches -- i call this a reflow.
   * from local developers, it's a git pull straight into
     the backport branch, then reflow the repo.
   * from ofed, i apply the backport patch by hand and
     fixup the inevitable clashes -- either because part
     of the patch is already applied, or because context
     has changed enough for git apply to get confused.  when
     these are fixed up, reflow the repo...
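
As a rough illustration of the "reflow" step described in the list above, a
sketch could look like the following. This is not the script posted earlier
in the thread; the function name, the GIT override variable and the branch
names in the usage comment are assumptions.

```shell
#!/bin/sh
# Sketch of a "reflow": merge master into each dependent backport branch.
# GIT can be overridden (for example GIT=echo) to do a dry run.
GIT=${GIT:-git}

reflow() {
    # "$@": the backport branches that depend on master
    for b in "$@"; do
        $GIT checkout "$b" || return 1
        $GIT merge master  || return 1   # stop at the first conflict
    done
    $GIT checkout master                 # go back to where development happens
}

# Usage (branch names are hypothetical):
#   reflow backport-a backport-b
```

Stopping at the first failed merge leaves the repository on the branch that
needs a manual fixup.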
   
> > > If people start developing on these branches, then
> > > eventually you will need to merge them - and git only merges
> > > them one at a time.
> > 
> > yes, i have to merge them one at a time.  i
> > still don't see how this is a problem.  backport
> > changes can be pulled in and the changes from
> > upstream can be merged in as well.  i haven't
> > had a problem with this so far.  can you be more
> > specific about what you expect will fail?
> 
> Well, as distro maintainers we need to merge a lot, from different
> people. We'll have to write all kinds of scripts to do it instead of
> a plain git pull.

i can't imagine what script you would need.  can
you be more specific?  it would seem to me that you
could just pull straight in to the backport branch...

> And, I expect almost all git operations will have to be wrapped
> in a script in some way, to operate on a bush of branches.

so far, this hasn't been an issue for me.  the only
operation that i've scripted is the reflow.  for 
most work, i can just ignore the backport branches and
do the work in the (copy of) master, then reflow the
changes into the backports...

arthur


From mst at dev.mellanox.co.il  Tue Jul 24 09:16:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 19:16:46 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
Message-ID: <20070724161646.GA24797@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> >at qlogic we now keep the backports as branches in
> >our git tree and this, i find, is much easier to
> >handle.  because:
> >
> >* viewing and navigating backport source becomes
> >  _much_ easier.
> >* merges are easier -- patches are much more fragile
> >  than branches.
> >* comparisons are easier -- checking for differences
> >  between backports and between a backport and the
> >  canonical source is faster and more convenient...
> >* changesets are readable.  trying to decipher diffs
> >  to patches is medically proven to take months, if not
> >  years, off your life.
> 
> Let's add that you don't need patches to patches, and the order patches are
> applied isn't determined alphabetically.
> 
> >anyway, what do you think?  is there anyway i could
> >convince you to dump the backport patches and put
> >all the backports in branches?  i'm willing to do the
> >legwork if you see value...
> 
> I would love OFED to dump the patch directory concept.

I'd love to have a common source for all kernels,
and the kernel_addons mechanism does this for us whenever possible.

But, for these cases where the code actually needs to be modified,
applying a patch seems like the least evil way to do it.
Alternatives seem to be much worse.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 24 09:23:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 19:23:06 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724161351.GI16727@bauxite.pathscale.com>
References: <20070612084108.GK6470@mellanox.co.il>
	<20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
Message-ID: <20070724162305.GB24797@mellanox.co.il>

> Quoting Arthur Jones <arthur.jones at qlogic.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> hi michael, ...
> 
> On Tue, Jul 24, 2007 at 06:53:48PM +0300, Michael S. Tsirkin wrote:
> > [...]
> > > well, no, i _have_ been doing development on the
> > > local branches in our internal repo.  i also
> > > merge in changes that you make to the ofed repo
> > > to our internal backport branches.  the script
> > > i posted is just so that i can more easily compare
> > > our internal branches to the ofed backport "branches".
> > 
> > How do you do the merging?
> 
> for just the backport branches, i merge different ways
> from different sources:
>    * from upstream, it's a pull into master and a git merge master
>      into local backport branches -- i call this a reflow.
>    * from local developers, it's a git pull straight into
>      the backport branch, then reflow the repo.
>    * from ofed, i apply the backport patch by hand and
>      fixup the inevitable clashes -- either because part
>      of the patch is already applied, or because context
>      has changed enough for git apply to get confused.  when
>      these are fixed up, reflow the repo...

Hmm. Consider that you did all of the above, and then mailed me
that there's an update. Now I need to merge updates to multiple branches directly,
and git pull does not do this. It's a problem.

> > > > If people start developing on these branches, then
> > > > eventually you will need to merge them - and git only merges
> > > > them one at a time.
> > > 
> > > yes, i have to merge them one at a time.  i
> > > still don't see how this is a problem.  backport
> > > changes can be pulled in and the changes from
> > > upstream can be merged in as well.  i haven't
> > > had a problem with this so far.  can you be more
> > > specific about what you expect will fail?
> > 
> > Well, as distro maintainers we need to merge a lot, from different
> > people. We'll have to write all kinds of scripts to do it instead of
> > a plain git pull.
> 
> i can't imagine what script you would need.  can
> you be more specific?  it would seem to me that you
> could just pull straight in to the backport branch...

You'll have to check out branches one by one, and do a pull.
What if there's a conflict? I currently just do git reset --hard ORIG_HEAD
and mail the maintainer to fix it up - but this won't work
with the "bush of branches" approach.
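
To show what such a wrapper would have to do, here is a hedged sketch that
pulls several branches and rolls all of them back if any single pull fails.
It is only an illustration of the concern, not an existing OFED script; the
function name and the GIT override variable are assumptions.

```shell
#!/bin/sh
# Sketch: pull a set of branches, undoing every update if one pull fails.
GIT=${GIT:-git}

pull_all_or_nothing() {
    remote=$1; shift
    done_list=""                          # "branch:old_head" pairs updated so far
    for b in "$@"; do
        $GIT checkout "$b" || return 1
        old=$($GIT rev-parse HEAD) || return 1
        if ! $GIT pull "$remote" "$b"; then
            $GIT reset --hard "$old"      # abandon the failed merge
            for pair in $done_list; do    # undo the branches already updated
                $GIT checkout "${pair%%:*}"
                $GIT reset --hard "${pair#*:}"
            done
            return 1
        fi
        done_list="$done_list $b:$old"
    done
}
```

The single-branch case only needs ORIG_HEAD; with many branches the old head
of each one has to be remembered separately, which is the extra bookkeeping
in question.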

> > And, I expect almost all git operations will have to be wrapped
> > in a script in some way, to operate on a bush of branches.
> 
> so far, this hasn't been an issue for me.  the only
> operation that i've scripted is the reflow.  for 
> most work, i can just ignore the backport branches and
> do the work in the (copy of) master, then reflow the
> changes into the backports...

Because you only have your driver to maintain.


-- 
MST


From mshefty at ichips.intel.com  Tue Jul 24 09:29:12 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 09:29:12 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A5C8E6.5020906@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com>
Message-ID: <46A628D8.4050109@ichips.intel.com>

> Linux has a quite sophisticated mechanism to maintain / cache / probe / 
> invalidate / update the network stack L2 neighbour info.

Path records are not just L2 info.  They contain L4, L3, and L2 info 
together.

> For example, in the Voltaire gen1 stack we had an ib arp module which 
> was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). This 
> module managed some sort of path cache, where IPoIB was always asking for 
> non-cached path and other ULPs were willing to get cached path.

IMO, using a cached AH is no different than using a cached path.  You're 
simply mapping the PR data into another structure.

We're ignoring the problem here, and that is that a centralized SA 
doesn't scale.  MPI stacks have largely ignored this problem by simply 
not doing path record queries.  Path information is often hard-coded, 
with QPN data exchanged out of band over sockets (often over Ethernet).

We've seen problems running large MPI jobs without PR caching.  I know 
that Silverstorm/QLogic did as well.  And apparently Voltaire hit the 
same type of problem, since you added a caching module.  (Did Mellanox 
and Topspin/Cisco create PR caches as well?)  At least three companies 
working on IB came up with the same solution.  What is the objection to 
the current patch set?

- Sean


From arthur.jones at qlogic.com  Tue Jul 24 09:46:59 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 09:46:59 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724162305.GB24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
Message-ID: <20070724164659.GJ16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 07:23:06PM +0300, Michael S. Tsirkin wrote:
> [...]
> > for just the backport branches, i merge different ways
> > from different sources:
> >    * from upstream, it's a pull into master and a git merge master
> >      into local backport branches -- i call this a reflow.
> >    * from local developers, it's a git pull straight into
> >      the backport branch, then reflow the repo.
> >    * from ofed, i apply the backport patch by hand and
> >      fixup the inevitable clashes -- either because part
> >      of the patch is already applied, or because context
> >      has changed enough for git apply to get confused.  when
> >      these are fixed up, reflow the repo...
> 
> Hmm. Consider that you did all of the above, and then mailed me
> that there's an update. Now I need to merge updates to multiple branches directly,
> and git pull does not do this. It's a problem.

for changes made to the canonical source, it's
just git pull into ofed_kernel and a reflow.

for changes made to the backports, you would need
to git checkout and git pull into each of the
backport branches _in which i made a change_.
the case that i make changes to _all_ or even
a significant number of backport patches is
sufficiently rare that i doubt it is worth scripting.
but, if the script is necessary, it's pretty
straightforward:

set -e
# $remote and $changed_branches are placeholders to fill in
for b in $changed_branches; do
   git checkout "$b"
   git pull "$remote" "$b"
done

> [...]
> > i can't imagine what script you would need.  can
> > you be more specific?  it would seem to me that you
> > could just pull straight in to the backport branch...
> 
> You'll have to check out branches one by one, and do a pull.
> What if there's a conflict? I currently just do git reset --hard ORIG_HEAD
> and mail the maintainer to fix it up - but this won't work
> with the "bush of branches" approach.

it works for me.  what do you expect will break?

> > > And, I expect almost all git operations will have to be wrapped
> > > in a script in some way, to operate on a bush of branches.
> > 
> > so far, this hasn't been an issue for me.  the only
> > operation that i've scripted is the reflow.  for 
> > most work, i can just ignore the backport branches and
> > do the work in the (copy of) master, then reflow the
> > changes into the backports...
> 
> Because you only have your driver to maintain.

no, i have to maintain quite a few of the
ofed backport branches as well for our release.
if i started getting pull requests from people
with changes to 15 backport branches in one go,
i'd probably want to script it...

i have found that drawing a DAG with graphviz has
been a big help in making sure that i organize the
branches correctly...
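
a minimal sketch of how such a graph could be generated, assuming every
backport branch hangs directly off one parent branch (the real layout may be
deeper).  the function name and the branch names are hypothetical:

```shell
#!/bin/sh
# Sketch: emit a Graphviz DOT description of "parent -> backport" edges.
branch_dag() {
    parent=$1; shift
    echo "digraph branches {"
    for b in "$@"; do
        echo "    \"$parent\" -> \"$b\";"
    done
    echo "}"
}

# Usage (branch names are hypothetical):
#   branch_dag master backport-a backport-b | dot -Tpng -o branches.png
```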

arthur


From arthur.jones at qlogic.com  Tue Jul 24 09:50:32 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 09:50:32 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724161646.GA24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
Message-ID: <20070724165032.GK16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 07:16:46PM +0300, Michael S. Tsirkin wrote:
> [...]
> But, for these cases where the code actually needs to be modified,
> applying a patch seems like the least evil way to do it.
> Alternatives seem to be much worse.

what is it about patches that are less evil
than changesets?  can you list some of the
advantages?

arthur


From mst at dev.mellanox.co.il  Tue Jul 24 09:52:03 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 19:52:03 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724164659.GJ16727@bauxite.pathscale.com>
References: <20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
	<20070724164659.GJ16727@bauxite.pathscale.com>
Message-ID: <20070724165203.GC24797@mellanox.co.il>

> > Because you only have your driver to maintain.
> 
> no, i have to maintain quite a few of the
> ofed backport branches as well for our release.
> if i started getting pull requests from people
> with changes to 15 backport branches in one go,
> i'd probably want to script it...

Yea. Happens all the time here: when a component maintainer
makes a change, it will typically affect all backports or none.

> i have found that drawing a DAG with graphviz has
> been a big help in making sure that i organize the
> branches correctly...

Ugh .. *that* sounds complicated.
Looks like it's much simpler with current setup.

-- 
MST


From sashak at voltaire.com  Tue Jul 24 10:00:11 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 20:00:11 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
Message-ID: <20070724170011.GY27878@sashak.voltaire.com>

Hi,

On 11:03 Tue 24 Jul     , Hal Rosenstock wrote:
>  On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> >  *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> > Hi Eitan,
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> > >
> > >  *Hi Hal,*
> > > **
> > > *What is this "loopback" connector used for?*
> > > *Does not seem to me like a very useful thing to do.*
> > >
> > **
> > Perhaps not but no reason OpenSM can't handle this more gracefully.

I don't have a "loopback" plug, but I have used loopback connections for some
checks with the simulator. There is nothing illegal about them, so I think it
would be better to support them.

> >  *Anyway, if it is not a production environment we could add a "debug
> > > mode" (-d flag option) to ignore this check.*
> > >
> > **
> > Why would a separate flag be needed ?
> > *[EZ] Since I do not see any other solution for the SM  to know it is
> > really a loop back plug rather then two devices with same GUID connected
> > back to back ...*

We have also seen cases where moving a port triggers the duplicate GUID
detector (originally reported on a real fabric, and trivially
reproducible in a simulated environment).

So we probably need to find a better way to handle duplicate GUID
detection (in general, not just for loopback). For example, node_info
content could be compared. More ideas?

Sasha


From mst at dev.mellanox.co.il  Tue Jul 24 09:55:50 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 19:55:50 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724165032.GK16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
Message-ID: <20070724165550.GD24797@mellanox.co.il>

> Quoting Arthur Jones <arthur.jones at qlogic.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> hi michael, ...
> 
> On Tue, Jul 24, 2007 at 07:16:46PM +0300, Michael S. Tsirkin wrote:
> > [...]
> > But, for these cases where the code actually needs to be modified,
> > applying a patch seems like the least evil way to do it.
> > Alternatives seem to be much worse.
> 
> what is it about patches that are less evil
> than changesets?  can you list some of the
> advantages?

changesets *do not exist* in git - git tracks content.

I compare "multiple directories with patches" with the "bush of branches".
With bush of branches:
git pull broken, git archive broken, git tag broken, git reset broken.
It looks like the list can be continued.

Yes, we can start building our own tools on top of git to do this,
but I'd rather not.

-- 
MST


From sashak at voltaire.com  Tue Jul 24 10:04:32 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 20:04:32 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
Message-ID: <20070724170432.GZ27878@sashak.voltaire.com>

On 07:56 Tue 24 Jul     , Eitan Zahavi wrote:
> > On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > > Hi Sasha, Hal,
> > >  
> > > I think I have an idea:
> > >  
> > > Since this is a specific switch that reported ChangeBit or Trap why 
> > > can't we just qualify that there was no change in the switch setup?
> > 
> > The ChangeBit seems to be good start point - then OpenSM will 
> > query all switch ports PortInfo anyway and if for all ports 
> > PortState is <= INIT (and at least for one port it is = 
> > INIT), it means that this switch was rebooted/reinitialized.
> > 
> > And for single port PortState drop to = INIT should indicate 
> > reinitialization.
> > 
> > Seems correct?
> Yes.
> > 
> > > We could send PortInfo, SwitchInfo,
> > 
> > SwitchInfo is queried at each light sweep, PortInfo's if 
> > ChangeBit is set. Guess we are ok with it even now.
> I will double check that...
> Well - even setting one port state to INIT did not cause the switch to
> be reconfigured.
> Seems the code does not enforce this condition yet.
> > 
> > > LFT, MFT, SL2VL, VLArb, PKey queries
> > > and make sure no change from previous state. Or we could simply 
> > > enforce last state by sending it over again ...
> > 
> > I think we could want to re-read PKey tables in order to 
> > preserve existing PKey indices and just to flush (overwrite 
> > with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable?
> Correct.

Ok, I will prepare patches. I am thinking of separate patches for switches
and ports. MFT should likely be handled separately too, since we don't
do incremental updates there yet.
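
The switch reinitialization test agreed on above can be sketched as follows.
This is only an illustration of the condition, not OpenSM code; the function
name is an assumption, and the numeric values are the standard IB PortState
encoding (1=Down, 2=Init, 3=Armed, 4=Active).

```shell
#!/bin/sh
# Sketch of the reinitialization check; not taken from OpenSM.
IB_PORT_INIT=2

# switch_was_reinitialized STATE...
# Succeeds when no port is above INIT and at least one port is exactly INIT.
switch_was_reinitialized() {
    seen_init=1
    for state in "$@"; do
        if [ "$state" -gt "$IB_PORT_INIT" ]; then
            return 1                      # some port is ARMED or ACTIVE
        fi
        if [ "$state" -eq "$IB_PORT_INIT" ]; then
            seen_init=0                   # found a port in INIT
        fi
    done
    return "$seen_init"
}
```

For a single port, the same idea reduces to checking that its PortState
dropped to INIT.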

Sasha


From sashak at voltaire.com  Tue Jul 24 10:07:25 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 20:07:25 +0300
Subject: [ofa-general] Command specification of ca_name and ca_port
In-Reply-To: <20070724090511.636bbccb.weiny2@llnl.gov>
References: <46A4C0C7.7020107@systemfabricworks.com>
	<20070724013306.GH11674@sashak.voltaire.com>
	<20070724090511.636bbccb.weiny2@llnl.gov>
Message-ID: <20070724170725.GA27878@sashak.voltaire.com>

On 09:05 Tue 24 Jul     , Ira Weiny wrote:
> >
> > But it is easy part - saquery renames are
> > less intuitive :(. Probably just lower case? Or special query option
> > (-q or -Q), so queries could be specified as -qP, -qC?
> > 
> 
> I disagree with this because ~50% of the options are queries, its primary
> purpose is to query, and most of the other options change the format of the
> output of the query.  Therefore, I don't think a -q should be required for a
> query.  I think that seems redundant.
> 
> Perhaps just changing the current option to -c,-p, and adding -C and -P would
> be best.  I know this might break some scripts out there, particularly mine,
> but I think it is the right thing to do if you really want consistency.
> 
> Thoughts?

-c and -p are fine with me too.

Sasha


From arthur.jones at qlogic.com  Tue Jul 24 10:07:26 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 10:07:26 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724165550.GD24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
Message-ID: <20070724170726.GL16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 07:55:50PM +0300, Michael S. Tsirkin wrote:
> [...]
> > what is it about patches that are less evil
> > than changesets?  can you list some of the
> > advantages?
> 
> changesets *do not exist* in git - git tracks content.
> 
> I compare "multiple directories with patches" with the "bush of branches".
> With bush of branches:
> git pull broken, git archive broken, git tag broken, git reset broken.
> It looks like the list can be continued.

none of these things are broken, they are just
used differently.  despite your apprehension, i'd
like to see a list of the _advantages_ of multiple
directories with patches -- perhaps with this list
in hand we can see how they stack up...

> Yes, we can start building our own tools on top of git to do this,
> but I'd rather not.

i'd hardly call a 4 line script a "tool".  compare
it to the ./ofed_scripts/configure script which is
no longer necessary with backport branches.  i think
the complexity argument doesn't take you too far...

i realize that you're attached to your current method,
but i've _used_ a different method, and i can say from
experience that it works _much_ better...

at sonoma, i heard quite a few people asking for easier
access to the OFED source.  from the user's point of view,
pulling a single branch from a repo is _much_ simpler
than our current setup, don't you think?

arthur


From arthur.jones at qlogic.com  Tue Jul 24 10:11:21 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 10:11:21 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724165203.GC24797@mellanox.co.il>
References: <20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
	<20070724164659.GJ16727@bauxite.pathscale.com>
	<20070724165203.GC24797@mellanox.co.il>
Message-ID: <20070724171121.GM16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 07:52:03PM +0300, Michael S. Tsirkin wrote:
> [...]
> > i have found that drawing a DAG with graphviz has
> > been a big help in making sure that i organize the
> > branches correctly...
> 
> Ugh .. *that* sounds complicated.
> Looks like it's much simpler with current setup.

compared to the rather sophisticated linux-kernel
changesets that i see from you on this list -- it's
child's play...

compared to figuring out the list of options for
ofed_scripts/configure just so we can _see_ the
source we're running on our box -- it's a walk in
the park...

one of the goals of OFED 1.3 is to make access
to the source easier.  to do that, we will prob
need to rid ourselves of patches...

arthur


From mshefty at ichips.intel.com  Tue Jul 24 10:11:52 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 10:11:52 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724162305.GB24797@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>	<20070723200640.GA13117@bauxite.pathscale.com>	<20070724030318.GA7589@mellanox.co.il>	<20070724145335.GF16727@bauxite.pathscale.com>	<20070724150909.GL4359@mellanox.co.il>	<20070724152305.GG16727@bauxite.pathscale.com>	<20070724152833.GN4359@mellanox.co.il>	<20070724154151.GH16727@bauxite.pathscale.com>	<20070724155348.GP4359@mellanox.co.il>	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
Message-ID: <46A632D8.3030801@ichips.intel.com>

> Hmm. Consider that you did all of the above, and then mail me
> that there's an update. Now I need to merge updates to multiple branches directly
> and git pull does not do this. It's a problem.

A simple script can do this.

> You'll have to check out branches one by one, and do a pull.
> What if there's a conflict? I currently just do git reset --hard ORIG_HEAD
> and mail the maintainer to fix it up - but this won't work
> with the "bush of branches" approach.

If there's a conflict, then you need a different patch.  A single patch 
may work for all backports, or a fix may require different patches 
depending on the kernel version.  As it stands now, there are patches 
that we apply that do not work and expect a subsequent patch to fix it up.

- Sean


From mst at dev.mellanox.co.il  Tue Jul 24 10:14:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:14:45 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <46A632D8.3030801@ichips.intel.com>
References: <20070724030318.GA7589@mellanox.co.il>
	<20070724145335.GF16727@bauxite.pathscale.com>
	<20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
	<46A632D8.3030801@ichips.intel.com>
Message-ID: <20070724171445.GE24797@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> >Hmm. Consider that you did all of the above, and then mail me
> >that there's an update. Now I need to merge updates to multiple branches 
> >directly
> >and git pull does not do this. It's a problem.
> 
> A simple script can do this.

Basically we'll have to script around all of git.

Examples: What if there's a conflict? I currently do git reset; will we need
a script for this too? The tagging issue will have to be resolved
somehow - by a naming convention for tags? Another script ...

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 24 10:16:49 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:16:49 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724171121.GM16727@bauxite.pathscale.com>
References: <20070724150909.GL4359@mellanox.co.il>
	<20070724152305.GG16727@bauxite.pathscale.com>
	<20070724152833.GN4359@mellanox.co.il>
	<20070724154151.GH16727@bauxite.pathscale.com>
	<20070724155348.GP4359@mellanox.co.il>
	<20070724161351.GI16727@bauxite.pathscale.com>
	<20070724162305.GB24797@mellanox.co.il>
	<20070724164659.GJ16727@bauxite.pathscale.com>
	<20070724165203.GC24797@mellanox.co.il>
	<20070724171121.GM16727@bauxite.pathscale.com>
Message-ID: <20070724171649.GF24797@mellanox.co.il>

> one of the goals of OFED 1.3 is to make access
> to the source easier.  to do that, we will prob
> need to rid ourselves of patches...

I'm working on a rather simpler solution to this problem.
Stay tuned.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 24 10:19:24 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:19:24 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724170726.GL16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
Message-ID: <20070724171924.GG24797@mellanox.co.il>

> i realize that you're attached to your current method,
> but i've _used_ a different method, and i can say from
> experience that it works _much_ better...

I'd like to see a clean method, that doesn't replace one set of
problems that I understand with another that I have to learn.

> at sonoma, i heard quite a few people asking for easier
> access to the OFED source.  from the user's point of view,
> pulling a single branch from a repo is _much_ simpler
> than our current setup, don't you think?

I think users really want tarballs. If we had tarballs
prepatched for all kernels, I think the problem would be solved
for most people.

-- 
MST


From sean.hefty at intel.com  Tue Jul 24 10:19:49 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 10:19:49 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724171445.GE24797@mellanox.co.il>
Message-ID: <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com>

>Examples: What if there's a conflict? I currently do git reset, we'll

If there's a conflict applying a patch, you reject it.  I fail to see any issue
here.

- Sean


From tom at opengridcomputing.com  Tue Jul 24 10:20:45 2007
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 24 Jul 2007 12:20:45 -0500
Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
Message-ID: <1185297645.14681.22.camel@trinity.ogc.int>

For those interested in NFS-RDMA, OGC has created an install package
based on the OFA 1.2 GA release. The package supports both SLES 10 and
RHEL 5. You can download this package from
http://www.opengridcomputing.com/nfs-rdma.html.

Please let me know if you find any problems.

Thanks,
Tom


From arthur.jones at qlogic.com  Tue Jul 24 10:28:26 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 10:28:26 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724171924.GG24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
Message-ID: <20070724172826.GN16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 08:19:24PM +0300, Michael S. Tsirkin wrote:
> > i realize that you're attached to your current method,
> > but i've _used_ a different method, and i can say from
> > experience that it works _much_ better...
> 
> I'd like to see a clean method, that doesn't replace one set of
> problems that I understand with another that I have to learn.

i think we'll be further along by just doing a better
job rather than waiting endlessly for perfection to come
along.

i'd _really_ like to see a list of the advantages of
patches over branches.  it's hard for me to know if
i'm just missing something if the case is not laid out...

arthur


From mst at dev.mellanox.co.il  Tue Jul 24 10:42:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:42:55 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com>
References: <20070724171445.GE24797@mellanox.co.il>
	<000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com>
Message-ID: <20070724174255.GH24797@mellanox.co.il>

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> >Examples: What if there's a conflict? I currently do git reset, we'll
> 
> If there's a conflict applying a patch, you reject it.  I fail to see any issue
> here.

But the proposal here was to have a bush of branches, all of which
need to be merged at the same time. It's possible that some
would merge and some would fail, leaving me in an inconsistent state,
and no easy way to get back to where I started.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 24 10:52:20 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:52:20 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724172826.GN16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
Message-ID: <20070724175220.GI24797@mellanox.co.il>

> i'd _really_ like to see a list of the advantages of
> patches over branches.  it's hard for me to know if
> i'm just missing something if the case is not laid out...

Here's a short list off the top of my head

- A single git pull merges any number of backport changes
- A single git reset ORIG_HEAD recovers from a conflicting merge
- A single tag tags all code for all kernels
- On update from upstream, if there is a conflict
  between upstream code and a patch
  it's easy to temporarily remove the patch, complete the merge,
  and go bugger the patch author
- For recent kernels there are almost no patches.
  So an update from upstream for these kernels is free,
  with branches I will still need to update all branches.
- Adding a fix which only affects common code
  is currently straight-forward: make a change, commit.
  With multiple branches every fix must be pulled into
  all branches.

-- 
MST


From mshefty at ichips.intel.com  Tue Jul 24 10:57:54 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 10:57:54 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724174255.GH24797@mellanox.co.il>
References: <20070724171445.GE24797@mellanox.co.il>	<000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com>
	<20070724174255.GH24797@mellanox.co.il>
Message-ID: <46A63DA2.5000602@ichips.intel.com>

> But the proposal here was to have a bush of branches, all of which
> need to be merged at the same time. It's possible that some
> would merge and some would fail, leaving me in an inconsistent state,
> and no easy way to get back to where I started.

A fix could be applied to some kernels, but not others.  In fact, if a 
patch works for kernel X & Y, but has a conflict with kernel Z, then 
different patches are needed anyway.  I don't see the requirement to 
merge everything or even apply a fix to all kernels at the same time.

- Sean


From mshefty at ichips.intel.com  Tue Jul 24 11:13:08 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 11:13:08 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724175220.GI24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>	<20070724161646.GA24797@mellanox.co.il>	<20070724165032.GK16727@bauxite.pathscale.com>	<20070724165550.GD24797@mellanox.co.il>	<20070724170726.GL16727@bauxite.pathscale.com>	<20070724171924.GG24797@mellanox.co.il>	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
Message-ID: <46A64134.5080502@ichips.intel.com>

> Here's a short list off the top of my head
> 
> - A single git pull merges any number of backport changes
> - A single git reset ORIG_HEAD recovers from a conflicting merge
> - A single tag tags all code for all kernels
> - On update from upstream, if there is a conflict
>   between upstream code and a patch
>   it's easy to temporarily remove the patch, complete the merge,
>   and go bugger the patch author
> - For recent kernels there are almost no patches.
>   So an update from upstream for these kernels is free,
>   with branches I will still need to update all branches.
> - Adding a fix which only affects common code
>   is currently straight-forward: make a change, commit.
>   With multiple branches every fix must be pulled into
>   all branches.

You seem to be overlooking the fact that you already require a script to 
check that things work for all kernels.  Until you apply a series of 
patches to form a particular kernel, you don't know if a change that you 
pulled in caused a conflict.  You still have the requirement to verify 
the fix on all kernels, and it still requires running a script that 
pushes/pops patches to create each tree.

- Sean


From eitan at mellanox.co.il  Tue Jul 24 11:12:10 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 21:12:10 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>

Hi Hal,
 
The code to find "duplicated" GUIDs stems from real user cases where a
flawed burning procedure caused actual GUID duplications. There is nothing
"impossible".
So it is really critical that the SM will be able to recognize this case
and abort.

It might be that for testing someone wants to use a loopback plug that
causes the same port GUID to appear on both sides of a link - but it is
better to require the user doing the test to set some flag than to miss
such a situation in a real-life cluster.

This requirement was written after many people wasted many hours trying
to figure out what was going on.
PLEASE DO NOT TAKE IT AWAY
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 
________________________________

	From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
	Sent: Tuesday, July 24, 2007 6:04 PM
	To: Eitan Zahavi
	Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
	Subject: Re: OpenSM detection of duplicated GUIDs on loopback
	
	
	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 

		From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] 
		Sent: Tuesday, July 24, 2007 5:53 PM
		To: Eitan Zahavi
		Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny
Kliteynik
		Subject: Re: OpenSM detection of duplicated GUIDs on
loopback 
		
		 
			Hi Eitan,
			
			
			On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote: 

				Hi Hal,
				 
				What is this "loopback" connector used
for?
				Does not seem to me like a very useful
thing to do.

			 
			Perhaps not but no reason OpenSM can't handle
this more gracefully.


				Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.

			 
			Why would a separate flag be needed ?
			[EZ] Since I do not see any other solution for the SM to know
			it is really a loopback plug rather than two devices with the
			same GUID connected back to back ...

	 
	"Technically", this should only occur when looped back and not
two devices with same GUID as GUID == globally unique and a duplication
indicates a "manufacturing" issue.
	 
	Anyhow, can't these be treated the same (and handled more
gracefully) without an additional option/flag ?
	 
	-- Hal


			-- Hal


				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com] 
				Sent: Tuesday, July 24, 2007 5:31 PM
				To: OpenFabrics General
				Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
				Subject: OpenSM detection of duplicated
GUIDs on loopback
				
				 
				Hi,
				 
				This is what starts off as a "minor" issue and I know it
				has been discussed somewhat in the past:
				 
				Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
				
				__osm_ni_rcv_set_links
				{
				...
				          /*
				             When there are only two nodes with exact same guids (connected back
				             to back) - the previous check for duplicated guid will not catch
				             them. But the link will be from the port to itself...
				             Enhanced Port 0 is an exception to this
				          */
				          if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
				              (port_num == p_ni_context->port_num) &&
				              (port_num != 0))
				          {
				            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
				                     "__osm_ni_rcv_set_links: ERR 0D18: "
				                     "Duplicate GUID found by link from a port to itself:"
				                     "node 0x%" PRIx64 ", port number 0x%X\n",
				                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
				                     port_num );
				...
				
				So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO. 
				
				Is this really a fatal condition ?
Doesn't seem like it should be to me. 
				 
				Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
				 
				Seems like something like an extra loopback bit should be
				added to some port structure which should cause these links
				to be ignored. This bit would then be reset when the peer is
				no longer itself.
				
				Also, is there a relationship of this
with the 12x/duplicated GUID code ? 
				 
				Thanks.
				 
				-- Hal



From sashak at voltaire.com  Tue Jul 24 11:36:42 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 21:36:42 +0300
Subject: [ofa-general] [PATCH] opensm: detect fast switch reset and force LFT
	update
In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
Message-ID: <20070724183642.GC27878@sashak.voltaire.com>


Here we try to detect a "fast" switch reset (one so quick that OpenSM does
not see the down state within the sweep period) by checking the PortState
of all ports (for <= INIT). If such a reset is detected, the
p_sw->need_update flag remains "on", and in this case the switch's
forwarding tables are updated unconditionally.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_switch.h |    5 +++++
 opensm/opensm/osm_port_info_rcv.c  |    3 +++
 opensm/opensm/osm_state_mgr.c      |    1 +
 opensm/opensm/osm_switch.c         |    1 +
 opensm/opensm/osm_ucast_mgr.c      |    3 ++-
 5 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 5b2b19e..9364d2c 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -112,6 +112,7 @@ typedef struct _osm_switch
 	osm_fwd_tbl_t				fwd_tbl;
 	osm_mcast_tbl_t				mcast_tbl;
 	uint32_t				discovery_count;
+	unsigned				need_update;
 	void					*priv;
 } osm_switch_t;
 /*
@@ -152,6 +153,10 @@ typedef struct _osm_switch
 *		during the current fabric sweep.  This number is reset
 *		to zero at the start of a sweep.
 *
+*	need_update
+*		When set indicates that switch was probably reset, so
+*		fwd tables and rest cached data should be flushed
+*
 * SEE ALSO
 *	Switch object
 *********/
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index adece65..6fe2d1d 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -337,6 +337,9 @@ __osm_pi_rcv_process_switch_port(
     }
   }
 
+  if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw)
+    p_node->sw->need_update = 0;
+
   /*
     Update the PortInfo attribute.
   */
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 0181c0f..7efbe2a 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -565,6 +565,7 @@ __osm_state_mgr_reset_switch_count(
    }
 
    p_sw->discovery_count = 0;
+   p_sw->need_update = 1;
 }
 
 /**********************************************************************
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index a5a6fb7..2e170fc 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -104,6 +104,7 @@ osm_switch_init(
   p_sw->p_node = p_node;
   p_sw->switch_info = *p_si;
   p_sw->num_ports = num_ports;
+  p_sw->need_update = 1;
 
   status = osm_fwd_tbl_init( &p_sw->fwd_tbl, p_si );
   if( status != IB_SUCCESS )
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index b44a3ba..a8fc649 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table(
        osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ;
        block_id_ho++ )
   {
-    if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
+    if (!p_sw->need_update &&
+        !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
       continue;
 
     if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
-- 
1.5.3.rc2.29.gc4640f
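
The flag handling in the patch above can be read as a small state machine.
The following is only a stand-alone model of that logic, not the OpenSM
code itself: struct sw_model and the three function names are hypothetical,
and a single 64-byte LFT block stands in for the whole table. The link
state values follow the IB PortState encoding (Down=1 ... Active=4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PortState values as encoded in PortInfo. */
enum link_state { LINK_DOWN = 1, LINK_INIT = 2, LINK_ARMED = 3, LINK_ACTIVE = 4 };

#define LFT_BLOCK_SIZE 64

/* Hypothetical stand-in for the parts of osm_switch_t used here. */
struct sw_model {
	unsigned need_update;			/* 1: assume the switch was reset */
	unsigned char lft[LFT_BLOCK_SIZE];	/* SM's cached copy of one block  */
};

/* Start of a sweep: assume a reset until some port proves otherwise. */
void sweep_begin(struct sw_model *sw)
{
	assert(sw != NULL);
	sw->need_update = 1;
}

/* PortInfo arrived for one switch port. */
void port_info_received(struct sw_model *sw, enum link_state state)
{
	assert(sw != NULL);
	/* A port above INIT cannot have just gone through a reset. */
	if (state > LINK_INIT)
		sw->need_update = 0;
}

/* Returns 1 if the block has to be sent to the switch, 0 if it can be skipped. */
int lft_block_must_be_sent(const struct sw_model *sw, const unsigned char *wanted)
{
	assert(sw != NULL && wanted != NULL);
	if (!sw->need_update && !memcmp(sw->lft, wanted, LFT_BLOCK_SIZE))
		return 0;
	return 1;
}
```

With every port at INIT or below, the flag set at the start of the sweep is
never cleared, so even a block identical to the cached copy is sent again.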


From hal.rosenstock at gmail.com  Tue Jul 24 11:38:24 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 14:38:24 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
Message-ID: <f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Hal,
>
> The code to find "duplicated" GUIDs stems from real user cases where a
> flawed burning procedure caused actual GUID duplications. There is nothing
> "impossible".
>

No one said impossible; just a violation of what globally unique (GU from
GUID) really means. It's largely because vendors allowed users to program
non volatile RAM for GUIDs rather than a real manufacturing process for this
which guarantees uniqueness that we are even discussing this aspect of it.

> So it is really critical that the SM will be able to recognize this case
> and abort.
>

I agree with the detect part but not the abort part. Why can't it report
these errors and continue on ? That seems better to me than aborting.
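
To make the "report and continue" idea concrete, here is a rough sketch of
the loopback bit suggested earlier in this thread: log the self-link once,
ignore the link afterwards, and clear the bit when the peer is no longer
the port itself. All names (struct port_model, is_loopback,
classify_link) are hypothetical and none of this is existing OpenSM code.
Note that, as pointed out in the thread, this check alone cannot tell a
loopback plug from two back-to-back devices with the same GUID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port state; is_loopback is the proposed extra bit. */
struct port_model {
	uint64_t node_guid;
	uint8_t  port_num;
	unsigned is_loopback;
};

enum link_verdict {
	VERDICT_OK,		/* normal link, process as usual         */
	VERDICT_SELF_FIRST,	/* self-link seen for the first time: log */
	VERDICT_SELF_REPEAT	/* self-link already reported: stay quiet */
};

/*
 * Classify a discovered link. Same test as the ERR 0D18 check: same node
 * GUID, same port number, and not port 0 (Enhanced Port 0 is exempt).
 */
enum link_verdict classify_link(struct port_model *p,
				uint64_t peer_guid, uint8_t peer_port)
{
	assert(p != NULL);

	if (p->port_num != 0 &&
	    peer_guid == p->node_guid && peer_port == p->port_num) {
		if (p->is_loopback)
			return VERDICT_SELF_REPEAT;
		p->is_loopback = 1;
		return VERDICT_SELF_FIRST;
	}

	/* Peer is somebody else: reset the bit so a later self-link is logged. */
	p->is_loopback = 0;
	return VERDICT_OK;
}
```

A caller would log ERR 0D18 only on VERDICT_SELF_FIRST and skip the link
in both self-link cases, which avoids filling the log on every sweep.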

-- Hal


> It might be that for testing someone wants to use a loopback plug that
> causes the same port GUID to appear on both sides of a link - but it is
> better to require the user doing the test to set some flag than to miss
> such a situation in a real-life cluster.
>
> This requirement was written after many people wasted many hours trying
> to figure out what was going on.
> PLEASE DO NOT TAKE IT AWAY
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>  ------------------------------
> *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> *Sent:* Tuesday, July 24, 2007 6:04 PM
> *To:* Eitan Zahavi
> *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
>
>
>
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> >  *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> > Hi Eitan,
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > >
> > >  *Hi Hal,*
> > > **
> > > *What is this "loopback" connector used for?*
> > > *Does not seem to me like a very useful thing to do.*
> > >
> > **
> > Perhaps not but no reason OpenSM can't handle this more gracefully.
> >
> >  *Anyway, if it is not a production environment we could add a "debug
> > > mode" (-d flag option) to ignore this check.*
> > >
> > **
> > Why would a separate flag be needed ?
> > [EZ] Since I do not see any other solution for the SM to know it is
> > really a loopback plug rather than two devices with the same GUID connected
> > back to back ...
> >
> >
> "Technically", this should only occur when looped back and not two devices
> with same GUID as GUID == globally unique and a duplication indicates a
> "manufacturing" issue.
>
> Anyhow, can't these be treated the same (and handled more gracefully)
> without an additional option/flag ?
>
> -- Hal
>
>
> > -- Hal
> >
> >  **
> > >
> > > *Eitan Zahavi***
> > > Senior Engineering Director, Software Architect
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > >  ------------------------------
> > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > *Sent: *Tuesday, July 24, 2007 5:31 PM
> > > *To:* OpenFabrics General
> > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >  Hi,
> > >
> > > This is what starts off as a "minor" issue and I know it has been
> > > discussed somewhat in the past:
> > >
> > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > indicate duplicated GUID error 0D18 as follows:
> > >
> > > __osm_ni_rcv_set_links
> > > {
> > > ...
> > >           /*
> > >              When there are only two nodes with exact same guids (connected back
> > >              to back) - the previous check for duplicated guid will not catch
> > >              them. But the link will be from the port to itself...
> > >              Enhanced Port 0 is an exception to this
> > >           */
> > >           if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
> > >               (port_num == p_ni_context->port_num) &&
> > >               (port_num != 0))
> > >           {
> > >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> > >                      "Duplicate GUID found by link from a port to itself:"
> > >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> > >                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> > >                      port_num );
> > > ...
> > >
> > > So this occurs over and over and over and fills the log with the same
> > > spew. This should be improved IMO.
> > >
> > > Is this really a fatal condition ? Doesn't seem like it should be to
> > > me.
> > >
> > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
> > > safe for this condition ?
> > >
> > > Seems like something like an extra loopback bit should be added to
> > > some port structure which should cause these links to be ignored. This bit
> > > would then be reset when the peer is no longer itself.
> > >
> > > Also, is there a relationship of this with the 12x/duplicated GUID
> > > code ?
> > >
> > > Thanks.
> > >
> > > -- Hal
> > >
> > >
> >
>

From eitan at mellanox.co.il  Tue Jul 24 11:39:29 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 21:39:29 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>

Hi Hal,
 
For many users such a critical failure (one the SM cannot really do
anything about) is better aborted than forgotten in some log file.
Anyway, the -y flag lets you ignore it if you like.
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 
________________________________

	From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
	Sent: Tuesday, July 24, 2007 9:38 PM
	To: Eitan Zahavi
	Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
	Subject: Re: OpenSM detection of duplicated GUIDs on loopback
	
	
	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 

		Hi Hal,
		 
		The code to find "duplicated" GUIDs stem from real user
cases where flawed 
		burning procedure caused actual GUID duplications. There
is nothing "impossible". 

	 
	No one said impossible; just a violation of what globally unique
(GU from GUID) really means. It's largely because vendors allowed users
to program non volatile RAM for GUIDs rather than a real manufacturing
process for this which guarantees uniqueness that we are even discussing
this aspect of it. 


		So it is really critical the the SM will be able to
recognize this case and abort.

	 
	I agree with the detect part but not the abort part. Why can't
it report these errors and continue on ? That seems better to me than
aborting.
	 
	-- Hal


		It might be that for testing someone wants to use a
loopback plug that cause the same 
		port GUID appear on both sides of link - but it is
better to require the user doing the test 
		to set some flag than to miss such a situation in real
life cluster.
		 
		This requirement was written after many people wasted
many hours trying to figure out what was going on.
		PLEASE DO NOT TAKE IT AWAY
		
		 
		Eitan Zahavi 
		Senior Engineering Director, Software Architect 
		Mellanox Technologies LTD 
		Tel:+972-4-9097208
		Fax:+972-4-9593245 
		P.O. Box 586 Yokneam 20692 ISRAEL 

		 
________________________________

			From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
			Sent: Tuesday, July 24, 2007 6:04 PM 
			
			To: Eitan Zahavi
			Cc: OpenFabrics General; Sasha Khapyorsky;
Yevgeny Kliteynik
			Subject: Re: OpenSM detection of duplicated
GUIDs on loopback
			

			On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote: 

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 5:53 PM
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback 
				
				 
				Hi Eitan,
				
				
				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				What is this "loopback" connector used
for?
				Does not seem to me like a very useful
thing to do.

				 
				Perhaps not but no reason OpenSM can't
handle this more gracefully.


				Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.

				 
				Why would a separate flag be needed ?
				[EZ] Since I do not see any other
solution for the SM  to know it is really a loop back plug rather then
two devices with same GUID connected back to back ... 

			 
			"Technically", this should only occur when
looped back and not two devices with same GUID as GUID == globally
unique and a duplication indicates a "manufacturing" issue.
			 
			Anyhow, can't these be treated the same (and
handled more gracefully) without an additional option/flag ?
			 
			-- Hal


				-- Hal


				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com] 
				Sent: Tuesday, July 24, 2007 5:31 PM
				To: OpenFabrics General
				Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
				Subject: OpenSM detection of duplicated
GUIDs on loopback
				
				 
				Hi,
				 
				This is what starts off as a "minor"
issue and I know it has been discussed it somewhat in the past: 
				 
				Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
				
				__osm_ni_rcv_set_links
				{
				...
				          /*
				             When there are only two
nodes with exact same guids (connected back 
				             to back) - the previous
check for duplicated guid will not catch
				             them. But the link will be
from the port to itself...
				             Enhanced Port 0 is an
exception to this
				          */ 
				          if ((osm_node_get_node_guid(
p_node ) == p_ni_context->node_guid) &&
				              (port_num ==
p_ni_context->port_num) &&
				              (port_num != 0))
				          {
				            osm_log( p_rcv->p_log,
OSM_LOG_ERROR, 
	
"__osm_ni_rcv_set_links: ERR 0D18: "
				                     "Duplicate GUID
found by link from a port to itself:"
				                     "node 0x%" PRIx64
", port number 0x%X\n", 
				                     cl_ntoh64(
osm_node_get_node_guid( p_node ) ),
				                     port_num );
				...
				
				So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO. 
				
				Is this really a fatal condition ?
Doesn't seem like it should be to me. 
				 
				Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
				 
				Seems like something like an extra
loopback bit should be added to some port structure which should cause
these links to be ignored. This bit would then be reset when the peer is
now longer itself. 
				
				Also, is there a relationship of this
with the 12x/duplicated GUID code ? 
				 
				Thanks.
				 
				-- Hal



From hal.rosenstock at gmail.com  Tue Jul 24 11:55:35 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 14:55:35 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
Message-ID: <f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:

>  *Hi Hal,*
> **
> *For many users such a critical failure (one the SM can not really do
> anything with) is better aborted then forgotten in some log file.*
> *Anyway's the -y flag lets you ignore it if you like.*
>

So everything else continues to work fine with -y ? In which case, I'm not
sure which is the better default.

Users certainly won't like their logs filling up with continuous duplicated
GUID messages. The log spew should be cleaned up IMO.

-- Hal
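
A minimal sketch of the "loopback bit" idea raised in this thread, which
would also address the log spew: report the self-link once per port and
clear the flag when the peer is no longer the port itself. The struct
port_state type, its loopback_reported field, and check_self_link() are
hypothetical names for illustration, not OpenSM code; only the comparison
mirrors the quoted __osm_ni_rcv_set_links check.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-port state; not an OpenSM structure. */
struct port_state {
    uint64_t node_guid;        /* GUID of the node owning this port */
    uint8_t  port_num;         /* port number on that node */
    bool     loopback_reported; /* set once the self-link has been logged */
};

/*
 * Returns true only when a message was actually written to the log.
 * A link from a port to itself (same node GUID, same non-zero port
 * number) is logged the first time it is seen and then stays quiet
 * until the peer changes, at which point the flag is reset.
 */
bool check_self_link(struct port_state *port, uint64_t peer_guid,
                     uint8_t peer_port_num, FILE *log)
{
    assert(port != NULL && log != NULL);

    bool self_link = port->node_guid == peer_guid &&
                     port->port_num == peer_port_num &&
                     port->port_num != 0;

    if (!self_link) {
        /* Peer is no longer the port itself: re-arm the report. */
        port->loopback_reported = false;
        return false;
    }

    if (port->loopback_reported)
        return false; /* already logged for this port, stay quiet */

    port->loopback_reported = true;
    fprintf(log,
            "ERR 0D18: Duplicate GUID found by link from a port to itself: "
            "node 0x%" PRIx64 ", port number 0x%X\n",
            port->node_guid, (unsigned)port->port_num);
    return true;
}
```

With this shape, repeated sweeps over the same loopback plug produce one
log line instead of one per sweep. Note that it does not distinguish a
loopback plug from two back-to-back devices with the same GUID; it only
limits how often the condition is reported.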


>  *Eitan Zahavi***
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>  ------------------------------
> *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> *Sent:* Tuesday, July 24, 2007 9:38 PM
> *To:* Eitan Zahavi
> *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
>
>
>
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> >  *Hi Hal,*
> > **
> > *The code to find "duplicated" GUIDs stem from real user cases where
> > flawed *
> > *burning procedure caused actual GUID duplications. There is nothing
> > "impossible". *
> >
>
> No one said impossible; just a violation of what globally unique (GU from
> GUID) really means. It's largely because vendors allowed users to program
> non volatile RAM for GUIDs rather than a real manufacturing process for this
> which guarantees uniqueness that we are even discussing this aspect of it.
>
>  *So it is really critical the the SM will be able to recognize this case
> > and abort.*
> >
>
> I agree with the detect part but not the abort part. Why can't it report
> these errors and continue on ? That seems better to me than aborting.
>
> -- Hal
>
>
> > *It might be that for testing someone wants to use a loopback plug that
> > cause the same *
> > *port GUID appear on both sides of link - but it is better to require
> > the user doing the test *
> > *to set some flag than to miss such a situation in real life cluster.*
> > **
> > *This requirement was written after many people wasted many hours trying
> > to figure out what was going on.*
> > *PLEASE DO NOT TAKE IT AWAY*
> > **
> >
> > *Eitan Zahavi***
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >  ------------------------------
> > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > *Sent:* Tuesday, July 24, 2007 6:04 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > >
> > >  *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > > *To:* Eitan Zahavi
> > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >
> > > Hi Eitan,
> > >
> > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > >
> > > >  *Hi Hal,*
> > > > **
> > > > *What is this "loopback" connector used for?*
> > > > *Does not seem to me like a very useful thing to do.*
> > > >
> > > **
> > > Perhaps not but no reason OpenSM can't handle this more gracefully.
> > >
> > >  *Anyway, if it is not a production environment we could add a "debug
> > > > mode" (-d flag option) to ignore this check.*
> > > >
> > > **
> > > Why would a separate flag be needed ?
> > > *[EZ] Since I do not see any other solution for the SM  to know it is
> > > really a loop back plug rather then two devices with same GUID connected
> > > back to back ... *
> > >
> > >
> > "Technically", this should only occur when looped back and not two
> > devices with same GUID as GUID == globally unique and a duplication
> > indicates a "manufacturing" issue.
> >
> > Anyhow, can't these be treated the same (and handled more gracefully)
> > without an additional option/flag ?
> >
> > -- Hal
> >
> >
> > > -- Hal
> > >
> > >  **
> > > >
> > > > *Eitan Zahavi***
> > > > Senior Engineering Director, Software Architect
> > > > Mellanox Technologies LTD
> > > > Tel:+972-4-9097208
> > > > Fax:+972-4-9593245
> > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > >
> > > >
> > > >  ------------------------------
> > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > > *Sent: *Tuesday, July 24, 2007 5:31 PM
> > > > *To:* OpenFabrics General
> > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> > > >
> > > >
> > > >  Hi,
> > > >
> > > > This is what starts off as a "minor" issue and I know it has been
> > > > discussed it somewhat in the past:
> > > >
> > > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > > indicate duplicated GUID error 0D18 as follows:
> > > >
> > > > __osm_ni_rcv_set_links
> > > > {
> > > > ...
> > > >           /*
> > > >              When there are only two nodes with exact same guids
> > > > (connected back
> > > >              to back) - the previous check for duplicated guid will
> > > > not catch
> > > >              them. But the link will be from the port to itself...
> > > >              Enhanced Port 0 is an exception to this
> > > >           */
> > > >           if ((osm_node_get_node_guid( p_node ) ==
> > > > p_ni_context->node_guid) &&
> > > >               (port_num == p_ni_context->port_num) &&
> > > >               (port_num != 0))
> > > >           {
> > > >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > > >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> > > >                      "Duplicate GUID found by link from a port to
> > > > itself:"
> > > >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> > > >                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> > > >                      port_num );
> > > > ...
> > > >
> > > > So this occurs over and over and over and fills the log with the
> > > > same spew. This should be improved IMO.
> > > >
> > > > Is this really a fatal condition ? Doesn't seem like it should be to
> > > > me.
> > > >
> > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
> > > > safe for this condition ?
> > > >
> > > > Seems like something like an extra loopback bit should be added to
> > > > some port structure which should cause these links to be ignored. This bit
> > > > would then be reset when the peer is now longer itself.
> > > >
> > > > Also, is there a relationship of this with the 12x/duplicated GUID
> > > > code ?
> > > >
> > > > Thanks.
> > > >
> > > > -- Hal
> > > >
> > > >
> > >
> >
>

From mst at dev.mellanox.co.il  Tue Jul 24 12:05:02 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 22:05:02 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <46A64134.5080502@ichips.intel.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<46A64134.5080502@ichips.intel.com>
Message-ID: <20070724190502.GA29012@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> >Here's a short list off the top of my head
> >
> >- A single git pull merges any number of backport changes
> >- A single git reset ORIG_HEAD recovers from a conflicting merge
> >- A single tag tags all code for all kernels
> >- On update from upstream, if there is a conflict
> >  between upstream code and and a patch
> >  it's easy to temporarily remote the patch, complete the merge,
> >  and go bugger the patch author
> >- For recent kernels there are almost no patches.
> >  So an update from upstream for these kernels is free,
> >  with branches I will still need to update all branches.
> >- Adding a fix which only affects common code
> >  is currently straight-forward: make a change, commit.
> >  With multiple branches every fix must be pulled into
> >  all branches.
> 
> You seem to be overlooking the fact that you already require a script to 
> check that things work for all kernels.  Until you apply a series of 
> patches to form a particular kernel, you don't know if a change that you 
> pulled in caused a conflict.  You still have the requirement to verify 
> the fix on all kernels, and it still requires running a script that 
> pushes/pops patches to create each tree.

Yes. But I find it preferable to manage history with the
full power of native git tools, where a single hash identifies a revision,
and limit the scope of the scripts to the build process.

This, as opposed to an elaborate methodology that is based
on naming conventions, and requires use of scripts to do
basic tasks such as tagging, history rewriting, etc.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 24 12:06:56 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 22:06:56 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <46A63DA2.5000602@ichips.intel.com>
References: <20070724171445.GE24797@mellanox.co.il>
	<000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com>
	<20070724174255.GH24797@mellanox.co.il>
	<46A63DA2.5000602@ichips.intel.com>
Message-ID: <20070724190656.GB29012@mellanox.co.il>

> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
> 
> >But the proposal here was to have a bush of branches, all of which
> >need to be merged at the same time. It's possible that some
> >would merge and some would fail, leaving me in an inconsistent state,
> >and no easy way to get back to where I started.
> 
> A fix could be applied to some kernels, but not others.  In fact, if a 
> patch works for kernel X & Y, but has a conflict with kernel Z, then 
> different patches are needed anyway.  I don't see the requirement to 
> merge everything or even apply a fix to all kernels at the same time.

This is typically the component maintainer's job, not the integrator's.
As an integrator, I want to pull but if the merge fails,
reset everything back to the original state, and let the maintainer know.

-- 
MST


From hadi at cyberus.ca  Tue Jul 24 12:28:20 2007
From: hadi at cyberus.ca (jamal)
Date: Tue, 24 Jul 2007 15:28:20 -0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <OF782207DD.01F7022B-ON65257322.000F7C16-65257322.0014942E@in.ibm.com>
References: <OF782207DD.01F7022B-ON65257322.000F7C16-65257322.0014942E@in.ibm.com>
Message-ID: <1185305300.26013.152.camel@localhost>

KK,

On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:

> 
> J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/23/2007 06:02:01 PM:


> Actually you have not sent netperf results with prep and without prep.

My results were based on pktgen (which i explained as testing the
driver). I think depending on netperf without further analysis is
simplistic. It was like me doing forwarding tests on these patches.

> > So _which_ non-LLTX driver doesnt do that? ;->
> 
> I have no idea since I haven't looked at all drivers. Can you tell which
> all non-LLTX drivers does that ? I stated this as the sole criterea.

The few i have peeked at all do it. I also think the e1000 should be
converted to be non-LLTX. The rest of netdev is screaming to kill LLTX. 

> > tun driver doesnt use it either - but i doubt that makes it "bloat"
> 
> Adding extra code that is currently not usable (esp from a submission
> point) is bloat.

So far i have converted 3 drivers, 1 of them doesnt use it. Two more
driver conversions are on the way, they will both use it. How is this
bloat again? 
A few emails back you said if only IPOIB can use batching then thats
good enough justification. 

> > You waltz in, have the luxury of looking at my code, presentations, many
> > discussions with me etc ...
> 
> "luxury" ? 
> I had implemented the entire thing even before knowing that you
> are working on something similar! and I had sent the first proposal to
> netdev,

I saw your patch at the end of may (or at least 2 weeks after you said
it existed). That patch has very little resemblance to what you just
posted conceptwise or codewise. I could post it if you would give me
permission.

> *after* which you told that you have your own code and presentations (which
> I had never seen earlier - I joined netdev a few months back, earlier I was
> working on RDMA, Infiniband as you know).

I am gonna assume you didn't know of my work - which i have been making
public for about 3 years. In fact i talked about this topic when i
visited your office in 2006 on a day you were not present, so it is
plausible you didn't hear of it.

>  And it didn't give me any great
> ideas either, remember I had posted results for E1000 at the time of
> sending the proposals. 

In mid-June you sent me a series of patches which included anything from
changing variable names to combining qdisc_restart and about everything
i referred to as being "cosmetic differences" in your posted patches. I
took two of those and incorporated them in. One was an "XXX" in my code
already to allocate the dev->blist 
(Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). 
The other one was a mechanical removal of the blist being passed
(Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). 
Some of the others i asked you to defer. For example, the reason i gave
you for not merging any qdisc_restart_combine changes is because i was
waiting for Dave to swallow the qdisc_restart changes i made; otherwise
maintenance becomes extremely painful for me. 
Sridhar actually provided a lot more valuable comments and fixes but has
not planted a flag on behalf of the queen of spain like you did. 

> However I do give credit in my proposal to you for what
> ideas that your provided (without actual code), and the same I did for other
> people who did the same, like Dave, Sridhar. BTW, you too had discussions with me,
> and I sent some patches to improve your code too, 

I incorporated two of your patches and asked for deferral of others.
These patches have now shown up in what you claim as "the difference". I
just call them "cosmetic difference" not to downplay the importance of
having an ethtool interface but because they do not make batching
perform any better. The real differences are those two items. I am
surprised you haven't cannibalized those changes as well. I thought you
renamed them to something else; according to your posting:
"This patch will work with drivers updated by Jamal, Matt & Michael Chan
with minor modifications - rename xmit_win to xmit_slots & rename batch
handler". Or maybe thats a "future plan" you have in mind?

> so it looks like a two
> way street to me (and that is how open source works and should).

Open source is a lot more transparent than that.

You posted a question, which was part of your research. I responded and
told you i have patches; you asked me for them and i promptly ported
them from pre-2.6.18 to the latest kernel at the time. 

The nature of this batching work is one of performance. So numbers are
important. If you had some strong disagreements on something in the
architecture, then it would be of great value to explain it in a
technical detail - and more importantly to provide some numbers to say
why it is a bad idea. You get numbers by running some tests. 
You did none of the above. Your effort has been to produce "your patch"
for whatever reasons. This would not have been problematic to me if it
actually was based within reasons of optimization because the end goal
would have been achieved.

I have deleted the rest of the email because it goes back and forth on
the same points. 

I am gonna continue work on the current tree i have. I will put more
time when i get back next week (and hopefully no travel right after).
I will upgrade to Daves tree later when i get the two new drivers in. I
am probably gonna hold on until the new NAPI stuff settles in first. You
are welcome to submit the ipoib changes. You are also welcome to
co-author with me but you will have to work for it this time.

cheers,
jamal


From tom at opengridcomputing.com  Tue Jul 24 12:31:52 2007
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 24 Jul 2007 14:31:52 -0500
Subject: [ofa-general] [PATCH] amso1100: QP init bug in amso driver
Message-ID: <1185305512.20489.6.camel@trinity.ogc.int>

Roland:

The guys at UNH found this and fixed it. I'm surprised no
one has hit this before. I guess it only breaks when the 
refcount on the QP is non-zero.

Initialize the wait_queue_head_t in the c2_qp structure.

Signed-off-by: Ethan Burns <eaburns at iol.unh.edu>
Acked-by: Tom Tucker <tom at opengridcomputing.com>

---
 drivers/infiniband/hw/amso1100/c2_qp.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
 	qp->send_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge;
+	init_waitqueue_head(&qp->wait);
 
 	/* Initialize the SQ MQ */
 	q_size = be32_to_cpu(reply->sq_depth);


From eitan at mellanox.co.il  Tue Jul 24 13:20:37 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 23:20:37 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>

Maybe avoid the log if -y is provided?
 

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 
________________________________

	From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
	Sent: Tuesday, July 24, 2007 9:56 PM
	To: Eitan Zahavi
	Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
	Subject: Re: OpenSM detection of duplicated GUIDs on loopback
	
	
	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:

		Hi Hal,
		 
		For many users such a critical failure (one the SM can
not really do anything with) is better aborted then forgotten in some
log file. 
		Anyway's the -y flag lets you ignore it if you like.

	 
	So everything else continues to work fine with -y ? In which
case, I'm not sure which is the better default.
	 
	Users certainly won't like their logs filling up with continuous
duplicated GUID messages. The log spew should be cleaned up IMO.
	 
	-- Hal

	 
		Eitan Zahavi 
		Senior Engineering Director, Software Architect 
		Mellanox Technologies LTD 
		Tel:+972-4-9097208
		Fax:+972-4-9593245 
		P.O. Box 586 Yokneam 20692 ISRAEL 

		 
________________________________

			From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
			Sent: Tuesday, July 24, 2007 9:38 PM 
			
			To: Eitan Zahavi
			Cc: OpenFabrics General; Sasha Khapyorsky;
Yevgeny Kliteynik
			Subject: Re: OpenSM detection of duplicated
GUIDs on loopback
			

			On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote: 

				Hi Hal,
				 
				The code to find "duplicated" GUIDs stem
from real user cases where flawed 
				burning procedure caused actual GUID
duplications. There is nothing "impossible". 

			 
			No one said impossible; just a violation of what
globally unique (GU from GUID) really means. It's largely because
vendors allowed users to program non volatile RAM for GUIDs rather than
a real manufacturing process for this which guarantees uniqueness that
we are even discussing this aspect of it. 


				So it is really critical the the SM will
be able to recognize this case and abort.

			 
			I agree with the detect part but not the abort
part. Why can't it report these errors and continue on ? That seems
better to me than aborting.
			 
			-- Hal


				It might be that for testing someone
wants to use a loopback plug that cause the same 
				port GUID appear on both sides of link -
but it is better to require the user doing the test 
				to set some flag than to miss such a
situation in real life cluster.
				 
				This requirement was written after many
people wasted many hours trying to figure out what was going on.
				PLEASE DO NOT TAKE IT AWAY
				
				 
				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 6:04 PM 
				
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback
				

				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 5:53 PM
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback 
				
				 
				Hi Eitan,
				
				
				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				What is this "loopback" connector used
for?
				Does not seem to me like a very useful
thing to do.

				 
				Perhaps not but no reason OpenSM can't
handle this more gracefully.


				Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.

				 
				Why would a separate flag be needed ?
				[EZ] Since I do not see any other
solution for the SM  to know it is really a loop back plug rather then
two devices with same GUID connected back to back ... 

				 
				"Technically", this should only occur
when looped back and not two devices with same GUID as GUID == globally
unique and a duplication indicates a "manufacturing" issue.
				 
				Anyhow, can't these be treated the same
(and handled more gracefully) without an additional option/flag ?
				 
				-- Hal


				-- Hal


				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com] 
				Sent: Tuesday, July 24, 2007 5:31 PM
				To: OpenFabrics General
				Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
				Subject: OpenSM detection of duplicated
GUIDs on loopback
				
				 
				Hi,
				 
				This is what starts off as a "minor"
issue and I know it has been discussed somewhat in the past:
				 
				Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
				
				__osm_ni_rcv_set_links
				{
				...
				          /*
				             When there are only two nodes with exact same guids
				             (connected back to back) - the previous check for
				             duplicated guid will not catch them. But the link
				             will be from the port to itself...
				             Enhanced Port 0 is an exception to this
				          */
				          if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
				              (port_num == p_ni_context->port_num) &&
				              (port_num != 0))
				          {
				            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
				                     "__osm_ni_rcv_set_links: ERR 0D18: "
				                     "Duplicate GUID found by link from a port to itself:"
				                     "node 0x%" PRIx64 ", port number 0x%X\n",
				                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
				                     port_num );
				...
				
				So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO. 
				
				Is this really a fatal condition ?
Doesn't seem like it should be to me. 
				 
				Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
				 
				Seems like something like an extra
loopback bit should be added to some port structure which should cause
these links to be ignored. This bit would then be reset when the peer is
no longer itself.
				
				Also, is there a relationship of this
with the 12x/duplicated GUID code ? 
				 
				Thanks.
				 
				-- Hal



From hal.rosenstock at gmail.com  Tue Jul 24 13:25:46 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 16:25:46 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
Message-ID: <f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
>  *Maybe  avoid the log if -y is provided?*
>

That avoids the spew, but the duplicated GUID is important to know about, so
IMO something in the "middle" is needed: duplicated GUIDs are logged, but the
same ones are not logged continually.
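
One way to picture such a "middle" ground is sketched below. This is only an illustration and not OpenSM code: the structure `port_dup_state` and the helper `dup_guid_should_log()` are hypothetical names, and the sketch assumes the SM can keep one flag per port between sweeps. The condition is reported once, stays quiet while nothing changes, and is re-armed as soon as the port no longer sees itself as its peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port bookkeeping; not an OpenSM structure. */
struct port_dup_state {
	uint64_t node_guid;    /* GUID of the node that owns the port */
	uint8_t  port_num;     /* port number on that node */
	bool     dup_reported; /* ERR 0D18 already logged for this port */
};

/*
 * Decide whether the duplicate-GUID error should be logged on this sweep.
 * peer_guid/peer_port describe what the port currently sees at the other
 * end of its link. Returns true when the caller should emit the log line.
 */
static bool dup_guid_should_log(struct port_dup_state *p,
				uint64_t peer_guid, uint8_t peer_port)
{
	assert(p != NULL);

	/* Same test as the 0D18 check: link from a port to itself, port != 0. */
	bool link_to_self = p->port_num != 0 &&
			    peer_guid == p->node_guid &&
			    peer_port == p->port_num;

	if (!link_to_self) {
		/* Peer changed (e.g. cable moved): re-arm the report. */
		p->dup_reported = false;
		return false;
	}
	if (p->dup_reported)
		return false;	/* same condition as last sweep: stay quiet */

	p->dup_reported = true;
	return true;
}
```

With this shape the first sweep that finds the condition logs it, later sweeps are silent, and a port that is reconnected elsewhere and then looped back again is reported afresh.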

 *Eitan Zahavi***
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>  ------------------------------
> *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> *Sent:* Tuesday, July 24, 2007 9:56 PM
> *To:* Eitan Zahavi
> *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
>
>
>
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> >  *Hi Hal,*
> > **
> > *For many users such a critical failure (one the SM can not really do
> > anything with) is better aborted than forgotten in some log file.*
> > *Anyway, the -y flag lets you ignore it if you like.*
> >
>
> So everything else continues to work fine with -y ? In which case, I'm not
> sure which is the better default.
>
> Users certainly won't like their logs filling up with continuous
> duplicated GUID messages. The log spew should be cleaned up IMO.
>
> -- Hal
>
>
>
>
>
> >  *Eitan Zahavi***
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >  ------------------------------
> > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > *Sent:* Tuesday, July 24, 2007 9:38 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > >
> > >  *Hi Hal,*
> > > **
> > > *The code to find "duplicated" GUIDs stems from real user cases where a*
> > > *flawed burning procedure caused actual GUID duplications. There is nothing*
> > > *"impossible".*
> > >
> >
> > No one said impossible; just a violation of what globally unique (GU
> > from GUID) really means. It's largely because vendors allowed users to
> > program non volatile RAM for GUIDs rather than a real manufacturing process
> > for this which guarantees uniqueness that we are even discussing this aspect
> > of it.
> >
> > > *So it is really critical that the SM will be able to recognize this
> > > case and abort.*
> > >
> >
> > I agree with the detect part but not the abort part. Why can't it report
> > these errors and continue on ? That seems better to me than aborting.
> >
> > -- Hal
> >
> >
> > > *It might be that for testing someone wants to use a loopback plug*
> > > *that causes the same port GUID to appear on both sides of the link,*
> > > *but it is better to require the user doing the test*
> > > *to set some flag than to miss such a situation in a real-life cluster.*
> > > **
> > > *This requirement was written after many people wasted many hours
> > > trying to figure out what was going on.*
> > > *PLEASE DO NOT TAKE IT AWAY*
> > > **
> > >
> > > *Eitan Zahavi***
> > > Senior Engineering Director, Software Architect
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > >  ------------------------------
> > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > *Sent:* Tuesday, July 24, 2007 6:04 PM
> > > *To:* Eitan Zahavi
> > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >
> > >
> > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > >
> > > >  *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > > > *To:* Eitan Zahavi
> > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > > >
> > > >
> > > >
> > > > Hi Eitan,
> > > >
> > > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > > >
> > > > >  *Hi Hal,*
> > > > > **
> > > > > *What is this "loopback" connector used for?*
> > > > > *Does not seem to me like a very useful thing to do.*
> > > > >
> > > > **
> > > > Perhaps not but no reason OpenSM can't handle this more gracefully.
> > > >
> > > >  *Anyway, if it is not a production environment we could add a
> > > > > "debug mode" (-d flag option) to ignore this check.*
> > > > >
> > > > **
> > > > Why would a separate flag be needed ?
> > > > *[EZ] Since I do not see any other solution for the SM to know it
> > > > is really a loopback plug rather than two devices with same GUID connected
> > > > back to back ... *
> > > >
> > > >
> > > "Technically", this should only occur when looped back and not two
> > > devices with same GUID as GUID == globally unique and a duplication
> > > indicates a "manufacturing" issue.
> > >
> > > Anyhow, can't these be treated the same (and handled more gracefully)
> > > without an additional option/flag ?
> > >
> > > -- Hal
> > >
> > >
> > > > -- Hal
> > > >
> > > >  **
> > > > >
> > > > > *Eitan Zahavi***
> > > > > Senior Engineering Director, Software Architect
> > > > > Mellanox Technologies LTD
> > > > > Tel:+972-4-9097208
> > > > > Fax:+972-4-9593245
> > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > >
> > > > >
> > > > >  ------------------------------
> > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > > > *Sent: *Tuesday, July 24, 2007 5:31 PM
> > > > > *To:* OpenFabrics General
> > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> > > > >
> > > > >
> > > > >  Hi,
> > > > >
> > > > > > This is what starts off as a "minor" issue and I know it has been
> > > > > > discussed somewhat in the past:
> > > > >
> > > > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > > > indicate duplicated GUID error 0D18 as follows:
> > > > >
> > > > > > __osm_ni_rcv_set_links
> > > > > > {
> > > > > > ...
> > > > > >           /*
> > > > > >              When there are only two nodes with exact same guids
> > > > > >              (connected back to back) - the previous check for
> > > > > >              duplicated guid will not catch them. But the link
> > > > > >              will be from the port to itself...
> > > > > >              Enhanced Port 0 is an exception to this
> > > > > >           */
> > > > > >           if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
> > > > > >               (port_num == p_ni_context->port_num) &&
> > > > > >               (port_num != 0))
> > > > > >           {
> > > > > >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > > > > >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> > > > > >                      "Duplicate GUID found by link from a port to itself:"
> > > > > >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> > > > > >                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> > > > > >                      port_num );
> > > > > > ...
> > > > >
> > > > > So this occurs over and over and over and fills the log with the
> > > > > same spew. This should be improved IMO.
> > > > >
> > > > > Is this really a fatal condition ? Doesn't seem like it should be
> > > > > to me.
> > > > >
> > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is
> > > > > that safe for this condition ?
> > > > >
> > > > > > Seems like something like an extra loopback bit should be added to
> > > > > > some port structure which should cause these links to be ignored. This bit
> > > > > > would then be reset when the peer is no longer itself.
> > > > >
> > > > > Also, is there a relationship of this with the 12x/duplicated GUID
> > > > > code ?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > -- Hal
> > > > >
> > > > >
> > > >
> > >
> >
>

From eitan at mellanox.co.il  Tue Jul 24 13:25:32 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 23:25:32 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
	<f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>


	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 

		Maybe  avoid the log if -y is provided?

	 
	That avoids the spew but the duplicated GUID is important to
know so IMO something in the "middle" is needed where duplicated GUIDs
are logged but not continually the same ones.
	[EZ]  
	OK, so in -y mode only, we track which ones were already reported and do
not repeat the log?
	 

		Eitan Zahavi 
		Senior Engineering Director, Software Architect 
		Mellanox Technologies LTD 
		Tel:+972-4-9097208
		Fax:+972-4-9593245 
		P.O. Box 586 Yokneam 20692 ISRAEL 

		 
________________________________

			From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
			Sent: Tuesday, July 24, 2007 9:56 PM 
			
			To: Eitan Zahavi
			Cc: OpenFabrics General; Sasha Khapyorsky;
Yevgeny Kliteynik
			Subject: Re: OpenSM detection of duplicated
GUIDs on loopback
			

			On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote:

				Hi Hal,
				 
				For many users such a critical failure
(one the SM can not really do anything with) is better aborted than
forgotten in some log file.
				Anyway, the -y flag lets you ignore it
if you like.

			 
			So everything else continues to work fine with
-y ? In which case, I'm not sure which is the better default.
			 
			Users certainly won't like their logs filling up
with continuous duplicated GUID messages. The log spew should be cleaned
up IMO.
			 
			-- Hal

			 
				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 9:38 PM 
				
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback
				

				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				The code to find "duplicated" GUIDs stems
from real user cases where a flawed
				burning procedure caused actual GUID
duplications. There is nothing "impossible".

				 
				No one said impossible; just a violation
of what globally unique (GU from GUID) really means. It's largely
because vendors allowed users to program non volatile RAM for GUIDs
rather than a real manufacturing process for this which guarantees
uniqueness that we are even discussing this aspect of it. 


				So it is really critical that the SM will
be able to recognize this case and abort.

				 
				I agree with the detect part but not the
abort part. Why can't it report these errors and continue on ? That
seems better to me than aborting.
				 
				-- Hal


				It might be that for testing someone
wants to use a loopback plug that causes the same
				port GUID to appear on both sides of the link,
but it is better to require the user doing the test
				to set some flag than to miss such a
situation in a real-life cluster.
				 
				This requirement was written after many
people wasted many hours trying to figure out what was going on.
				PLEASE DO NOT TAKE IT AWAY
				
				 
				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 6:04 PM 
				
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback
				

				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 5:53 PM
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback 
				
				 
				Hi Eitan,
				
				
				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				What is this "loopback" connector used
for?
				Does not seem to me like a very useful
thing to do.

				 
				Perhaps not but no reason OpenSM can't
handle this more gracefully.


				Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.

				 
				Why would a separate flag be needed ?
				[EZ] Since I do not see any other
solution for the SM to know it is really a loopback plug rather than
two devices with same GUID connected back to back ...

				 
				"Technically", this should only occur
when looped back and not two devices with same GUID as GUID == globally
unique and a duplication indicates a "manufacturing" issue.
				 
				Anyhow, can't these be treated the same
(and handled more gracefully) without an additional option/flag ?
				 
				-- Hal


				-- Hal


				Eitan Zahavi 
				Senior Engineering Director, Software
Architect 
				Mellanox Technologies LTD 
				Tel:+972-4-9097208
				Fax:+972-4-9593245 
				P.O. Box 586 Yokneam 20692 ISRAEL 

				 
________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com] 
				Sent: Tuesday, July 24, 2007 5:31 PM
				To: OpenFabrics General
				Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
				Subject: OpenSM detection of duplicated
GUIDs on loopback
				
				 
				Hi,
				 
				This is what starts off as a "minor"
issue and I know it has been discussed somewhat in the past:
				 
				Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
				
				__osm_ni_rcv_set_links
				{
				...
				          /*
				             When there are only two nodes with exact same guids
				             (connected back to back) - the previous check for
				             duplicated guid will not catch them. But the link
				             will be from the port to itself...
				             Enhanced Port 0 is an exception to this
				          */
				          if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
				              (port_num == p_ni_context->port_num) &&
				              (port_num != 0))
				          {
				            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
				                     "__osm_ni_rcv_set_links: ERR 0D18: "
				                     "Duplicate GUID found by link from a port to itself:"
				                     "node 0x%" PRIx64 ", port number 0x%X\n",
				                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
				                     port_num );
				...
				
				So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO. 
				
				Is this really a fatal condition ?
Doesn't seem like it should be to me. 
				 
				Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
				 
				Seems like something like an extra
loopback bit should be added to some port structure which should cause
these links to be ignored. This bit would then be reset when the peer is
no longer itself.
				
				Also, is there a relationship of this
with the 12x/duplicated GUID code ? 
				 
				Thanks.
				 
				-- Hal



From sashak at voltaire.com  Tue Jul 24 14:54:41 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 00:54:41 +0300
Subject: [ofa-general] [PATCH] opensm: detect port external reset and flush
	cached tables
In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
Message-ID: <20070724215441.GA25264@sashak.voltaire.com>


This detects port external reset by validating PortState == INIT, and
when detected flushes cached port related tables - re-reads pkey table
and drops (overwrites) SL2VL and VLArb tables.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_port.h  |    5 +++++
 opensm/opensm/osm_port.c          |    1 +
 opensm/opensm/osm_port_info_rcv.c |    9 ++++++++-
 opensm/opensm/osm_qos.c           |    9 +++++----
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index f6c40c7..44323ab 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -118,6 +118,7 @@ typedef struct _osm_physp
 	struct _osm_physp	*p_remote_physp;
 	boolean_t		healthy;
 	uint8_t			vl_high_limit;
+	unsigned		need_update;
 	osm_dr_path_t		dr_path;
 	osm_pkey_tbl_t		pkeys;
 	ib_vl_arb_table_t	vl_arb[4];
@@ -157,6 +158,10 @@ typedef struct _osm_physp
 *		PortInfo:VLHighLimit value which installed by QoS manager
 *		and should be uploaded to port's PortInfo
 *
+*	need_update
+*		When set indicates that port was probably reset and port
+*		related tables (PKey, SL2VL, VLArb) require refreshing.
+*
 *	dr_path
 *		The directed route path to this port.
 *
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index e03e316..11cc5ca 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -118,6 +118,7 @@ osm_physp_init(
   p_physp->port_guid = port_guid;
   p_physp->port_num = port_num;
   p_physp->healthy = TRUE;
+  p_physp->need_update = 2;
   p_physp->p_node = (struct _osm_node*)p_node;
 
   osm_dr_path_init(
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 6fe2d1d..0528e38 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -801,6 +801,12 @@ osm_pi_rcv_process(
       p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
     }
 
+    /* if port just inited or reached INIT state (external reset)
+       request update for port related tables */
+    p_physp->need_update =
+      (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
+       p_physp->need_update > 1 ) ? 1 : 0;
+
     switch( osm_node_get_type( p_node ) )
     {
     case IB_NODE_TYPE_CA:
@@ -824,7 +830,8 @@ osm_pi_rcv_process(
     /*
       Get the tables on the physp.
     */
-    __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
+    if (p_physp->need_update)
+      __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
 
   }
 
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index 17b7e3a..596b6d4 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -87,8 +87,9 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req,
 	for (i = 0; i < block_length; i++)
 		block.vl_entry[i].vl &= vl_mask;
 
-	if (!memcmp(&p->vl_arb[block_num], &block,
-		     block_length * sizeof(block.vl_entry[0])))
+	if (!p->need_update &&
+	    !memcmp(&p->vl_arb[block_num], &block,
+		    block_length * sizeof(block.vl_entry[0])))
 		return IB_SUCCESS;
 
 	context.vla_context.node_guid =
@@ -170,8 +171,8 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req,
 		tbl.raw_vl_by_sl[i] = (vl1 << 4 ) | vl2 ;
 	}
 
-	p_tbl = osm_physp_get_slvl_tbl(p, in_port);
-	if (p_tbl && !memcmp(p_tbl, &tbl, sizeof(tbl)))
+	if (!p->need_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) &&
+	    !memcmp(p_tbl, &tbl, sizeof(tbl)))
 		return IB_SUCCESS;
 
 	context.slvl_context.node_guid = osm_node_get_node_guid(p_node);
-- 
1.5.3.rc2.29.gc4640f
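
The life cycle of the need_update field in the patch above can be followed in isolation with the small model below. It is only a sketch for readers, not OpenSM code: `struct physp_model`, `physp_model_init()`, `on_port_info()`, `may_skip_table_write()` and the `LINK_*` values are stand-ins for the real osm_physp_t, osm_physp_init(), osm_pi_rcv_process() and the IB port states. The field starts at 2, drops to 1 on the first PortInfo, then to 0, and returns to 1 whenever the port is seen in INIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the IB PortState values (Down=1, Init=2, Armed=3, Active=4). */
enum link_state { LINK_DOWN = 1, LINK_INIT = 2, LINK_ARMED = 3, LINK_ACTIVE = 4 };

/* Hypothetical reduction of osm_physp_t to the one field of interest. */
struct physp_model {
	unsigned need_update;
};

/* Mirrors the osm_physp_init() hunk: a new port starts at 2. */
static void physp_model_init(struct physp_model *p)
{
	assert(p != NULL);
	p->need_update = 2;
}

/* Mirrors the osm_pi_rcv_process() hunk: refresh on INIT or first sight. */
static void on_port_info(struct physp_model *p, enum link_state state)
{
	assert(p != NULL);
	p->need_update = (state == LINK_INIT || p->need_update > 1) ? 1 : 0;
}

/*
 * Mirrors the osm_qos.c hunks: the cached-table shortcut is taken only
 * when no refresh is pending and the cached copy equals the new table.
 */
static bool may_skip_table_write(const struct physp_model *p, bool cache_matches)
{
	return !p->need_update && cache_matches;
}
```

In this model a port that stays ACTIVE across sweeps keeps the shortcut, while a port that reappears in INIT has its SL2VL and VLArb tables rewritten even if the cached copies match.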


From arthur.jones at qlogic.com  Tue Jul 24 15:19:50 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 15:19:50 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724175220.GI24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
Message-ID: <20070724221950.GP16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 08:52:20PM +0300, Michael S. Tsirkin wrote:
> > i'd _really_ like to see a list of the advantages of
> > patches over branches.  it's hard for me to know if
> > i'm just missing something if the case is not laid out...

thanks for the list...

> Here's a short list off the top of my head
> 
> - A single git pull merges any number of backport changes

ok, you can run one command instead of a 4-line
script.   hmm, i guess you could say this is a
very slight advantage to using patches...
 
> - A single git reset ORIG_HEAD recovers from a conflicting merge

handling conflicts is a big part of a maintainer's
job!  the _vast_ majority of the time i bet you already
know how to do the merge.  if you don't, then only
the backport branches which haven't merged yet are
stuck and you can pick up where you left off (which
is how i do it now).  but if you're stuck in some
strange intermediate state with some patches pushed
and some yet to push in the configure script, i could
see how you'd want to punt.  but, someone is doing
this work, and that someone almost certainly has a
difficult time reproducing and developing a stack of
patches..

if, though, you must have a pristine environment,
this is easily solved by using an intermediate repo:

git clone -s <canonical repo>
<run the pull>
<any conflicts, dump this guy, otherwise, pull this in>

i bet this is very similar time-wise to running the
merge, then the ofed_scripts/configure over all supported
branches.  merges in git are _fast_...

> - A single tag tags all code for all kernels

store commit ids in a file and tag that?

> - On update from upstream, if there is a conflict
>   between upstream code and a patch
>   it's easy to temporarily remove the patch, complete the merge,
>   and go bugger the patch author

i think this is easier with the backport branches,
see git clone -s above.  or, just fixup the error.
the reason you have to bugger the author may be that
you don't have the tools necessary to actually fix
up the patch -- but you can prob bet the author doesn't
like to fixup patches in quilt any more than you do...

> - For recent kernels there are almost no patches.
>   So an update from upstream for these kernels is free,
>   with branches I will still need to update all branches.

i can say from a couple months experience that
upstream merges are "free" using backport branches.
running the script to reflow the branches is
_far_ less complex than the configure script,
has fewer dependencies and is much simpler to
maintain and understand.  also, if the upstream
changes touch code that conflicts with a backport
patch, you get to fix the problem as it happens
in a much more comfortable environment (i.e. you
don't need quilt)...

> - Adding a fix which only affects common code
>   is currently straight-forward: make a change, commit.
>   With multiple branches every fix must be pulled into
>   all branches.

this use case is actually a good reason to
use backport branches.  with the patches,
you still need to fan out the changes to all
the backport branches.  but, in general, you
don't.  so you end up making a change and
_not realizing_ that it broke some random
backport patch.  by reflowing after every
change, you get to see it break right there
in front of you and you're way more likely
to know how to fix it.  you could do this
with the build script too, but that would
require a 4 line script -- and you'd need
to switch over to using quilt or some other
patch queue based system (yuck!)...

all your points above you made from the POV
of the maintainer.  but, what about the _users_
of the repo.  as long as changes are kept as
patches, trying to figure out what has
changed with your latest round of backports
comes down to recreating a tree and pulling
from that.  it's extremely fragile and error
prone.  there is only one maintainer, but many
developers.  if we can make their lives significantly
easier then it should be a net gain...

the backport branches make merging upstream changes
easier.  they make merging developer changes easier.
they make finding and fixing backport conflicts easier.
they make viewing and navigating changes easier.  but,
you need to use very short scripts (which i'm happy to
create and maintain) to tag and pull -- doesn't seem like
much of a price to pay to me...

arthur


From mshefty at ichips.intel.com  Tue Jul 24 16:58:29 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 16:58:29 -0700
Subject: [ofa-general] QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A54659.8010608@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
Message-ID: <46A69225.9090502@ichips.intel.com>

Steve,

Do you have any input with respect to how the RDMA CM selects and maps 
QoS (priority, traffic class, VLAN, flow label, etc.)?  (See below)

Hide the QoS selection under the current interface?  Use the IPv6 
flowinfo field?  Rely on destination port?  Input QoS through existing 
or new call?  Handle IPv4 and IPv6 addresses differently?  ???

- Sean

>> 2.6. ULPs and programs using CMA to establish RC connection should 
>> provide the CMA the target IP and Service-ID. Some of the ULPs might
>> also provide QoS-Class (E.g. for SDP sockets that are provided the
>> TOS socket option). The CMA should then use the provided Service-ID
>> and optional QoS-Class and pass them in the PR/MPR request. The
>> resulting PR/MPR should be used for configuring the connection QP.
> 
> The interface to the CMA needs to remain as transport independent as 
> possible, and I am unsure of the transport independence of tying QoS to 
> the destination port number.  (I'm not disagreeing; I'm just not sure at 
> the moment it's the right approach.)
> 
>> 5. CMA features
>> ----------------
>>
>> The CMA interface supports Service-ID through the notion of port
>> space as a prefix to the port_num which is part of the sockaddr
>> provided to rdma_resolve_addr(). What is missing is the explicit
>> request for a QoS-Class that should allow the ULP (like SDP) to
>> propagate a specific request for a class of service. A mechanism for
>> providing the QoS-Class is available in the IPv6 address, so we could
>> use that address field. Another option is to implement a special 
>> connection options API for CMA.
>>
>> Missing functionality by CMA is the usage of the provided QoS-Class
>> and Service-ID in the sent PR/MPR. When a response is obtained it is
>> an existing requirement for the CMA to use the PR/MPR from the
>> response in setting up the QP address vector.
> 
> The most natural function to specify additional QoS parameters would be 
> rdma_resolve_route.
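
As a rough illustration of the "IPv6 flowinfo" option raised above, the sketch below packs and unpacks an 8-bit class in sin6_flowinfo. It assumes the Linux layout of that field (network byte order, 8 bits of traffic class above a 20-bit flow label) and treats the traffic class as the QoS-Class, which is only one possible mapping. The helpers `qos_class_to_flowinfo()` and `qos_class_from_flowinfo()` are hypothetical and are not part of the RDMA CM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical helper: read the 8-bit traffic class out of sin6_flowinfo. */
static uint8_t qos_class_from_flowinfo(const struct sockaddr_in6 *sin6)
{
	assert(sin6 != NULL);
	/* Host order: bits 27..20 traffic class, bits 19..0 flow label. */
	return (uint8_t)((ntohl(sin6->sin6_flowinfo) >> 20) & 0xff);
}

/* Hypothetical helper: store a class, leaving the flow label untouched. */
static void qos_class_to_flowinfo(struct sockaddr_in6 *sin6, uint8_t tclass)
{
	assert(sin6 != NULL);
	uint32_t fi = ntohl(sin6->sin6_flowinfo);

	fi &= ~(UINT32_C(0xff) << 20);	/* clear the old traffic class */
	fi |= (uint32_t)tclass << 20;	/* insert the new one */
	sin6->sin6_flowinfo = htonl(fi);
}
```

A ULP could fill the destination sockaddr_in6 this way before address resolution; IPv4 addresses have no equivalent field, which is one reason the question of handling IPv4 and IPv6 differently comes up.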


From sashak at voltaire.com  Tue Jul 24 17:18:48 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 03:18:48 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
References: <f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
Message-ID: <20070725001847.GG25264@sashak.voltaire.com>

On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> 
> 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> 
> 		Maybe  avoid the log if -y is provided?
> 
> 	 
> 	That avoids the spew but the duplicated GUID is important to
> know so IMO something in the "middle" is needed where duplicated GUIDs
> are logged but not continually the same ones.
> 	[EZ]  
> 	OK so in -y mode only we track which ones were reported and do
> not repeat the log?

And how should the port-moving problem be solved?

We cannot ask a user to run OpenSM with '-y' just to decrease logging if
she or he plans to reconnect some ports in the future.

Sasha
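
The "middle" option discussed above (log each duplicated GUID once instead
of on every sweep) could look roughly like the sketch below. This is not
OpenSM code: dup_guid_tracker, dup_guid_should_log(), dup_guid_reset() and
the fixed-size table are hypothetical names chosen for illustration, and the
sketch does not address the port-moving question raised above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only; none of these names exist in OpenSM. */
#define DUP_GUID_MAX 64

struct dup_guid_tracker {
	uint64_t reported[DUP_GUID_MAX]; /* GUIDs already logged once */
	size_t count;
};

/*
 * Return true the first time a duplicated GUID is seen (the caller should
 * log it) and false on later sightings. When the table is full, always
 * return true so that a duplicate is never silently dropped.
 */
static bool dup_guid_should_log(struct dup_guid_tracker *t, uint64_t guid)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->count; i++)
		if (t->reported[i] == guid)
			return false;

	if (t->count < DUP_GUID_MAX)
		t->reported[t->count++] = guid;
	return true;
}

/* Forget everything, e.g. when the fabric is rediscovered from scratch. */
static void dup_guid_reset(struct dup_guid_tracker *t)
{
	assert(t != NULL);
	t->count = 0;
}
```

A caller would wrap its existing log statement in
"if (dup_guid_should_log(&tracker, guid))" and decide separately when a
reset is appropriate.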


From krkumar2 at in.ibm.com  Tue Jul 24 19:41:06 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 25 Jul 2007 08:11:06 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <1185305300.26013.152.camel@localhost>
Message-ID: <OFE53CDE66.2AE8CF6C-ON65257323.000E7C20-65257323.000EBFDA@in.ibm.com>

Jamal,

This is silly. I am not responding to this type of presumptuous and
insulting mail.

Regards,

- KK

J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/25/2007 12:58:20 AM:

> KK,
>
> On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:
>
> >
> > J Hadi Salim <j.hadi123 at gmail.com> wrote on 07/23/2007 06:02:01 PM:
>
>
> > Actually you have not sent netperf results with prep and without prep.
>
> My results were based on pktgen (which i explained as testing the
> driver). I think depending on netperf without further analysis is
> simplistic. It was like me doing forwarding tests on these patches.
>
> > > So _which_ non-LLTX driver doesnt do that? ;->
> >
> > I have no idea since I haven't looked at all drivers. Can you tell which
> > all non-LLTX drivers does that ? I stated this as the sole criteria.
>
> The few i have peeked at all do it. I also think the e1000 should be
> converted to be non-LLTX. The rest of netdev is screaming to kill LLTX.
>
> > > tun driver doesnt use it either - but i doubt that makes it "bloat"
> >
> > Adding extra code that is currently not usable (esp from a submission
> > point) is bloat.
>
> So far i have converted 3 drivers, 1 of them doesnt use it. Two more
> driver conversions are on the way, they will both use it. How is this
> bloat again?
> A few emails back you said if only IPOIB can use batching then thats
> good enough justification.
>
> > > You waltz in, have the luxury of looking at my code, presentations, many
> > > discussions with me etc ...
> >
> > "luxury" ?
> > I had implemented the entire thing even before knowing that you
> > are working on something similar! and I had sent the first proposal to
> > netdev,
>
> I saw your patch at the end of may (or at least 2 weeks after you said
> it existed). That patch has very little resemblance to what you just
> posted conceptwise or codewise. I could post it if you would give me
> permission.
>
> > *after* which you told that you have your own code and presentations (which
> > I had never seen earlier - I joined netdev a few months back, earlier I was
> > working on RDMA, Infiniband as you know).
>
> I am gonna assume you didnt know of my work - which i have been making
> public for about 3 years. In fact i talked about this topic when i
> visited your office in 2006 on a day you were not present, so it is
> plausible you didnt hear of it.
>
> >  And it didn't give me any great
> > ideas either, remember I had posted results for E1000 at the time of
> > sending the proposals.
>
> In mid-June you sent me a series of patches which included anything from
> changing variable names to combining qdisc_restart and about everything
> i referred to as being "cosmetic differences" in your posted patches. I
> took two of those and incorporated them in. One was an "XXX" in my code
> already to allocate the dev->blist
> (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63).
> The other one was a mechanical removal of the blist being passed
> (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757).
> Some of the others i asked you to defer. For example, the reason i gave
> you for not merging any qdisc_restart_combine changes is because i was
> waiting for Dave to swallow the qdisc_restart changes i made; otherwise
> maintenance becomes extremely painful for me.
> Sridhar actually provided a lot more valuable comments and fixes but has
> not planted a flag on behalf of the queen of spain like you did.
>
> > However I do give credit in my proposal to you for what
> > ideas that you provided (without actual code), and the same I did for other
> > people who did the same, like Dave, Sridhar. BTW, you too had discussions with me,
> > and I sent some patches to improve your code too,
>
> I incorporated two of your patches and asked for deferral of others.
> These patches have now shown up in what you claim as "the difference". I
> just call them "cosmetic difference" not to downplay the importance of
> having an ethtool interface but because they do not make batching
> perform any better. The real differences are those two items. I am
> surprised you havent cannibalized those changes as well. I thought you
> renamed them to something else; according to your posting:
> "This patch will work with drivers updated by Jamal, Matt & Michael Chan
> with minor modifications - rename xmit_win to xmit_slots & rename batch
> handler". Or maybe thats a "future plan" you have in mind?
>
> > so it looks like a two
> > way street to me (and that is how open source works and should).
>
> Open source is a lot more transparent than that.
>
> You posted a question, which was part of your research. I responded and
> told you i have patches; you asked me for them and i promptly ported
> them from pre-2.6.18 to the latest kernel at the time.
>
> The nature of this batching work is one of performance. So numbers are
> important. If you had some strong disagreements on something in the
> architecture, then it would be of great value to explain it in a
> technical detail - and more importantly to provide some numbers to say
> why it is a bad idea. You get numbers by running some tests.
> You did none of the above. Your effort has been to produce "your patch"
> for whatever reasons. This would not have been problematic to me if it
> actually was based within reasons of optimization because the end goal
> would have been achieved.
>
> I have deleted the rest of the email because it goes back and forth on
> the same points.
>
> I am gonna continue work on the current tree i have. I will put more
> time when i get back next week (and hopefully no travel right after).
> I will upgrade to Daves tree later when i get the two new drivers in. I
> am probably gonna hold on until the new NAPI stuff settles in first. You
> are welcome to  submit the ipoib changes in. You are also welcome to
> co-author with me but you will have to work for it this time.
>
> cheers,
> jamal
>


From kliteyn at mellanox.co.il  Tue Jul 24 21:03:20 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 25 Jul 2007 07:03:20 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-25:normal completion
Message-ID: <MTLEXCH018yfogkJXrR00000560@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=520  Pass=520  Fail=0
 
 
Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From krkumar2 at in.ibm.com  Tue Jul 24 21:40:01 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 25 Jul 2007 10:10:01 +0530
Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
In-Reply-To: <OFFAB073E4.487DACF3-ON65257321.00101BE3-65257321.00104971@LocalDomain>
Message-ID: <OF5F147E87.14E3EA55-ON65257323.0018FC99-65257323.0019A2EE@in.ibm.com>

Hi Patrick,

Krishna Kumar2/India/IBM wrote on 07/23/2007 08:27:53 AM:

> Hi Patrick,
>
> Patrick McHardy <kaber at trash.net> wrote on 07/22/2007 10:36:51 PM:
>
> > Krishna Kumar wrote:
> > > @@ -472,6 +474,9 @@ struct net_device
> > >     void         *priv;   /* pointer to private data   */
> > >     int         (*hard_start_xmit) (struct sk_buff *skb,
> > >                        struct net_device *dev);
> > > +   int         (*hard_start_xmit_batch) (struct net_device
> > > +                       *dev);
> > > +
> >
> >
> > Is this function really needed? Can't you just call hard_start_xmit with
> > a NULL skb and have the driver use dev->blist?

> Probably not. I will see how to do it this way and get back to you.

I think this is a good idea and makes code everywhere simpler. I
will try this change and test to make sure it doesn't have any
negative impact. Will most likely send out rev3 tomorrow.

Thanks,

- KK
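
Patrick's suggestion (one transmit hook, with a NULL skb meaning "drain
dev->blist") can be sketched as below. The structs are simplified stand-ins,
not the kernel's sk_buff or net_device, and toy_hw_send() and
toy_hard_start_xmit() are made-up names; only the blist field name and the
NULL-skb convention come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
struct sk_buff {
	struct sk_buff *next;
	int len;
};

struct net_device {
	struct sk_buff *blist; /* batch list, as named in the thread */
	int tx_packets;        /* toy counter standing in for the hardware */
};

/* Pretend to hand one packet to the hardware. */
static void toy_hw_send(struct net_device *dev, struct sk_buff *skb)
{
	(void)skb;
	dev->tx_packets++;
}

/*
 * Single transmit entry point. A non-NULL skb is sent on its own;
 * a NULL skb means "send everything queued on dev->blist".
 * Returns the number of packets sent by this call.
 */
static int toy_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	int sent = 0;

	assert(dev != NULL);

	if (skb != NULL) {
		toy_hw_send(dev, skb);
		return 1;
	}

	while (dev->blist != NULL) {
		struct sk_buff *cur = dev->blist;

		dev->blist = cur->next; /* unlink before sending */
		cur->next = NULL;
		toy_hw_send(dev, cur);
		sent++;
	}
	return sent;
}
```

With this shape a separate hard_start_xmit_batch pointer is unnecessary, at
the cost of every driver having to tolerate a NULL skb argument.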


From ogerlitz at voltaire.com  Tue Jul 24 23:44:19 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 25 Jul 2007 09:44:19 +0300
Subject: [ofa-general] PATCH] IB/core: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <46A453BE.3030408@gmail.com>
References: <46A36E77.5020307@gmail.com> <46A453BE.3030408@gmail.com>
Message-ID: <46A6F143.2040805@voltaire.com>

Moni Shoua wrote:

> IPoIB turns on the P_Key membership bit of limited membership P_Keys
> when creating a child interface. After that IPoIB looks for the full
> membership P_key in the table to make the interface "RUNNING". This 
> patch fixes the pkey lookup in order to match full and partial membership 
> keys that belong to the same partition.

Roland,

Can you please comment on the patch? The bug exists in 2.6.22 and
2.6.23-rc1 (and also in OFED 1.2). Once you accept this, we want to push
it to -stable as well.

Or.
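
For readers without the patch at hand, the lookup rule it describes can be
sketched as follows. This is not the patch itself: toy_find_pkey() and the
flat table are assumptions made for illustration. The only fact relied on is
that a P_Key is 16 bits wide, with the top bit carrying membership (set =
full, clear = limited) and the low 15 bits identifying the partition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_MEMBERSHIP_BIT 0x8000u /* set = full member, clear = limited */
#define PKEY_BASE_MASK      0x7fffu /* low 15 bits identify the partition */

/*
 * Toy lookup over an in-memory table: return the index of the first entry
 * in the same partition as pkey, regardless of membership, or -1 if the
 * partition is not present.
 */
static int toy_find_pkey(const uint16_t *table, size_t len, uint16_t pkey)
{
	assert(table != NULL || len == 0);

	for (size_t i = 0; i < len; i++)
		if ((table[i] & PKEY_BASE_MASK) == (pkey & PKEY_BASE_MASK))
			return (int)i;
	return -1;
}
```

With a table holding the limited key 0x0001, a lookup for the full key
0x8001 (what IPoIB asks for after turning the membership bit on) now
succeeds instead of leaving the child interface not "RUNNING".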


From ogerlitz at voltaire.com  Tue Jul 24 23:45:16 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 25 Jul 2007 09:45:16 +0300
Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE
	buffer ownership relaxation]
Message-ID: <46A6F17C.8060404@voltaire.com>

Hi Roland,

It seems that you have missed this patch, can you have a look?

Or.
-------------- next part --------------
An embedded message was scrubbed...
From: Or Gerlitz <ogerlitz at voltaire.com>
Subject: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer	ownership relaxation
Date: Wed, 11 Jul 2007 09:22:43 +0300 (IDT)
Size: 4119
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070725/fad3d41f/attachment.mht>

From ogerlitz at voltaire.com  Wed Jul 25 00:00:28 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 25 Jul 2007 10:00:28 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A628D8.4050109@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
Message-ID: <46A6F50C.5000906@voltaire.com>

Sean Hefty wrote:
>> Linux has a quite sophisticated mechanism to maintain / cache / probe 
>> / invalidate / update the network stack L2 neighbour info.

> Path records are not just L2 info.  They contain L4, L3, and L2 info 
> together.

Maybe I was not clear enough: the neighbours cache keeps the stack Link 
(=L2) level info. The "IPoIB L2 info" (the neighbour HW address) 
contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info.

So bottom line, the stack considers the <flags|gid|qpn> creature as L2
info whereas in IB terms it contains L4/L3/L2 info.

>> For example, in the Voltaire gen1 stack we had an ib arp module which 
>> was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). 
>> This module managed some sort of path cache, where IPoIB was always
>> asking for non-cached path and other ULPs were willing to get cached 
>> path.

> IMO, using a cached AH is no different than using a cached path.  You're 
> simply mapping the PR data into another structure.

On the one hand the stack can't afford to do L3 --> L2 (ARP)
resolving for each packet xmit, but on the other hand the stack has this
mechanism to probe / invalidate / etc its L2 cache. So my basic claim is
that if the stack decided to renew its L2 info, it would be an incorrect
design to use cached IB L2 info.

> We're ignoring the problem here, and that is that a centralized SA 
> doesn't scale.  MPI stacks have largely ignored this problem by simply 
> not doing path record queries.  Path information is often hard-coded, 
> with QPN data exchanged out of band over sockets (often over Ethernet).

I don't think that trying to separate IPoIB flow from MPI flow is
ignoring the problem. These are different settings: IPoIB is a network
device working under the net stack, which has its own design philosophy.
Native MPI implementations over IB are not tied to the stack, so the
situation is different.

> We've seen problems running large MPI jobs without PR caching.  I know 
> that Silverstorm/QLogic did as well.  And apparently Voltaire hit the 
> same type of problem, since you added a caching module.  (Did Mellanox 
> and Topspin/Cisco create PR caches as well?)  At least three companies 
> working on IB came up with the same solution.  What is the objection to 
> the current patch set?

Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not--
using cached IB L2 info whereas MPI, Lustre etc did.

I am willing to go with the local SA serving large MPI jobs, so that
you load it as a prerequisite to spawning a large all-to-all job.

But, I think the default for IPoIB needs to be usage of non cached PR.

If you want to support the non-common case of huge-mpi-job-over-ipoib, I 
am fine with adding a param to IPoIB telling it to request cached PR 
from the ib_sa module.

Or.


From mst at dev.mellanox.co.il  Wed Jul 25 00:27:23 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 10:27:23 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724221950.GP16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
Message-ID: <20070725072723.GA32499@mellanox.co.il>

> > - A single git reset ORIG_HEAD recovers from a conflicting merge
> 
> handling conflicts is a big part of a maintainer's job!

Because you are a driver maintainer.
That's what's different here from regular merge.
Please understand: we have upstream code and we have changes against it.

Upstream code is golden. If some patch conflicts with it,
it is always this patch that needs to be fixed.
And I want the ability to bounce that job to the patch author -
I simply do not know enough about e.g. ehca.

> also, if the upstream
> changes touch code that conflicts with a backport
> patch, you get to fix the problem as it happens

That's exactly the thing that I do not want to do.

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 00:34:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 10:34:35 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724221950.GP16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
	<000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
Message-ID: <20070725073435.GB32499@mellanox.co.il>

> > - A single git reset ORIG_HEAD recovers from a conflicting merge
> 
> if, though, you must have a pristine environment,
> this is easily solved by using an intermediate repo:
> 
> git clone -s <canonical repo>
> <run the pull>
> <any conflicts, dump this guy, otherwise, pull this in>

Ah, you now see how git reset is broken.
What about git rebase? Broken too I'm afraid.
Anything that rewrites history is.

> i bet this is very similar time-wise to running the
> merge, then the ofed_scripts/configure over all supported
> branches.  merges in git are _fast_...

Full tree checkout is slow though.

> > - A single tag tags all code for all kernels
> 
> store commit ids in a file and tag that?

This trick breaks some more git utilities.
E.g. git describe, git web displaying tags ...

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 00:46:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 10:46:38 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for
	rhel-4.5 - mmap functonality
In-Reply-To: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
References: <200705101628.43095.ossrosch@linux.vnet.ibm.com>
Message-ID: <20070725074638.GA1581@mellanox.co.il>

> Quoting Stefan Roscher <ossrosch at linux.vnet.ibm.com>:
> Subject: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality
> 
> 
> 
> Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
> ---
> backport_ehca_2_rhel45_umap.patch |  850 ++++++++++++++++++++++++++++++++++++++
> 1 files changed, 850 insertions(+)

Guys,
	I have updated the ofed_kernel (destined for OFED 1.3)
	kernel tree to 2.6.23-rc1, and this patch no longer applies.

	The conflicts aren't trivial (e.g. there's been ABI change).

	I moved it to kernel_patches/attic for now.

	Could you please take a look and update the patch for that tree?

	The updated code is here:
	
	git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel

	I expect Vlad'll pull it soon, too.

-- 
MST


From gdror at dev.mellanox.co.il  Wed Jul 25 01:22:58 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Wed, 25 Jul 2007 11:22:58 +0300
Subject: [ofa-general] 20% latency increase between UD to RC latency
In-Reply-To: <Pine.LNX.4.64.0707231241550.4283@zuben>
References: <Pine.LNX.4.64.0707231241550.4283@zuben>
Message-ID: <46A70862.2030808@dev.mellanox.co.il>

Or Gerlitz wrote:
> OK, it's always good to start with facts on the ground... before
> committing this test, my original thinking was that for messages
> whose size=X is less than the IB Link level MTU it holds that:
>
> 	latency(X,UD) <= latency(X,UC) <= latency(X,RC)
>
> Running the latency test provided with the perftest package on my systems (*)
> I get the below results. Does anyone have insight why the --minimal-- and typical
> UD latency is 1us ( = 20%) worse than the --minimal-- and typical RC latency???
>
> Or.
>
>   

In all devices that support the memfree architecture (InfiniHost III-Ex 
running at memfree mode, InfiniHost III Lx and ConnectX) you will find 
the performance of UD comparable to the RC/UC. With the introduction of 
the memfree architecture, we are focused on development and performance 
optimization for this architecture. That is the main reason for memfree 
to achieve better performance.

-Dror


From vlad at lists.openfabrics.org  Wed Jul 25 02:13:27 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 25 Jul 2007 02:13:27 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070725-0100 daily build status
Message-ID: <20070725091327.E3DA8E603CA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From vlad at lists.openfabrics.org  Wed Jul 25 03:06:28 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 25 Jul 2007 03:06:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070725-0200 daily build status
Message-ID: <20070725100628.A5CD4E603A1@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.22
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From krkumar2 at in.ibm.com  Wed Jul 25 03:33:23 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 25 Jul 2007 16:03:23 +0530
Subject: [ofa-general] Question on IPoIB start xmit
Message-ID: <OF6BB9265C.45F6C80B-ON65257323.00395244-65257323.0039FD17@in.ibm.com>


Hi all,

For batching, I modified ipoib_start_xmit() to send out multiple skbs, and
currently what I do is to always send skbs in the same order in which they
were sent from the ULPs. E.g., if the following is the order of skbs sent
from above:

Good skb1, Pathlookup skb2, Good skb3, Good skb4, Mcast send skb5, Good skb6,
      Good skb7, Unicast arp send8, Good skb9

I make sure that xmits are done in the same order. Is there any issue in
sending out in this order:

Pathlookup skb2, Mcast send skb5, Unicast arp send8, Good skb1, Good skb3,
      Good skb4, Good skb6, Good skb7, Good skb9

Or is there any requirement or logic that will break unless skbs are sent
in the same order in which they were received from the ULP?

Thanks,

- KK
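
To make the transformation in the question concrete: the second ordering is
a stable partition of the first, with every non-"Good" skb moved to the
front. The sketch below uses made-up types (toy_skb, toy_skb_kind) and a
made-up function toy_reorder(); it is not IPoIB code and says nothing about
whether the reordering is safe, which is the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the skbs named in the question. */
enum toy_skb_kind { TOY_GOOD, TOY_PATH_LOOKUP, TOY_MCAST, TOY_UNICAST_ARP };

struct toy_skb {
	int id;
	enum toy_skb_kind kind;
};

/*
 * Stable partition: copy the non-"good" skbs to out first, then the
 * "good" ones, preserving the relative order inside each group.
 * out must have room for n entries and must not overlap in.
 */
static void toy_reorder(const struct toy_skb *in, size_t n,
			struct toy_skb *out)
{
	size_t k = 0;

	assert(in != NULL && out != NULL);

	for (size_t i = 0; i < n; i++)
		if (in[i].kind != TOY_GOOD)
			out[k++] = in[i];
	for (size_t i = 0; i < n; i++)
		if (in[i].kind == TOY_GOOD)
			out[k++] = in[i];
}
```

Order is preserved within each group but not across groups, so skb1 now
leaves after skb2, skb5 and send8 even though it was queued before them.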


From krkumar2 at in.ibm.com  Wed Jul 25 03:35:54 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 25 Jul 2007 16:05:54 +0530
Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document
	IBV_SEND_INLINE	buffer ownership relaxation]
In-Reply-To: <46A6F17C.8060404@voltaire.com>
Message-ID: <OF34C06821.3F7F83BE-ON65257323.003A1DCB-65257323.003A3833@in.ibm.com>

> + *
> + * if IBV_SEND_INLINE flag is set, the data buffers can be reused immediately
> + * after the call returns - low level libraries must confirm to this rule.
>   */

Maybe change "confirm to" to "conform to" ?

Thanks,

- KK

general-bounces at lists.openfabrics.org wrote on 07/25/2007 12:15:16 PM:

> Hi Roland,
>
> It seems that you have missed this patch, can you have a look?
>
> Or.
>
> ----- Message from Or Gerlitz <ogerlitz at voltaire.com> on Wed, 11 Jul 2007 09:22:43 +0300 (IDT) -----
>
> To:
>
> Roland Dreier <rdreier at cisco.com>
>
> cc:
>
> Alex Rosenbaum <alexr at voltaire.com>, general at lists.openfabrics.org
>
> Subject:
>
> [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation
>
> if the IBV_SEND_INLINE flag is set in the WR provided to ibv_post_send,
> the data buffers can be reused immediately after the call returns,
> document this.
>
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>
> Index: libibverbs/include/infiniband/verbs.h
> ===================================================================
> --- libibverbs.orig/include/infiniband/verbs.h
> +++ libibverbs/include/infiniband/verbs.h
> @@ -989,6 +989,9 @@ int ibv_destroy_qp(struct ibv_qp *qp);
>
>  /**
>   * ibv_post_send - Post a list of work requests to a send queue.
> + *
> + * if IBV_SEND_INLINE flag is set, the data buffers can be reused immediately
> + * after the call returns - low level libraries must confirm to this rule.
>   */
>  static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
>              struct ibv_send_wr **bad_wr)
> Index: libibverbs/man/ibv_post_send.3
> ===================================================================
> --- libibverbs.orig/man/ibv_post_send.3
> +++ libibverbs/man/ibv_post_send.3
> @@ -109,7 +109,9 @@ behavior.
>  .PP
>  The buffers used by a WR can only be safely reused after WR the
>  request is fully executed and a work completion has been retrieved
> -from the corresponding completion queue (CQ).
> +from the corresponding completion queue (CQ). However, if the
> +IBV_SEND_INLINE flag was set, the buffer can be reused immediately
> +after the call returns.
>  .SH "SEE ALSO"
>  .BR ibv_create_qp (3),
>  .BR ibv_create_ah (3),
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


From mst at dev.mellanox.co.il  Wed Jul 25 04:09:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 14:09:08 +0300
Subject: [ofa-general] [PATCH trivial] include linux/mutex.h from
	scsi_transport_iscsi.h
Message-ID: <20070725110907.GF3826@mellanox.co.il>

scsi/scsi_transport_iscsi.h uses struct mutex, so while
linux/mutex.h seems to be pulled in indirectly
by one of the headers it includes, the right thing
is to include linux/mutex.h directly.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h
index 706c0cd..7530e98 100644
--- a/include/scsi/scsi_transport_iscsi.h
+++ b/include/scsi/scsi_transport_iscsi.h
@@ -24,6 +24,7 @@
 #define SCSI_TRANSPORT_ISCSI_H
 
 #include <linux/device.h>
+#include <linux/mutex.h>
 #include <scsi/iscsi_if.h>
 
 struct scsi_transport_template;


-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 04:32:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 14:32:10 +0300
Subject: [ofa-general] add_open_iscsi_h.patch
Message-ID: <20070725113210.GG3826@mellanox.co.il>

Erez, add_open_iscsi_h currently does:

-#include <scsi/iscsi_if.h>
+#include "iscsi_if.h"

why is this bit needed?


-- 
MST


From erezz at voltaire.com  Wed Jul 25 05:56:55 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 25 Jul 2007 15:56:55 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <20070725113210.GG3826@mellanox.co.il>
References: <20070725113210.GG3826@mellanox.co.il>
Message-ID: <46A74897.6070903@voltaire.com>

Michael S. Tsirkin wrote:

> Erez, add_open_iscsi_h currently does:
>
> -#include <scsi/iscsi_if.h>
> +#include "iscsi_if.h"
>
> why is ths bit needed?
>

Strange. I remember that I couldn't build OFED 1.2 without it in the
past. I tried to rebuild it without this now, and it compiles
successfully, so let's remove that code.

Erez


From mst at dev.mellanox.co.il  Wed Jul 25 06:09:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 16:09:47 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <46A74897.6070903@voltaire.com>
References: <20070725113210.GG3826@mellanox.co.il>
	<46A74897.6070903@voltaire.com>
Message-ID: <20070725130947.GA19872@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: add_open_iscsi_h.patch
> 
> Michael S. Tsirkin wrote:
> 
> > Erez, add_open_iscsi_h currently does:
> >
> > -#include <scsi/iscsi_if.h>
> > +#include "iscsi_if.h"
> >
> > why is ths bit needed?
> >
> 
> Strange. I remember that I couldn't build OFED 1.2 without it in the
> past. I tried to rebuild it without this now, and it compiles
> successfully, so let's remove that code.

On a related note:

 #include <linux/types.h>
 #include <linux/mutex.h>
-#include <linux/timer.h>
-#include <linux/workqueue.h>
 #include <scsi/iscsi_proto.h>
 #include <scsi/iscsi_if.h>


should not be needed either.


And how come this is helpful?

@@ -277,7 +277,6 @@ enum iscsi_param {
  * These flags describes reason of stop_conn() call
  */
 #define STOP_CONN_TERM         0x1
-#define STOP_CONN_SUSPEND      0x2
 #define STOP_CONN_RECOVER      0x3

 #define ISCSI_STATS_CUSTOM_MAX         32

In other words, is there a chance we can kill this patch completely?

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 06:29:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 16:29:05 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <46A74897.6070903@voltaire.com>
References: <20070725113210.GG3826@mellanox.co.il>
	<46A74897.6070903@voltaire.com>
Message-ID: <20070725132905.GD19872@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: add_open_iscsi_h.patch
> 
> Michael S. Tsirkin wrote:
> 
> > Erez, add_open_iscsi_h currently does:
> >
> > -#include <scsi/iscsi_if.h>
> > +#include "iscsi_if.h"
> >
> > why is ths bit needed?
> >
> 
> Strange. I remember that I couldn't build OFED 1.2 without it in the
> past. I tried to rebuild it without this now, and it compiles
> successfully, so let's remove that code.

OK, I killed these patches completely and things still build fine.
Vlad, please pull my tree into ofed_kernel.

-- 
MST


From erezz at voltaire.com  Wed Jul 25 06:37:31 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 25 Jul 2007 16:37:31 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <20070725132905.GD19872@mellanox.co.il>
References: <20070725113210.GG3826@mellanox.co.il><46A74897.6070903@voltaire.com>
	<20070725132905.GD19872@mellanox.co.il>
Message-ID: <46A7521B.7010402@voltaire.com>

Michael S. Tsirkin wrote:

> > Quoting Erez Zilber <erezz at voltaire.com>:
> > Subject: Re: add_open_iscsi_h.patch
> >
> > Michael S. Tsirkin wrote:
> >
> > > Erez, add_open_iscsi_h currently does:
> > >
> > > -#include <scsi/iscsi_if.h>
> > > +#include "iscsi_if.h"
> > >
> > > why is ths bit needed?
> > >
> >
> > Strange. I remember that I couldn't build OFED 1.2 without it in the
> > past. I tried to rebuild it without this now, and it compiles
> > successfully, so let's remove that code.
>
> OK, I killed these patches completely and things still build fine.
> Vlad, please pull my tree into ofed_kernel.
>
Yes, it also works for me. I guess that these are all leftovers.


Erez


From mst at dev.mellanox.co.il  Wed Jul 25 06:46:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 16:46:00 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <46A7521B.7010402@voltaire.com>
References: <20070725132905.GD19872@mellanox.co.il>
	<46A7521B.7010402@voltaire.com>
Message-ID: <20070725134600.GE19872@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: add_open_iscsi_h.patch
> 
> Michael S. Tsirkin wrote:
> 
> > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > Subject: Re: add_open_iscsi_h.patch
> > >
> > > Michael S. Tsirkin wrote:
> > >
> > > > Erez, add_open_iscsi_h currently does:
> > > >
> > > > -#include <scsi/iscsi_if.h>
> > > > +#include "iscsi_if.h"
> > > >
> > > > why is ths bit needed?
> > > >
> > >
> > > Strange. I remember that I couldn't build OFED 1.2 without it in the
> > > past. I tried to rebuild it without this now, and it compiles
> > > successfully, so let's remove that code.
> >
> > OK, I killed these patches completely and things still build fine.
> > Vlad, please pull my tree into ofed_kernel.
> >
> Yes, it also works for me. I guess that these are all leftovers.

Deleted. Hmm. Do we want to kill them in 1.2.c too?

-- 
MST


From erezz at voltaire.com  Wed Jul 25 06:55:05 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 25 Jul 2007 16:55:05 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <20070725134600.GE19872@mellanox.co.il>
References: <20070725132905.GD19872@mellanox.co.il><46A7521B.7010402@voltaire.com>
	<20070725134600.GE19872@mellanox.co.il>
Message-ID: <46A75639.7050308@voltaire.com>

Michael S. Tsirkin wrote:

> > Quoting Erez Zilber <erezz at voltaire.com>:
> > Subject: Re: add_open_iscsi_h.patch
> >
> > Michael S. Tsirkin wrote:
> >
> > > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > > Subject: Re: add_open_iscsi_h.patch
> > > >
> > > > Michael S. Tsirkin wrote:
> > > >
> > > > > Erez, add_open_iscsi_h currently does:
> > > > >
> > > > > -#include <scsi/iscsi_if.h>
> > > > > +#include "iscsi_if.h"
> > > > >
> > > > > why is ths bit needed?
> > > > >
> > > >
> > > > Strange. I remember that I couldn't build OFED 1.2 without it in the
> > > > past. I tried to rebuild it without this now, and it compiles
> > > > successfully, so let's remove that code.
> > >
> > > OK, I killed these patches completely and things still build fine.
> > > Vlad, please pull my tree into ofed_kernel.
> > >
> > Yes, it also works for me. I guess that these are all leftovers.
>
> Deleted. Hmm. Do we want to kill them in 1.2.c too?
>
Yes (why not?)


Erez


From mst at dev.mellanox.co.il  Wed Jul 25 07:11:41 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 17:11:41 +0300
Subject: [ofa-general] ANNOUNCE: ofed kernel build updates
Message-ID: <20070725141141.GG19872@mellanox.co.il>

Hi!
I'd like to announce a couple of updates that were recently made
to the build scripts on the ofed_kernel branch.
This is an attempt to answer repeated requests, aired at Sonoma,
to simplify access to kernel sources.

The idea is that a user of a supported kernel will just be able
to download an appropriate tarball and run with it without need for patching.

These changes are available from ofed_kernel git tree maintained by Vlad:
git://git.openfabrics.org/~vlad/ofed_kernel.git ofed_kernel

The code is mine, but the ideas mostly come from criticism
and code sent by Ira Weiny. Thanks, Ira!

Note that the changes were made in a backwards-compatible way,
so that existing scripts using configure/make will continue working.

What's new:

1. New script ofed_scripts/ofed_patch.sh
   This will apply fixes and backport patches for a specific
   kernel to the current tree.
   Usage:
   ./ofed_scripts/ofed_patch.sh --with-backport=VERSION

   This makes it possible for distro vendors to generate
   a tarball pre-patched for a specific kernel.

2. New script ofed_scripts/ofed_makedist.sh
   This script repeatedly clones the current repository,
   runs ofed_scripts/ofed_patch.sh,
   and then builds tarballs of ofed kernel source pre-patched
   for supported kernel versions.

   I plan to work with Vlad to run this script as part of
   nightly builds, so that prepatched tarballs will become
   available for download.

3. configure script made re-entrant
   configure script does not apply patches anymore:
   all it does is create configure.mk.kernel and autoconf.h files.

   This finally makes it possible to change
   configuration parameters just by re-running configure.

   For backwards-compatibility, if configure detects
   that ofed_scripts/ofed_patch.sh was not run yet,
   it prints a warning and runs it automatically.
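
A minimal sketch of the workflow described in the three items above, for
someone building from the git tree rather than from a prepatched tarball.
Assumptions not taken from this announcement: the backport name "2.6.18" is
only a placeholder, configure is assumed to be runnable from the top of the
tree, and the clone uses `-b` to select the ofed_kernel branch. The script
only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not run.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_prepatched VERSION
# VERSION must be a backport name that ofed_patch.sh knows about.
# Running ./configure from the top of the tree is an assumption.
build_prepatched() {
    run git clone -b ofed_kernel git://git.openfabrics.org/~vlad/ofed_kernel.git ofed_kernel &&
    run cd ofed_kernel &&
    run ./ofed_scripts/ofed_patch.sh "--with-backport=$1" &&
    run ./configure &&
    run make
}

# "2.6.18" is a placeholder, not a name confirmed by the announcement.
build_prepatched "${1:-2.6.18}"
```

Because configure no longer applies patches, the `./configure` step can be
repeated with different parameters without starting from a fresh tree.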

Feedback welcome.

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 07:12:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 17:12:35 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <46A75639.7050308@voltaire.com>
References: <20070725134600.GE19872@mellanox.co.il>
	<46A75639.7050308@voltaire.com>
Message-ID: <20070725141235.GH19872@mellanox.co.il>

> Quoting Erez Zilber <erezz at voltaire.com>:
> Subject: Re: add_open_iscsi_h.patch
> 
> Michael S. Tsirkin wrote:
> 
> > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > Subject: Re: add_open_iscsi_h.patch
> > >
> > > Michael S. Tsirkin wrote:
> > >
> > > > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > > > Subject: Re: add_open_iscsi_h.patch
> > > > >
> > > > > Michael S. Tsirkin wrote:
> > > > >
> > > > > > Erez, add_open_iscsi_h currently does:
> > > > > >
> > > > > > -#include <scsi/iscsi_if.h>
> > > > > > +#include "iscsi_if.h"
> > > > > >
> > > > > > why is ths bit needed?
> > > > > >
> > > > >
> > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the
> > > > > past. I tried to rebuild it without this now, and it compiles
> > > > > successfully, so let's remove that code.
> > > >
> > > > OK, I killed these patches completely and things still build fine.
> > > > Vlad, please pull my tree into ofed_kernel.
> > > >
> > > Yes, it also works for me. I guess that these are all leftovers.
> >
> > Deleted. Hmm. Do we want to kill them in 1.2.c too?
> >
> Yes (why not?)

Dunno. It's in bugfix-only mode after all. You decide.

-- 
MST


From erezz at voltaire.com  Wed Jul 25 07:33:27 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 25 Jul 2007 17:33:27 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <20070725141235.GH19872@mellanox.co.il>
References: <20070725134600.GE19872@mellanox.co.il><46A75639.7050308@voltaire.com>
	<20070725141235.GH19872@mellanox.co.il>
Message-ID: <46A75F37.1030700@voltaire.com>

Michael S. Tsirkin wrote:

> > Quoting Erez Zilber <erezz at voltaire.com>:
> > Subject: Re: add_open_iscsi_h.patch
> >
> > Michael S. Tsirkin wrote:
> >
> > > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > > Subject: Re: add_open_iscsi_h.patch
> > > >
> > > > Michael S. Tsirkin wrote:
> > > >
> > > > > > Quoting Erez Zilber <erezz at voltaire.com>:
> > > > > > Subject: Re: add_open_iscsi_h.patch
> > > > > >
> > > > > > Michael S. Tsirkin wrote:
> > > > > >
> > > > > > > Erez, add_open_iscsi_h currently does:
> > > > > > >
> > > > > > > -#include <scsi/iscsi_if.h>
> > > > > > > +#include "iscsi_if.h"
> > > > > > >
> > > > > > > why is ths bit needed?
> > > > > > >
> > > > > >
> > > > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the
> > > > > > past. I tried to rebuild it without this now, and it compiles
> > > > > > successfully, so let's remove that code.
> > > > >
> > > > > OK, I killed these patches completely and things still build fine.
> > > > > Vlad, please pull my tree into ofed_kernel.
> > > > >
> > > > Yes, it also works for me. I guess that these are all leftovers.
> > >
> > > Deleted. Hmm. Do we want to kill them in 1.2.c too?
> > >
> > Yes (why not?)
>
> Donnu. It's in bugfix-only mode after all. You decide.
>

OK. Let's do it for OFED 1.3 only. This is not really a bug fix.

Erez


From arthur.jones at qlogic.com  Wed Jul 25 07:43:58 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Wed, 25 Jul 2007 07:43:58 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070725073435.GB32499@mellanox.co.il>
References: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
	<20070725073435.GB32499@mellanox.co.il>
Message-ID: <20070725144358.GQ16727@bauxite.pathscale.com>

hi michael, ...

On Wed, Jul 25, 2007 at 10:34:35AM +0300, Michael S. Tsirkin wrote:
> > > - A single git reset ORIG_HEAD recovers from a conflicting merge
> > 
> > if, though, you must have a pristine environment,
> > this is easily solved by using an intermediate repo:
> > 
> > git clone -s <canonical repo>
> > <run the pull>
> > <any conflicts, dump this guy, otherwise, pull this in>
> 
> Ah, you now see how git reset is broken.
> What about git rebase? Broken too I'm afraid.
> Anything that rewrites history is.

no, git reset is not broken, nor is git rebase,
what i described is a way to import multiple
branches and allow you to backout easily if
_any_ of the pulls failed.  this is certainly
_not_ the only way.  you could also write a
script to capture the HEADS and then git reset
them if there were any issues.  this would be
faster, but more complicated.  there is _no_
loss in functionality.  i would be willing
to write and maintain this script for you
if you wanted to give it a try...
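
a rough sketch of what such a script could look like, using only stock git
commands (`git for-each-ref`, `git update-ref`, `git reset`). the function
names and the snapshot file are made up for illustration, and this is not
the script offered above. note that it only moves existing branches back;
it does not delete branches created after the snapshot.

```shell
# save_heads FILE: record "<sha> <refname>" for every local branch.
save_heads() {
    git for-each-ref --format='%(objectname) %(refname)' refs/heads > "$1"
}

# restore_heads FILE: put every recorded branch back on its saved commit,
# then resync the index and work tree with the current branch.
restore_heads() {
    while read -r sha ref; do
        git update-ref "$ref" "$sha" || return 1
    done < "$1"
    git reset -q --hard
}
```

typical use would be to call save_heads before running the pulls and
restore_heads with the same file if any of them fails.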

i think, if you tried the branches, you would
find that you wouldn't need to require all
branches to pull cleanly.  you would prob be
able to easily fixup the problem and continue
the merge...

> > i bet this is very similar time-wise to running the
> > merge, then the ofed_scripts/configure over all supported
> > branches.  merges in git are _fast_...
> 
> Full tree checkout is slow though.

yes.  it would prob be worth the effort to
capture the HEADS and replay them if you
really found that it was a requirement to
pull all branches cleanly or none at all...

> > > - A single tag tags all code for all kernels
> > 
> > store commit ids in a file and tag that?
> 
> This trick breaks some more git utilities.
> E.g. git describe, git web displaying tags ...

yes, i agree it's ugly.  tags are not nice
for multiple branches in a repo, do you know
if there is any movement in the git project
to work on this?

arthur


From mst at dev.mellanox.co.il  Wed Jul 25 07:47:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 17:47:51 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070725144358.GQ16727@bauxite.pathscale.com>
References: <20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
	<20070725073435.GB32499@mellanox.co.il>
	<20070725144358.GQ16727@bauxite.pathscale.com>
Message-ID: <20070725144751.GG29081@mellanox.co.il>

> > > > - A single tag tags all code for all kernels
> > > 
> > > store commit ids in a file and tag that?
> > 
> > This trick breaks some more git utilities.
> > E.g. git describe, git web displaying tags ...
> 
> yes, i agree it's ugly.  tags are not nice
> for multiple branches in a repo, do you know
> if there is any movement in the git project
> to work on this?

I don't think so.
As I said, when I posed the problem we have with fixes/backport,
people on list just told me "keep patches under git".

-- 
MST


From arthur.jones at qlogic.com  Wed Jul 25 07:52:23 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Wed, 25 Jul 2007 07:52:23 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070725072723.GA32499@mellanox.co.il>
References: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
	<20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
	<20070725072723.GA32499@mellanox.co.il>
Message-ID: <20070725145223.GR16727@bauxite.pathscale.com>

hi michael, ...

On Wed, Jul 25, 2007 at 10:27:23AM +0300, Michael S. Tsirkin wrote:
> > > - A single git reset ORIG_HEAD recovers from a conflicting merge
> > 
> > handling conflicts is a big part of a maintainer's job!
> 
> Because you are a driver maintainer.
> That's what's different here from regular merge.
> Please understand: we have upstream code and we have changes against it.

i am a driver maintainer, but i'm also maintaining
the ipath release which is OFED + qlogic specific
stuff.  i know the process that you go through to
make a release.  i've lived it now for 2 releases
of ipath software.

> Upstream code is golden. If some patch conflicts with it,
> it is always this patch that needs to be fixed.
> And I want the ability to bounce that job to the patch author -
> I simply do not know enough about e.g. ehca.

i agree, non-trivial merges should be bounced
to the patch author -- nothing about using backport
branches prevents or even makes this more difficult,
in fact, i have found it to be easier in git than
in dealing w/ patches because the environment where
the changes need to be made is much more comfortable
(git rather than quilt or some random patch stack)...

> > also, if the upstream
> > changes touch code that conflicts with a backport
> > patch, you get to fix the problem as it happens
> 
> That's exactly the thing that I do not want to do.

you don't want to know about a problem with a patch
until days or weeks later when the auto build
keeps failing and you don't know why?  it is
easy to catch many problems _before_ the build
check fails...

arthur


From mst at dev.mellanox.co.il  Wed Jul 25 08:01:55 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 18:01:55 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070725145223.GR16727@bauxite.pathscale.com>
References: <20070724161646.GA24797@mellanox.co.il>
	<20070724165032.GK16727@bauxite.pathscale.com>
	<20070724165550.GD24797@mellanox.co.il>
	<20070724170726.GL16727@bauxite.pathscale.com>
	<20070724171924.GG24797@mellanox.co.il>
	<20070724172826.GN16727@bauxite.pathscale.com>
	<20070724175220.GI24797@mellanox.co.il>
	<20070724221950.GP16727@bauxite.pathscale.com>
	<20070725072723.GA32499@mellanox.co.il>
	<20070725145223.GR16727@bauxite.pathscale.com>
Message-ID: <20070725150155.GA30690@mellanox.co.il>

> > > also, if the upstream
> > > changes touch code that conflicts with a backport
> > > patch, you get to fix the problem as it happens
> > 
> > That's exactly the thing that I do not want to do.
> 
> you don't want to know about a problem a patch
> until days or weeks later when the auto build
> keeps failing and you don't know why?  it is
> easy to catch many problems _before_ the build
> check fails...

I don't work this way.

I just apply all patches before pushing out.
And I see *immediately* the patch that conflicts - unlike merge
conflict where I will know which file conflicts but not
which change created the conflict.

And if a patch conflicts with upstream code,
an option to move the patch aside and defer
the merge decision to patch author
is very important to me: this just happened
with ehca backport and update to 2.6.23-rc1.
I do not want to delay update to 2.6.23-rc1 until
IBM can be bothered to update their backport.

Yes, this means that the specific module won't
build on a specific kernel until the conflict
is resolved. But there are multiple conflicts and each
needs to be resolved by another person.
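
A minimal sketch of this apply-everything-and-defer approach, assuming the
patches live as *.patch files in one directory and are applied with
`git apply`. The function name and the "deferred" directory are made up for
illustration; this is not what ofed_patch.sh actually does.

```shell
# apply_series DIR: apply DIR/*.patch in name order. A patch that does not
# apply cleanly is reported by name and moved to DIR/deferred, so the rest
# of the series can still go in.
apply_series() {
    dir=$1
    mkdir -p "$dir/deferred"
    for p in "$dir"/*.patch; do
        [ -e "$p" ] || continue
        if git apply --check "$p" 2>/dev/null; then
            git apply "$p"
        else
            echo "conflict: $p"
            mv "$p" "$dir/deferred/"
        fi
    done
}
```

The output names the exact patch that failed, which is the information
that can then be bounced to the patch author.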

-- 
MST


From philippe.gregoire at cea.fr  Wed Jul 25 08:28:47 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Wed, 25 Jul 2007 17:28:47 +0200
Subject: [ofa-general] SRP and opensm
Message-ID: <46A76C2F.5070309@cea.fr>

Hi
We are testing DDN InfiniBand storage. We are using OFED 1.2 and SRP.
The nodes are directly connected to the DDN controller InfiniBand ports.
To get this configuration working, we have to run multiple instances of
OpenSM with the -G option, one for each port connected to a TCA port.

Is there any other way to proceed with such a configuration (directly
attached InfiniBand storage)?
Is there any plan to add a multi-port feature to OpenSM?

Philippe Gregoire


From hal.rosenstock at gmail.com  Wed Jul 25 08:38:05 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 11:38:05 -0400
Subject: [ofa-general] SRP and opensm
In-Reply-To: <46A76C2F.5070309@cea.fr>
References: <46A76C2F.5070309@cea.fr>
Message-ID: <f0e08f230707250838q64b6a84egee3592edb8dbbe1c@mail.gmail.com>

Hi Philippe,

On 7/25/07, Philippe Gregoire <philippe.gregoire at cea.fr> wrote:
>
> Hi
> We are testing DDN Infiniband storage. We are using OFED 1.2 and SRP.
> The nodes are directly connected to DDN controler Infiband ports.
> To get this configuration working, we have to run multiple instance of
> OpenSM
> with -G option, one for each port connected a TCA port.


-g ?

> Is there any other way to proceed with such configuration - directed
> attached infiniband storage ?
>
> Is there any plan to add multi-port feature to OpenSM


Not that I'm aware of. There is no plan to change the OpenSM architecture to
make a single instance support multiple subnets.

In this configuration, each port is a separate IB subnet (and there is an SM
for each subnet). If you want to run a single subnet, you need at least one
switch.
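
A minimal sketch of the one-SM-per-port setup, assuming each instance is
bound to a local port with `-g <port GUID>` and put in the background with
`-B`. The function name and the GUIDs below are made-up placeholders, and
the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not run.
DRY_RUN="${DRY_RUN:-1}"

# start_sms GUID...: start one OpenSM instance per local port GUID.
start_sms() {
    for guid in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "opensm -g $guid -B"
        else
            opensm -g "$guid" -B
        fi
    done
}

# Placeholder GUIDs; substitute the GUIDs of the directly attached ports.
start_sms 0x0002c90200000001 0x0002c90200000002
```

The local port GUIDs can usually be listed with `ibstat -p`.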

-- Hal


> Philippe Gregoire

From hnguyen at linux.vnet.ibm.com  Wed Jul 25 09:27:56 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 25 Jul 2007 18:27:56 +0200
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for
	rhel-4.5 - mmap functionality
Message-ID: <200707251827.57095.hnguyen@linux.vnet.ibm.com>

Hi Michael,
Below is the version without conflicts. And it should compile.
As soon as the build scripts are ready, I'll test the whole backport.
Thanks
Nam


From 6fa28219914394064a49c34030a09e23d160231c Mon Sep 17 00:00:00 2001
From: hnguyen at de.ibm.com
Date: Wed, 25 Jul 2007 17:16:53 +0200
Subject: [PATCH ofed-1.3-alpha] ehca: backport_ehca_2_rhel45_umap.patch

---
 drivers/infiniband/hw/ehca/ehca_classes.h |   29 ++-
 drivers/infiniband/hw/ehca/ehca_cq.c      |   66 ++++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    8 +
 drivers/infiniband/hw/ehca/ehca_qp.c      |   92 +++++--
 drivers/infiniband/hw/ehca/ehca_uverbs.c  |  423 +++++++++++++++++------------
 5 files changed, 395 insertions(+), 223 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 3725aa8..49d6155 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -160,14 +160,13 @@ struct ehca_qp {
 	struct ipz_qp_handle ipz_qp_handle;
 	struct ehca_pfqp pf;
 	struct ib_qp_init_attr init_attr;
+	u64 uspace_squeue;
+	u64 uspace_rqueue;
+	u64 uspace_fwh;
 	struct ehca_cq *send_cq;
 	struct ehca_cq *recv_cq;
 	unsigned int sqerr_purgeflag;
 	struct hlist_node list_entries;
-	/* mmap counter for resources mapped into user space */
-	u32 mm_count_squeue;
-	u32 mm_count_rqueue;
-	u32 mm_count_galpa;
 };
 
 #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ)
@@ -188,6 +187,8 @@ struct ehca_cq {
 	struct ipz_cq_handle ipz_cq_handle;
 	struct ehca_pfcq pf;
 	spinlock_t cb_lock;
+	u64 uspace_queue;
+	u64 uspace_fwh;
 	struct hlist_head qp_hashtab[QP_HASHTAB_LEN];
 	struct list_head entry;
 	u32 nr_callbacks;   /* #events assigned to cpu by scaling code */
@@ -195,9 +196,6 @@ struct ehca_cq {
 	wait_queue_head_t wait_completion;
 	spinlock_t task_lock;
 	u32 ownpid;
-	/* mmap counter for resources mapped into user space */
-	u32 mm_count_queue;
-	u32 mm_count_galpa;
 };
 
 enum ehca_mr_flag {
@@ -300,6 +298,20 @@ struct ehca_ucontext {
 	struct ib_ucontext ib_ucontext;
 };
 
+struct ehca_module *ehca_module_new(void);
+
+int ehca_module_delete(struct ehca_module *me);
+
+int ehca_eq_ctor(struct ehca_eq *eq);
+
+int ehca_eq_dtor(struct ehca_eq *eq);
+
+struct ehca_shca *ehca_shca_new(void);
+
+int ehca_shca_delete(struct ehca_shca *me);
+
+struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor);
+
 int ehca_init_pd_cache(void);
 void ehca_cleanup_pd_cache(void);
 int ehca_init_cq_cache(void);
@@ -324,6 +336,7 @@ extern int ehca_use_hp_mr;
 extern int ehca_scaling_code;
 
 struct ipzu_queue_resp {
+	u64 queue;        /* points to first queue entry */
 	u32 qe_size;      /* queue entry size */
 	u32 act_nr_of_sg;
 	u32 queue_length; /* queue length allocated in bytes */
@@ -336,6 +349,7 @@ struct ehca_create_cq_resp {
 	u32 cq_number;
 	u32 token;
 	struct ipzu_queue_resp ipz_queue;
+	struct h_galpas galpas;
 };
 
 struct ehca_create_qp_resp {
@@ -349,6 +363,7 @@ struct ehca_create_qp_resp {
 	u32 dummy; /* padding for 8 byte alignment */
 	struct ipzu_queue_resp ipz_squeue;
 	struct ipzu_queue_resp ipz_rqueue;
+	struct h_galpas galpas;
 };
 
 struct ehca_alloc_cq_parms {
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 9c7172b..ac0bb10 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -268,6 +268,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	if (context) {
 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
 		struct ehca_create_cq_resp resp;
+		struct vm_area_struct *vma;
 		memset(&resp, 0, sizeof(resp));
 		resp.cq_number = my_cq->cq_number;
 		resp.token = my_cq->token;
@@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 		resp.ipz_queue.queue_length = ipz_queue->queue_length;
 		resp.ipz_queue.pagesize = ipz_queue->pagesize;
 		resp.ipz_queue.toggle_state = ipz_queue->toggle_state;
+		ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000,
+				       ipz_queue->queue_length,
+				       (void**)&resp.ipz_queue.queue,
+				       &vma);
+		if (ret) {
+			ehca_err(device, "Could not mmap queue pages");
+			cq = ERR_PTR(ret);
+			goto create_cq_exit4;
+		}
+		my_cq->uspace_queue = resp.ipz_queue.queue;
+		resp.galpas = my_cq->galpas;
+		ret = ehca_mmap_register(my_cq->galpas.user.fw_handle,
+					 (void**)&resp.galpas.kernel.fw_handle,
+					 &vma);
+		if (ret) {
+			ehca_err(device, "Could not mmap fw_handle");
+			cq = ERR_PTR(ret);
+			goto create_cq_exit5;
+		}
+		my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
 		if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
 			ehca_err(device, "Copy to udata failed.");
-			goto create_cq_exit4;
+			goto create_cq_exit6;
 		}
 	}
 
 	return cq;
 
+create_cq_exit6:
+	ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
+
+create_cq_exit5:
+	ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length);
+
 create_cq_exit4:
 	ipz_queue_dtor(NULL, &my_cq->ipz_queue);
 
@@ -307,6 +334,7 @@ create_cq_exit1:
 int ehca_destroy_cq(struct ib_cq *cq)
 {
 	u64 h_ret;
+	int ret;
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	int cq_num = my_cq->cq_number;
 	struct ib_device *device = cq->device;
@@ -316,20 +344,6 @@ int ehca_destroy_cq(struct ib_cq *cq)
 	u32 cur_pid = current->tgid;
 	unsigned long flags;
 
-	if (cq->uobject) {
-		if (my_cq->mm_count_galpa || my_cq->mm_count_queue) {
-			ehca_err(device, "Resources still referenced in "
-				 "user space cq_num=%x", my_cq->cq_number);
-			return -EINVAL;
-		}
-		if (my_cq->ownpid != cur_pid) {
-			ehca_err(device, "Invalid caller pid=%x ownpid=%x "
-				 "cq_num=%x",
-				 cur_pid, my_cq->ownpid, my_cq->cq_number);
-			return -EINVAL;
-		}
-	}
-
 	/*
 	 * remove the CQ from the idr first to make sure
 	 * no more interrupt tasklets will touch this CQ
@@ -342,6 +356,26 @@ int ehca_destroy_cq(struct ib_cq *cq)
 	wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events));
 
 	/* nobody's using our CQ any longer -- we can destroy it */
+
+	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
+		ehca_err(device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_cq->ownpid);
+		return -EINVAL;
+	}
+
+	/* un-mmap if vma alloc */
+	if (my_cq->uspace_queue ) {
+		ret = ehca_munmap(my_cq->uspace_queue,
+				  my_cq->ipz_queue.queue_length);
+		if (ret)
+			ehca_err(device, "Could not munmap queue ehca_cq=%p "
+				 "cq_num=%x", my_cq, cq_num);
+		ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE);
+		if (ret)
+			ehca_err(device, "Could not munmap fwh ehca_cq=%p "
+				 "cq_num=%x", my_cq, cq_num);
+	}
+
 	h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0);
 	if (h_ret == H_R_STATE) {
 		/* cq in err: read err data and destroy it forcibly */
@@ -370,7 +404,7 @@ int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	u32 cur_pid = current->tgid;
 
-	if (cq->uobject && my_cq->ownpid != cur_pid) {
+	if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) {
 		ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x",
 			 cur_pid, my_cq->ownpid);
 		return -EINVAL;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index dce503b..7b052f4 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -189,6 +189,14 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
 
 void ehca_poll_eqs(unsigned long data);
 
+int ehca_mmap_nopage(u64 foffset,u64 length,void **mapped,
+		     struct vm_area_struct **vma);
+
+int ehca_mmap_register(u64 physical,void **mapped,
+		       struct vm_area_struct **vma);
+
+int ehca_munmap(unsigned long addr, size_t len);
+
 #ifdef CONFIG_PPC_64K_PAGES
 void *ehca_alloc_fw_ctrlblock(gfp_t flags);
 void ehca_free_fw_ctrlblock(void *ptr);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index bd0e64b..1dccaaa 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -265,14 +265,18 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype)
 /*
  * init userspace queue info from ipz_queue data
  */
-static inline void queue2resp(struct ipzu_queue_resp *resp,
-			      struct ipz_queue *queue)
+static inline int queue2resp(struct ipzu_queue_resp *resp,
+			     struct ipz_queue *queue,
+			     u64 fofs)
 {
+	struct vm_area_struct *vma;
 	resp->qe_size = queue->qe_size;
 	resp->act_nr_of_sg = queue->act_nr_of_sg;
 	resp->queue_length = queue->queue_length;
 	resp->pagesize = queue->pagesize;
 	resp->toggle_state = queue->toggle_state;
+	return ehca_mmap_nopage(fofs, queue->queue_length,
+				(void **)&resp->queue, &vma);
 }
 
 /*
@@ -731,6 +735,7 @@ static struct ehca_qp *internal_create_qp(
 	/* copy queues, galpa data to user space */
 	if (context && udata) {
 		struct ehca_create_qp_resp resp;
+		struct vm_area_struct *vma;
 		memset(&resp, 0, sizeof(resp));
 
 		resp.qp_num = my_qp->real_qp_num;
@@ -741,20 +746,55 @@ static struct ehca_qp *internal_create_qp(
 		resp.real_qp_num = my_qp->real_qp_num;
 		resp.ipz_rqueue.offset = my_qp->ipz_rqueue.offset;
 		resp.ipz_squeue.offset = my_qp->ipz_squeue.offset;
-		if (HAS_SQ(my_qp))
-			queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue);
-		if (HAS_RQ(my_qp))
-			queue2resp(&resp.ipz_rqueue, &my_qp->ipz_rqueue);
+		if (HAS_SQ(my_qp)) {
+			ret = queue2resp(
+				&resp.ipz_squeue, &my_qp->ipz_squeue,
+				((u64)(my_qp->token) << 32) | 0x23000000);
+			if (ret) {
+				ehca_err(pd->device,
+					 "Could not mmap squeue pages");
+				goto create_qp_exit4;
+			}
+		}
+		if (HAS_RQ(my_qp)) {
+			ret = queue2resp(
+				&resp.ipz_rqueue, &my_qp->ipz_rqueue,
+				((u64)(my_qp->token) << 32) | 0x22000000);
+			if (ret) {
+				ehca_err(pd->device,
+					 "Could not mmap rqueue pages");
+				goto create_qp_exit5;
+			}
+		}
+		/* fw_handle */
+		resp.galpas = my_qp->galpas;
+		ret = ehca_mmap_register(my_qp->galpas.user.fw_handle,
+					 (void **)&resp.galpas.kernel.fw_handle,
+					 &vma);
+		if (ret) {
+			ehca_err(pd->device, "Could not mmap fw_handle");
+			goto create_qp_exit6;
+		}
+		my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle;
 
 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
 			ehca_err(pd->device, "Copy to udata failed");
 			ret = -EINVAL;
-			goto create_qp_exit4;
+			goto create_qp_exit7;
 		}
 	}
 
 	return my_qp;
 
+create_qp_exit7:
+	ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
+
+create_qp_exit6:
+	ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length);
+
+create_qp_exit5:
+	ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length);
+
 create_qp_exit4:
 	if (HAS_RQ(my_qp))
 		ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue);
@@ -1106,7 +1146,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 	     my_qp->qp_type == IB_QPT_SMI) &&
 	    statetrans == IB_QPST_SQE2RTS) {
 		/* mark next free wqe if kernel */
-		if (!ibqp->uobject) {
+		if (my_qp->uspace_squeue == 0) {
 			struct ehca_wqe *wqe;
 			/* lock send queue */
 			spin_lock_irqsave(&my_qp->spinlock_s, flags);
@@ -1717,19 +1757,11 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
 	enum ib_qp_type	qp_type;
 	unsigned long flags;
 
-	if (uobject) {
-		if (my_qp->mm_count_galpa ||
-		    my_qp->mm_count_rqueue || my_qp->mm_count_squeue) {
-			ehca_err(dev, "Resources still referenced in "
-				 "user space qp_num=%x", qp_num);
-			return -EINVAL;
-		}
-		if (my_pd->ownpid != cur_pid) {
-			ehca_err(dev, "Invalid caller pid=%x ownpid=%x",
-				 cur_pid, my_pd->ownpid);
-			return -EINVAL;
-		}
-	}
+	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
+	    my_pd->ownpid != cur_pid) {
+		ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x",
+			 cur_pid, my_pd->ownpid);
+		return -EINVAL;
 
 	if (my_qp->send_cq) {
 		ret = ehca_cq_unassign_qp(my_qp->send_cq, qp_num);
@@ -1745,6 +1777,24 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp,
 	idr_remove(&ehca_qp_idr, my_qp->token);
 	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
+	/* un-mmap if vma alloc */
+	if (my_qp->uspace_rqueue) {
+		ret = ehca_munmap(my_qp->uspace_rqueue,
+				  my_qp->ipz_rqueue.queue_length);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap rqueue "
+				 "qp_num=%x", qp_num);
+		ret = ehca_munmap(my_qp->uspace_squeue,
+				  my_qp->ipz_squeue.queue_length);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap squeue "
+				 "qp_num=%x", qp_num);
+		ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE);
+		if (ret)
+			ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x",
+				 qp_num);
+	}
+
 	h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp);
 	if (h_ret != H_SUCCESS) {
 		ehca_err(dev, "hipz_h_destroy_qp() failed rc=%lx "
diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c
index 4bc687f..5df5b96 100644
--- a/drivers/infiniband/hw/ehca/ehca_uverbs.c
+++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c
@@ -68,184 +68,104 @@ int ehca_dealloc_ucontext(struct ib_ucontext *context)
 	return 0;
 }
 
-static void ehca_mm_open(struct vm_area_struct *vma)
+struct page *ehca_nopage(struct vm_area_struct *vma,
+			 unsigned long address, int *type)
 {
-	u32 *count = (u32 *)vma->vm_private_data;
-	if (!count) {
-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-		return;
-	}
-	(*count)++;
-	if (!(*count))
-		ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
-		     vma->vm_start, vma->vm_end, *count);
-}
-
-static void ehca_mm_close(struct vm_area_struct *vma)
-{
-	u32 *count = (u32 *)vma->vm_private_data;
-	if (!count) {
-		ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx",
-			     vma->vm_start, vma->vm_end);
-		return;
-	}
-	(*count)--;
-	ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x",
-		     vma->vm_start, vma->vm_end, *count);
-}
+	struct page *mypage = NULL;
+	u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT;
+	u32 idr_handle = fileoffset >> 32;
+	u32 q_type = (fileoffset >> 28) & 0xF;	  /* CQ, QP,...        */
+	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
+	u32 cur_pid = current->tgid;
+	unsigned long flags;
+	struct ehca_cq *cq;
+	struct ehca_qp *qp;
+	struct ehca_pd *pd;
+	u64 offset;
+	void *vaddr;
 
-static struct vm_operations_struct vm_ops = {
-	.open =	ehca_mm_open,
-	.close = ehca_mm_close,
-};
+	switch (q_type) {
+	case 1: /* CQ */
+		spin_lock_irqsave(&ehca_cq_idr_lock, flags);
+		cq = idr_find(&ehca_cq_idr, idr_handle);
+		spin_unlock_irqrestore(&ehca_cq_idr_lock, flags);
 
-static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas,
-			u32 *mm_count)
-{
-	int ret;
-	u64 vsize, physical;
+		/* make sure this mmap really belongs to the authorized user */
+		if (!cq) {
+			ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS");
+			return NOPAGE_SIGBUS;
+		}
 
-	vsize = vma->vm_end - vma->vm_start;
-	if (vsize != EHCA_PAGESIZE) {
-		ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start);
-		return -EINVAL;
-	}
+		if (cq->ownpid != cur_pid) {
+			ehca_err(cq->ib_cq.device,
+				 "Invalid caller pid=%x ownpid=%x",
+				 cur_pid, cq->ownpid);
+			return NOPAGE_SIGBUS;
+		}
+
+		if (rsrc_type == 2) {
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&cq->ipz_queue, offset);
+			ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
+		}
+		break;
 
-	physical = galpas->user.fw_handle;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical);
-	/* VM_IO | VM_RESERVED are set by remap_pfn_range() */
-	ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
-			      vsize, vma->vm_page_prot);
-	if (unlikely(ret)) {
-		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
-		return -ENOMEM;
-	}
+	case 2: /* QP */
+		spin_lock_irqsave(&ehca_qp_idr_lock, flags);
+		qp = idr_find(&ehca_qp_idr, idr_handle);
+		spin_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
-	vma->vm_private_data = mm_count;
-	(*mm_count)++;
-	vma->vm_ops = &vm_ops;
+		/* make sure this mmap really belongs to the authorized user */
+		if (!qp) {
+			ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS");
+			return NOPAGE_SIGBUS;
+		}
 
-	return 0;
-}
+		pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd);
+		if (pd->ownpid != cur_pid) {
+			ehca_err(qp->ib_qp.device,
+				 "Invalid caller pid=%x ownpid=%x",
+				 cur_pid, pd->ownpid);
+			return NOPAGE_SIGBUS;
+		}
+
+		if (rsrc_type == 2) {	/* rqueue */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset);
+			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
+		} else if (rsrc_type == 3) {	/* squeue */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp);
+			offset = address - vma->vm_start;
+			vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset);
+			ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p",
+				 offset, vaddr);
+			mypage = virt_to_page(vaddr);
+		}
+		break;
+
+	default:
+		ehca_gen_err("bad queue type %x", q_type);
+		return NOPAGE_SIGBUS;
+	}
 
-static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue,
-			   u32 *mm_count)
-{
-	int ret;
-	u64 start, ofs;
-	struct page *page;
-
-	vma->vm_flags |= VM_RESERVED;
-	start = vma->vm_start;
-	for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) {
-		u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs);
-		page = virt_to_page(virt_addr);
-		ret = vm_insert_page(vma, start, page);
-		if (unlikely(ret)) {
-			ehca_gen_err("vm_insert_page() failed rc=%x", ret);
-			return ret;
-		}
-		start += PAGE_SIZE;
+	if (!mypage) {
+		ehca_gen_err("Invalid page adr==NULL ret=NOPAGE_SIGBUS");
+		return NOPAGE_SIGBUS;
 	}
-	vma->vm_private_data = mm_count;
-	(*mm_count)++;
-	vma->vm_ops = &vm_ops;
-
-	return 0;
-}
+	get_page(mypage);
 
-static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq,
-			u32 rsrc_type)
-{
-	int ret;
-
-	switch (rsrc_type) {
-	case 1: /* galpa fw handle */
-		ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number);
-		ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa);
-		if (unlikely(ret)) {
-			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_fw() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
-		}
-		break;
+	return mypage;
+}
 
-	case 2: /* cq queue_addr */
-		ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number);
-		ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue);
-		if (unlikely(ret)) {
-			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_queue() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
-		}
-		break;
-
-	default:
-		ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x",
-			 rsrc_type, cq->cq_number);
-		return -EINVAL;
-	}
-
-	return 0;
-}
-
-static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp,
-			u32 rsrc_type)
-{
-	int ret;
-
-	switch (rsrc_type) {
-	case 1: /* galpa fw handle */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num);
-		ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "remap_pfn_range() failed ret=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return -ENOMEM;
-		}
-		break;
-
-	case 2: /* qp rqueue_addr */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue",
-			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_rqueue,
-				      &qp->mm_count_rqueue);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_queue(rq) failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
-		}
-		break;
-
-	case 3: /* qp squeue_addr */
-		ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue",
-			 qp->ib_qp.qp_num);
-		ret = ehca_mmap_queue(vma, &qp->ipz_squeue,
-				      &qp->mm_count_squeue);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_queue(sq) failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
-		}
-		break;
-
-	default:
-		ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x",
-			 rsrc_type, qp->ib_qp.qp_num);
-		return -EINVAL;
-	}
-
-	return 0;
-}
+static struct vm_operations_struct ehcau_vm_ops = {
+	.nopage = ehca_nopage,
+};
 
 int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 {
@@ -255,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 	u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */
 	u32 cur_pid = current->tgid;
 	u32 ret;
+	u64 vsize, physical;
 	struct ehca_cq *cq;
 	struct ehca_qp *qp;
 	struct ehca_pd *pd;
@@ -280,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 		if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context)
 			return -EINVAL;
 
-		ret = ehca_mmap_cq(vma, cq, rsrc_type);
-		if (unlikely(ret)) {
-			ehca_err(cq->ib_cq.device,
-				 "ehca_mmap_cq() failed rc=%x cq_num=%x",
-				 ret, cq->cq_number);
-			return ret;
+		switch (rsrc_type) {
+		case 1: /* galpa fw handle */
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq);
+			vma->vm_flags |= VM_RESERVED;
+			vsize = vma->vm_end - vma->vm_start;
+			if (vsize != EHCA_PAGESIZE) {
+				ehca_err(cq->ib_cq.device, "invalid vsize=%lx",
+					 vma->vm_end - vma->vm_start);
+				return -EINVAL;
+			}
+
+			physical = cq->galpas.user.fw_handle;
+			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+			vma->vm_flags |= VM_IO | VM_RESERVED;
+
+			ehca_dbg(cq->ib_cq.device,
+				 "vsize=%lx physical=%lx", vsize, physical);
+			ret = remap_pfn_range(vma, vma->vm_start,
+					      physical >> PAGE_SHIFT, vsize,
+					      vma->vm_page_prot);
+			if (ret) {
+				ehca_err(cq->ib_cq.device,
+					 "remap_pfn_range() failed ret=%x",
+					 ret);
+				return -ENOMEM;
+			}
+			break;
+
+		case 2: /* cq queue_addr */
+			ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		default:
+			ehca_err(cq->ib_cq.device, "bad resource type %x",
+				 rsrc_type);
+			return -EINVAL;
 		}
 		break;
 
@@ -310,12 +263,50 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 		if (!uobject || uobject->context != context)
 			return -EINVAL;
 
-		ret = ehca_mmap_qp(vma, qp, rsrc_type);
-		if (unlikely(ret)) {
-			ehca_err(qp->ib_qp.device,
-				 "ehca_mmap_qp() failed rc=%x qp_num=%x",
-				 ret, qp->ib_qp.qp_num);
-			return ret;
+		switch (rsrc_type) {
+		case 1: /* galpa fw handle */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vsize = vma->vm_end - vma->vm_start;
+			if (vsize != EHCA_PAGESIZE) {
+				ehca_err(qp->ib_qp.device, "invalid vsize=%lx",
+					 vma->vm_end - vma->vm_start);
+				return -EINVAL;
+			}
+
+			physical = qp->galpas.user.fw_handle;
+			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+			vma->vm_flags |= VM_IO | VM_RESERVED;
+
+			ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx",
+				 vsize, physical);
+			ret = remap_pfn_range(vma, vma->vm_start,
+					      physical >> PAGE_SHIFT, vsize,
+					      vma->vm_page_prot);
+			if (ret) {
+				ehca_err(qp->ib_qp.device,
+					 "remap_pfn_range() failed ret=%x",
+					 ret);
+				return -ENOMEM;
+			}
+			break;
+
+		case 2: /* qp rqueue_addr */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		case 3: /* qp squeue_addr */
+			ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp);
+			vma->vm_flags |= VM_RESERVED;
+			vma->vm_ops = &ehcau_vm_ops;
+			break;
+
+		default:
+			ehca_err(qp->ib_qp.device, "bad resource type %x",
+				 rsrc_type);
+			return -EINVAL;
 		}
 		break;
 
@@ -326,3 +317,77 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
 
 	return 0;
 }
+
+int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped,
+		     struct vm_area_struct **vma)
+{
+	down_write(&current->mm->mmap_sem);
+	*mapped = (void *)do_mmap(NULL, 0, length, PROT_WRITE,
+				  MAP_SHARED | MAP_ANONYMOUS,
+				  foffset);
+	up_write(&current->mm->mmap_sem);
+	if (IS_ERR_VALUE((unsigned long)*mapped)) {
+		ehca_gen_err("couldn't mmap foffset=%lx length=%lx",
+			     foffset, length);
+		return -EINVAL;
+	}
+
+	*vma = find_vma(current->mm, (u64)*mapped);
+	if (!(*vma)) {
+		down_write(&current->mm->mmap_sem);
+		do_munmap(current->mm, (unsigned long)*mapped, length);
+		up_write(&current->mm->mmap_sem);
+		ehca_gen_err("couldn't find vma queue=%p", *mapped);
+		return -EINVAL;
+	}
+	(*vma)->vm_flags |= VM_RESERVED;
+	(*vma)->vm_ops = &ehcau_vm_ops;
+
+	return 0;
+}
+
+int ehca_mmap_register(u64 physical, void **mapped,
+		       struct vm_area_struct **vma)
+{
+	int ret;
+	unsigned long vsize;
+	/* ehca hw supports only 4k page */
+	ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma);
+	if (ret) {
+		ehca_gen_err("couldn't mmap physical=%lx", physical);
+		return ret;
+	}
+
+	(*vma)->vm_flags |= VM_RESERVED;
+	vsize = (*vma)->vm_end - (*vma)->vm_start;
+	if (vsize != EHCA_PAGESIZE) {
+		ehca_gen_err("invalid vsize=%lx",
+			     (*vma)->vm_end - (*vma)->vm_start);
+		return -EINVAL;
+	}
+
+	(*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot);
+	(*vma)->vm_flags |= VM_IO | VM_RESERVED;
+
+	ret = remap_pfn_range((*vma), (*vma)->vm_start,
+			      physical >> PAGE_SHIFT, vsize,
+			      (*vma)->vm_page_prot);
+	if (ret) {
+		ehca_gen_err("remap_pfn_range() failed ret=%x", ret);
+		return -ENOMEM;
+	}
+
+	return 0;
+
+}
+
+int ehca_munmap(unsigned long addr, size_t len) {
+	int ret = 0;
+	struct mm_struct *mm = current->mm;
+	if (mm) {
+		down_write(&mm->mmap_sem);
+		ret = do_munmap(mm, addr, len);
+		up_write(&mm->mmap_sem);
+	}
+	return ret;
+}
-- 
1.5.2
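
The offsets handed to queue2resp() in the patch above (0x23000000 for the
send queue, 0x22000000 for the receive queue, each OR-ed with the token
shifted left by 32) appear to follow the same layout that ehca_nopage() and
ehca_mmap() decode: token in the upper 32 bits, queue type in bits 28..31,
resource type in bits 24..27. Below is a minimal user-space sketch of that
packing. It assumes nothing beyond the shifts and masks visible in the diff;
the helper, struct and enum names are made up for illustration and are not
part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the numeric values come from the patch. */
enum { EHCA_Q_TYPE_CQ = 1, EHCA_Q_TYPE_QP = 2 };
enum {
	EHCA_RSRC_FW_HANDLE = 1,	/* galpa fw handle */
	EHCA_RSRC_RQUEUE = 2,		/* QP receive queue (or the CQ queue) */
	EHCA_RSRC_SQUEUE = 3		/* QP send queue */
};

struct ehca_foffset {
	uint32_t idr_handle;	/* token used for the idr lookup */
	uint32_t q_type;	/* CQ or QP */
	uint32_t rsrc_type;	/* fw handle, rqueue or squeue */
};

/* Pack token, queue type and resource type into one 64-bit file offset. */
static uint64_t ehca_foffset_encode(uint32_t token, uint32_t q_type,
				    uint32_t rsrc_type)
{
	/* Both type fields are 4 bits wide in the decoder's masks. */
	assert(q_type <= 0xF && rsrc_type <= 0xF);
	return ((uint64_t)token << 32) |
	       ((uint64_t)q_type << 28) |
	       ((uint64_t)rsrc_type << 24);
}

/* Mirror of the shifts and masks at the top of ehca_nopage(). */
static struct ehca_foffset ehca_foffset_decode(uint64_t fileoffset)
{
	struct ehca_foffset f;

	f.idr_handle = (uint32_t)(fileoffset >> 32);
	f.q_type = (uint32_t)(fileoffset >> 28) & 0xF;
	f.rsrc_type = (uint32_t)(fileoffset >> 24) & 0xF;
	return f;
}
```

With these helpers, ehca_foffset_encode(token, EHCA_Q_TYPE_QP,
EHCA_RSRC_SQUEUE) yields the same value as the literal
((u64)token << 32) | 0x23000000 used for the send queue. An offset of 0
decodes to queue type 0, which the switch in ehca_nopage() would treat as a
bad queue type.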


From suri at baymicrosystems.com  Wed Jul 25 09:54:04 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Wed, 25 Jul 2007 12:54:04 -0400
Subject: [ofa-general] installing 1.2-GA on Redhat EL5
In-Reply-To: <46A54659.8010608@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
Message-ID: <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice>

Doug:

I had ofed-1.2-rc1 installed on a server running Red Hat EL5. I am trying to upgrade it
to ofed-1.2-GA. The build finished, but the install gives me this error:


--------------

Installing OFED software into /usr/local/ofed_1.2

Running /bin/rpm -ihv --force --nodeps
/root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
/root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
|
ERROR: Failed executing "/bin/rpm -ihv --force --nodeps
/root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
/root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
"

-------------------

Any ideas...


Thanks,
Suri


> -----Original Message-----
> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf
> Of Sean Hefty
> Sent: Monday, July 23, 2007 8:23 PM
> To: Yevgeny Kliteynik
> Cc: OpenIB
> Subject: Re: [ofa-general] QoS RFC
> 
> > 2.5. ULPs that use CM interface (like SRP) should have their own
> > pre-assigned Service-ID and use it while obtaining PR/MPR for
> > establishing connections. The SA receiving the PR/MPR should match it
> > against the policy and return the appropriate PR/MPR including SL,
> > MTU and RATE.
> 
> We need to ensure that this can work without pre-assigned service IDs,
> or at least service IDs that are assigned within a fairly wide range,
> such as locally assigned IDs.
> 
> > 2.6. ULPs and programs using CMA to establish RC connection should
> > provide the CMA the target IP and Service-ID. Some of the ULPs might
> > also provide QoS-Class (E.g. for SDP sockets that are provided the
> > TOS socket option). The CMA should then use the provided Service-ID
> > and optional QoS-Class and pass them in the PR/MPR request. The
> > resulting PR/MPR should be used for configuring the connection QP.
> 
> The interface to the CMA needs to remain as transport independent as
> possible, and I am unsure of the transport independence of tying QoS to
> the destination port number.  (I'm not disagreeing; I'm just not sure at
> the moment it's the right approach.)
> 
> > PathRecord and MultiPathRecord enhancement for QoS: As mentioned
> > above the PathRecord and MultiPathRecord attributes should be
> > enhanced to carry the Service-ID which is a 64bit value, which has
> > been standardized by the IBTA. A new field QoS-Class is also
> > provided. A new capability bit should describe the SM QoS support in
> > the SA class port info. This approach provides an easy migration path
> > for existing access layer and ULPs by not introducing new set of
> > PR/MPR attribute.
> 
> Has any thought been given to how to make this scale?
> 
> > 5. CMA features ----------------
> >
> > The CMA interface supports Service-ID through the notion of port
> > space as a prefix to the port_num, which is part of the sockaddr
> > provided to rdma_resolve_addr(). What is missing is the explicit
> > request for a QoS-Class that should allow the ULP (like SDP) to
> > propagate a specific request for a class of service. A mechanism for
> > providing the QoS-Class is available in the IPv6 address, so we could
> > use that address field. Another option is to implement a special
> > connection options API for CMA.
> >
> > Missing functionality by CMA is the usage of the provided QoS-Class
> > and Service-ID in the sent PR/MPR. When a response is obtained it is
> > an existing requirement for the CMA to use the PR/MPR from the
> > response in setting up the QP address vector.
> 
> The most natural function to specify additional QoS parameters would be
> rdma_resolve_route.
> 
> - Sean
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From transter at gmail.com  Wed Jul 25 09:57:50 2007
From: transter at gmail.com (lbt)
Date: Wed, 25 Jul 2007 09:57:50 -0700
Subject: [ofa-general] Lost in-service traps during Open SM migration
Message-ID: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>

 Hello,

I have been seeing a problem where a subscriber for in-service traps is not
informed when the port of the master OpenSM is restored (i.e. when the
restore causes an SM migration).

I have an IB subnet with 2 nodes running OpenSM with different priorities
(OpenSM Rev:openib-2.0.5). Another node on the subnet has subscribed for
forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC trap events. I've been doing
cable pull tests on the IB ports to check whether the in-service handler I
have subscribed gets invoked when I restore the cable. Everything works as
expected (i.e. my in-service handler is invoked) whenever I restore the
cable on the lower priority SM's IB port without touching the master SM
port. But if I cause an SM migration by restoring the port of the higher
priority SM, the in-service trap is not generated on the cable restore.

Steps to Reproduce:
1) Start with the port of the higher priority SM disconnected.
2) Restore the port cable on the higher priority SM.
--> This causes an SM migration as expected, and the migration itself
completes okay.
--> I expected the restoration of the higher priority SM port to also
trigger an in-service trap and notify subscribers, but that doesn't happen.

I have collected debug logs from both OpenSMs, and the reason appears to be:
1) In-service traps are generated for the ports on the Master SM's
new_ports_list, but only after LID assignment.
2) When the higher priority SM port is restored, the restored port is added
to the lower priority SM's new_ports_list (since that SM is still the Master
at that point in time).
3) The handover of Master SM from the lower priority to the higher priority
SM occurs before LID assignment, and thus before traps can be generated for
the ports on new_ports_list.
4) The higher priority SM is now Master SM, but it has an empty
new_ports_list, so it generates no trap either.
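
The four steps above can be replayed with a toy model. This is only a
sketch of the suspected ordering problem, not OpenSM code: the struct and
all toy_* functions are hypothetical, and the model assumes that the
pending list stays with the old master at handover and that only the
current master reports new ports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_NEW_PORTS 8

/* Toy stand-in for one SM instance; not an OpenSM structure. */
struct toy_sm {
	int is_master;
	uint64_t new_ports[TOY_MAX_NEW_PORTS];
	size_t n_new_ports;
	unsigned int traps_sent;
};

/* Sweep discovery: only the current master records newly seen ports. */
static void toy_discover_port(struct toy_sm *sm, uint64_t guid)
{
	if (!sm->is_master)
		return;
	assert(sm->n_new_ports < TOY_MAX_NEW_PORTS);
	sm->new_ports[sm->n_new_ports++] = guid;
}

/* Runs after LID assignment: one in-service trap per pending port. */
static void toy_report_new_ports(struct toy_sm *sm)
{
	if (!sm->is_master)
		return;
	sm->traps_sent += (unsigned int)sm->n_new_ports;
	sm->n_new_ports = 0;
}

/* Handover: mastership moves, the pending list does not. */
static void toy_handover(struct toy_sm *from, struct toy_sm *to)
{
	from->is_master = 0;
	to->is_master = 1;
}

/* Restore with migration; returns the traps sent by both SMs together. */
static unsigned int toy_replay_migration(uint64_t restored_guid)
{
	struct toy_sm low = { .is_master = 1 };
	struct toy_sm high = { .is_master = 0 };

	toy_discover_port(&low, restored_guid);	/* step 2 */
	toy_handover(&low, &high);		/* step 3 */
	toy_report_new_ports(&low);		/* standby now: no report */
	toy_report_new_ports(&high);		/* step 4: list is empty */
	return low.traps_sent + high.traps_sent;
}

/* Restore without migration: the master reports its own list. */
static unsigned int toy_replay_no_migration(uint64_t restored_guid)
{
	struct toy_sm low = { .is_master = 1 };

	toy_discover_port(&low, restored_guid);
	toy_report_new_ports(&low);
	return low.traps_sent;
}
```

In this model the restored port GUID is recorded by the lower priority SM
but never reported by either SM once the handover happens first, while the
same restore without a handover produces one trap.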

Does this look like a legitimate OpenSM bug? Any feedback would be much
appreciated, and if I can help further in any way please let me know.


Subset of logs from lower priority SM during the cable restore of higher
priority SM port:
Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request:
Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A
TID:0x00000016000012e1
Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received
signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
14:31:56 ******************** INITIATING HEAVY SWEEP
**********************
Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received
signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
OSM_SM_STATE_SWEEP_HEAVY_SELF
Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port
GUID:0x00504501483e0000 to new_ports_list
Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal
OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal
OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
14:31:56 ********************* HEAVY SWEEP COMPLETE ***********************
Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received
signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER
14:31:56 ******************** ENTERING SM STANDBY STATE *******************

Subset of logs from higher priority SM during the cable restore of higher
priority SM port:

Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received
signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
IB_SMINFO_STATE_DISCOVERING
Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
******************** ENTERING SM MASTER STATE ********************
Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg:
**** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [
----> no in-service traps are generated and notices forwarded because there
are no ports on this list
Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ]


 Thanks!
Lan

From mshefty at ichips.intel.com  Wed Jul 25 09:58:46 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jul 2007 09:58:46 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A6F50C.5000906@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com>
Message-ID: <46A78146.1090304@ichips.intel.com>

> I am willing to go with the local sa coming to serve large MPI jobs, so 
> you load as a prerequisite to spawning large all-to-all job.
> 
> But, I think the default for IPoIB needs to be usage of non cached PR.

I think this ties together two things that aren't directly related.  We 
have two network stacks running on top of each other here.  Their 
policies should be separate.

As an example, let's reverse this.  Imagine instead that you implement 
IB over IP.  Should an IB path refresh policy dictate that IP update its 
ARP tables?  Or, looking at it differently, do you prevent IP from 
updating the ARP table unless the IB stack asks for it?

The policy for local PR caching should be set by an administrator.  Now, 
we could provide a policy setting that ties it to the ARP cache, which 
sounds like a good idea.  This will be less efficient in some use 
models, more efficient in others.  But not all PRs belong to IPoIB, so 
we need a way to handle this.  However, I don't believe that we have to 
always enforce such a policy, especially since the current stack doesn't 
have this behavior today.
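
One way to picture such an administrator-set policy is sketched below. It
is only an illustration of the idea, not existing code: the enum, struct
and function names are hypothetical, and letting non-IPoIB path records
fall back to plain caching under the ARP-tied setting is an assumption
rather than something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical administrator-selected policy for the local PR cache. */
enum pr_cache_policy {
	PR_CACHE_OFF,		/* always query the SA */
	PR_CACHE_ON,		/* use any valid cached record */
	PR_CACHE_FOLLOW_ARP	/* IPoIB records live as long as their ARP entry */
};

/* Hypothetical view of one cached path record. */
struct pr_cache_entry {
	bool valid;		/* record is present in the cache */
	bool owned_by_ipoib;	/* record was resolved on behalf of IPoIB */
	bool arp_entry_valid;	/* matching ARP entry is still valid */
};

/* Decide whether a cached path record may be used under a given policy. */
static bool pr_cache_may_use(enum pr_cache_policy policy,
			     const struct pr_cache_entry *e)
{
	assert(e != NULL);
	if (!e->valid)
		return false;

	switch (policy) {
	case PR_CACHE_OFF:
		return false;
	case PR_CACHE_ON:
		return true;
	case PR_CACHE_FOLLOW_ARP:
		/* Assumption: records without an ARP entry are just cached. */
		return e->owned_by_ipoib ? e->arp_entry_valid : true;
	}
	return false;
}
```

Keeping the policy as a separate input means the IPoIB behavior and the
behavior for other path record users can be changed independently.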

- Sean


From dledford at redhat.com  Wed Jul 25 09:59:50 2007
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 25 Jul 2007 16:59:50 +0000
Subject: [ofa-general] Re: installing 1.2-GA on Redhat EL5
In-Reply-To: <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
	<00dc01c7cedc$6c0f2f90$1914a8c0@surioffice>
Message-ID: <1185382791.5165.665.camel@firewall.xsintricity.com>

On Wed, 2007-07-25 at 12:54 -0400, Suresh Shelvapille wrote:
> Doug:
> 
> I had ofed-1.2-rc1 installed on a server running redhat el5. I am trying to upgrade the
> release to ofed-1.2-GA. The build finished and the install gives me this error:
> 
> 
> --------------
> 
> Installing OFED software into /usr/local/ofed_1.2
> 
> Running /bin/rpm -ihv --force --nodeps
> /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
> /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
> |
> ERROR: Failed executing "/bin/rpm -ihv --force --nodeps
> /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
> /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
> "
> 
> -------------------
> 
> Any ideas...

Not from this output.  I would need the actual rpm error messages to
know what's wrong.  Try running the above rpm command by hand and
copy-n-pasting the errors.

> 
> Thanks,
> Suri
> 
> 
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf
> > Of Sean Hefty
> > Sent: Monday, July 23, 2007 8:23 PM
> > To: Yevgeny Kliteynik
> > Cc: OpenIB
> > Subject: Re: [ofa-general] QoS RFC
> > 
> > > 2.5. ULPs that use CM interface (like SRP) should have their own
> > > pre-assigned Service-ID and use it while obtaining PR/MPR for
> > > establishing connections. The SA receiving the PR/MPR should match it
> > > against the policy and return the appropriate PR/MPR including SL,
> > > MTU and RATE.
> > 
> > We need to ensure that this can work without pre-assigned service IDs,
> > or at least service IDs that are assigned within a fairly wide range,
> > such as locally assigned IDs.
> > 
> > > 2.6. ULPs and programs using CMA to establish RC connection should
> > > provide the CMA the target IP and Service-ID. Some of the ULPs might
> > > also provide QoS-Class (E.g. for SDP sockets that are provided the
> > > TOS socket option). The CMA should then use the provided Service-ID
> > > and optional QoS-Class and pass them in the PR/MPR request. The
> > > resulting PR/MPR should be used for configuring the connection QP.
> > 
> > The interface to the CMA needs to remain as transport independent as
> > possible, and I am unsure of the transport independence of tying QoS to
> > the destination port number.  (I'm not disagreeing; I'm just not sure at
> > the moment it's the right approach.)
> > 
> > > PathRecord and MultiPathRecord enhancement for QoS: As mentioned
> > > above the PathRecord and MultiPathRecord attributes should be
> > > enhanced to carry the Service-ID which is a 64bit value, which has
> > > been standardized by the IBTA. A new field QoS-Class is also
> > > provided. A new capability bit should describe the SM QoS support in
> > > the SA class port info. This approach provides an easy migration path
> > > for existing access layer and ULPs by not introducing new set of
> > > PR/MPR attribute.
> > 
> > Has any thought been given to how to make this scale?
> > 
> > > 5. CMA features ----------------
> > >
> > > The CMA interface supports Service-ID through the notion of port
> > > space as a prefixes to the port_num which is part of the sockaddr
> > > provided to rdma_resolve_addr(). What is missing is the explicit
> > > request for a QoS-Class that should allow the ULP (like SDP) to
> > > propagate a specific request for a class of service. A mechanism for
> > > providing the QoS-Class is available in the IPv6 address, so we could
> > > use that address field. Another option is to implement a special
> > > connection options API for CMA.
> > >
> > > Missing functionality by CMA is the usage of the provided QoS-Class
> > > and Service-ID in the sent PR/MPR. When a response is obtained it is
> > > an existing requirement for the CMA to use the PR/MPR from the
> > > response in setting up the QP address vector.
> > 
> > The most natural function to specify additional QoS parameters would be
> > rdma_resolve_route.
> > 
> > - Sean
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > 
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From eitan at mellanox.co.il  Wed Jul 25 10:44:39 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 25 Jul 2007 20:44:39 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <20070725001847.GG25264@sashak.voltaire.com>
References: <f0e08f230707240753u496187b8m670e72fbd1f499f8@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>

Hi Sasha 

I am not following you.
Why does a user need to run -y if a simple, legal cable connector is
plugged in?
The issue arises only if a "loopback" plug connecting a port to itself
is plugged in.

Do users use these plugs? For what purpose?


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Wednesday, July 25, 2007 3:19 AM
> To: Eitan Zahavi
> Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> 
> On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> > 
> > 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> > 
> > 		Maybe  avoid the log if -y is provided?
> > 
> > 	 
> > 	That avoids the spew but the duplicated GUID is important to know so
> > IMO something in the "middle" is needed where duplicated GUIDs are
> > logged but not continually the same ones.
> > 	[EZ]
> > 	OK so in -y mode only we track which ones were reported and do not
> > repeat the log?
> 
> And how port moving problem should be solved?
> 
> We cannot ask an user to run OpenSM with '-y' if in her/his 
> plans to reconnect some ports in a future and just decrease logging.
> 
> Sasha
> 


From hal.rosenstock at gmail.com  Wed Jul 25 10:46:31 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 13:46:31 -0400
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
References: <f0e08f230707240730o6d665bb7q2357cfb4a49445aa@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
	<f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
Message-ID: <f0e08f230707251046s2e5811cai1e24402df1bf9b48@mail.gmail.com>

On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
>  **
>
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> >  *Maybe  avoid the log if -y is provided?*
> >
> **
> That avoids the spew but the duplicated GUID is important to know so IMO
> something in the "middle" is needed where duplicated GUIDs are logged but
> not continually the same ones.
> *[EZ]  OK so in -y mode only we track which ones were reported and do not
> repeat the log?
>  *
>
>
Any good ideas on how to accomplish this ?

-- Hal
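
One possible approach, sketched below, is to remember each (node GUID, port number) pair that has already been reported, log it only the first time, and forget it once the port's peer is no longer itself so that a later recurrence is logged again. This is only an illustration and not OpenSM code: the names `dup_guid_tracker`, `dup_tracker_init`, `dup_report_needed` and `dup_clear`, and the fixed table size, are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DUP_TABLE_SIZE 64	/* hypothetical fixed capacity */

struct dup_entry {
	uint64_t node_guid;
	uint8_t port_num;
	bool in_use;
};

struct dup_guid_tracker {
	struct dup_entry entries[DUP_TABLE_SIZE];
};

/* Mark every slot as free. */
void dup_tracker_init(struct dup_guid_tracker *t)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < DUP_TABLE_SIZE; i++)
		t->entries[i].in_use = false;
}

/*
 * Return true if this (guid, port) pair has not been reported yet and
 * record it, so that later sweeps stay quiet about the same pair.
 */
bool dup_report_needed(struct dup_guid_tracker *t,
		       uint64_t node_guid, uint8_t port_num)
{
	struct dup_entry *free_slot = NULL;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < DUP_TABLE_SIZE; i++) {
		struct dup_entry *e = &t->entries[i];

		if (e->in_use && e->node_guid == node_guid &&
		    e->port_num == port_num)
			return false;	/* already logged once */
		if (!e->in_use && free_slot == NULL)
			free_slot = e;
	}
	if (free_slot != NULL) {
		free_slot->node_guid = node_guid;
		free_slot->port_num = port_num;
		free_slot->in_use = true;
	}
	/* If the table is full, fall back to logging every time. */
	return true;
}

/* Forget a pair, e.g. once the port's peer is no longer itself. */
void dup_clear(struct dup_guid_tracker *t,
	       uint64_t node_guid, uint8_t port_num)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < DUP_TABLE_SIZE; i++) {
		struct dup_entry *e = &t->entries[i];

		if (e->in_use && e->node_guid == node_guid &&
		    e->port_num == port_num)
			e->in_use = false;
	}
}
```

The ERR 0D18 log call would then be wrapped in `if (dup_report_needed(...))`, and `dup_clear()` would be called when the link is seen connected to a different peer. Clearing on reconnect is what would keep the port-moving case visible in the log.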

    *Eitan Zahavi***
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > **
> >
> >  ------------------------------
> > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > *Sent:* Tuesday, July 24, 2007 9:56 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> >
> > >  *Hi Hal,*
> > > **
> > > *For many users such a critical failure (one the SM can not really do
> > > anything with) is better aborted then forgotten in some log file.*
> > > *Anyway's the -y flag lets you ignore it if you like.*
> > >
> >
> > So everything else continues to work fine with -y ? In which case, I'm
> > not sure which is the better default.
> >
> > Users certainly won't like their logs filling up with continuous
> > duplicated GUID messages. The log spew should be cleaned up IMO.
> >
> > -- Hal
> >
> >
> >
> >
> >
> > >  *Eitan Zahavi***
> > > Senior Engineering Director, Software Architect
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > >  ------------------------------
> > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > *Sent:* Tuesday, July 24, 2007 9:38 PM
> > > *To:* Eitan Zahavi
> > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >
> > >
> > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > >
> > > >  *Hi Hal,*
> > > > **
> > > > *The code to find "duplicated" GUIDs stem from real user cases where
> > > > flawed *
> > > > *burning procedure caused actual GUID duplications. There is nothing
> > > > "impossible". *
> > > >
> > >
> > > No one said impossible; just a violation of what globally unique (GU
> > > from GUID) really means. It's largely because vendors allowed users to
> > > program non volatile RAM for GUIDs rather than a real manufacturing process
> > > for this which guarantees uniqueness that we are even discussing this aspect
> > > of it.
> > >
> > >  *So it is really critical the the SM will be able to recognize this
> > > > case and abort.*
> > > >
> > >
> > > I agree with the detect part but not the abort part. Why can't it
> > > report these errors and continue on ? That seems better to me than aborting.
> > >
> > > -- Hal
> > >
> > >
> > > > *It might be that for testing someone wants to use a loopback plug
> > > > that cause the same *
> > > > *port GUID appear on both sides of link - but it is better to
> > > > require the user doing the test *
> > > > *to set some flag than to miss such a situation in real life
> > > > cluster.*
> > > > **
> > > > *This requirement was written after many people wasted many hours
> > > > trying to figure out what was going on.*
> > > > *PLEASE DO NOT TAKE IT AWAY*
> > > > **
> > > >
> > > > *Eitan Zahavi***
> > > > Senior Engineering Director, Software Architect
> > > > Mellanox Technologies LTD
> > > > Tel:+972-4-9097208
> > > > Fax:+972-4-9593245
> > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > >
> > > >
> > > >  ------------------------------
> > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > > *Sent:* Tuesday, July 24, 2007 6:04 PM
> > > > *To:* Eitan Zahavi
> > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > > >
> > > >
> > > >
> > > >
> > > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > > >
> > > > >  *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > > > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > > > > *To:* Eitan Zahavi
> > > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > > > >
> > > > >
> > > > >
> > > > > Hi Eitan,
> > > > >
> > > > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > > > >
> > > > > >  *Hi Hal,*
> > > > > > **
> > > > > > *What is this "loopback" connector used for?*
> > > > > > *Does not seem to me like a very useful thing to do.*
> > > > > >
> > > > > **
> > > > > Perhaps not but no reason OpenSM can't handle this more
> > > > > gracefully.
> > > > >
> > > > >  *Anyway, if it is not a production environment we could add a
> > > > > > "debug mode" (-d flag option) to ignore this check.*
> > > > > >
> > > > > **
> > > > > Why would a separate flag be needed ?
> > > > > *[EZ] Since I do not see any other solution for the SM  to know it
> > > > > is really a loop back plug rather then two devices with same GUID connected
> > > > > back to back ... *
> > > > >
> > > > >
> > > > "Technically", this should only occur when looped back and not two
> > > > devices with same GUID as GUID == globally unique and a duplication
> > > > indicates a "manufacturing" issue.
> > > >
> > > > Anyhow, can't these be treated the same (and handled more
> > > > gracefully) without an additional option/flag ?
> > > >
> > > > -- Hal
> > > >
> > > >
> > > > > -- Hal
> > > > >
> > > > >  **
> > > > > >
> > > > > > *Eitan Zahavi***
> > > > > > Senior Engineering Director, Software Architect
> > > > > > Mellanox Technologies LTD
> > > > > > Tel:+972-4-9097208
> > > > > > Fax:+972-4-9593245
> > > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > > >
> > > > > >
> > > > > >  ------------------------------
> > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > > > > *Sent: *Tuesday, July 24, 2007 5:31 PM
> > > > > > *To:* OpenFabrics General
> > > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> > > > > >
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > This is what starts off as a "minor" issue and I know it has
> > > > > > been discussed it somewhat in the past:
> > > > > >
> > > > > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > > > > indicate duplicated GUID error 0D18 as follows:
> > > > > >
> > > > > > __osm_ni_rcv_set_links
> > > > > > {
> > > > > > ...
> > > > > >           /*
> > > > > >              When there are only two nodes with exact same guids
> > > > > > (connected back
> > > > > >              to back) - the previous check for duplicated guid
> > > > > > will not catch
> > > > > >              them. But the link will be from the port to
> > > > > > itself...
> > > > > >              Enhanced Port 0 is an exception to this
> > > > > >           */
> > > > > >           if ((osm_node_get_node_guid( p_node ) ==
> > > > > > p_ni_context->node_guid) &&
> > > > > >               (port_num == p_ni_context->port_num) &&
> > > > > >               (port_num != 0))
> > > > > >           {
> > > > > >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > > > > >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> > > > > >                      "Duplicate GUID found by link from a port
> > > > > > to itself:"
> > > > > >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> > > > > >                      cl_ntoh64( osm_node_get_node_guid( p_node )
> > > > > > ),
> > > > > >                      port_num );
> > > > > > ...
> > > > > >
> > > > > > So this occurs over and over and over and fills the log with the
> > > > > > same spew. This should be improved IMO.
> > > > > >
> > > > > > Is this really a fatal condition ? Doesn't seem like it should
> > > > > > be to me.
> > > > > >
> > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is
> > > > > > that safe for this condition ?
> > > > > >
> > > > > > Seems like something like an extra loopback bit should be added
> > > > > > to some port structure which should cause these links to be ignored. This
> > > > > > bit would then be reset when the peer is now longer itself.
> > > > > >
> > > > > > Also, is there a relationship of this with the 12x/duplicated
> > > > > > GUID code ?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

From xma at us.ibm.com  Wed Jul 25 10:53:42 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 25 Jul 2007 10:53:42 -0700
Subject: [ofa-general] openSM: Different IB MTUs
In-Reply-To: <f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
Message-ID: <OFDC610B0E.6489A119-ON87257323.00616291-88257323.00365996@us.ibm.com>


Hello Hal,

      How does openSM handle CAs with different MTUs in the same subnet?
For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does
openSM pick up the smallest MTU in the subnet?

Thanks
Shirley Ma

From hal.rosenstock at gmail.com  Wed Jul 25 10:53:58 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 13:53:58 -0400
Subject: [ofa-general] osm_physp_calc_link_ops question
Message-ID: <f0e08f230707251053k4e2e26deud1fa3cfacd1374f3@mail.gmail.com>

Hi,

Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and
osm_link_mgr.c:__osm_link_mgr_set_physp_pi call
osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote end is
invalid, the local VLCap is used as the OperationalVLs. When the VLCaps at
the two ends of the link do not match, this is not a good thing. It causes
trap storms on the flow control watchdog timer expiring. Wouldn't it be
better to leave this field as is in this case or would that cause some other
problem ?

Same thing might also be true for link MTU but not as critical.

-- Hal
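
The alternative asked about above could look roughly like the sketch below: use the smaller of the two VLCap values when the remote end is known, and otherwise hand back the current OperationalVLs value unchanged. This is not the existing `osm_physp_calc_link_op_vls` implementation. The function name `calc_link_op_vls` and its parameters are hypothetical, and the sketch assumes the PortInfo encoding in which a larger VLCap code means more VLs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose the OperationalVLs code for a link.
 * All values are PortInfo VL codes, assumed ordered so that a larger
 * code means more VLs.
 */
uint8_t calc_link_op_vls(uint8_t local_vl_cap, bool remote_valid,
			 uint8_t remote_vl_cap, uint8_t current_op_vls)
{
	/* Remote end unknown: leave the field as it is. */
	if (!remote_valid)
		return current_op_vls;

	/* Both ends known: neither side may exceed its own VLCap. */
	return local_vl_cap < remote_vl_cap ? local_vl_cap : remote_vl_cap;
}
```

Whether leaving the field untouched causes some other problem is exactly the open question; the sketch only shows the proposed behaviour.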

From hal.rosenstock at gmail.com  Wed Jul 25 10:57:47 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 13:57:47 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <OFDC610B0E.6489A119-ON87257323.00616291-88257323.00365996@us.ibm.com>
References: <f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<OFDC610B0E.6489A119-ON87257323.00616291-88257323.00365996@us.ibm.com>
Message-ID: <f0e08f230707251057k22f94c44q4c8f34400b21957f@mail.gmail.com>

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>  Hello Hal,
>
> How does openSM handle CAs with different MTUs in the same subnet? For
> example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM
> pick up the smallest MTU in the subnet?
>

Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
MCMemberRecord MTU, or all of these ?

-- Hal

 Thanks
> Shirley Ma
>

From suri at baymicrosystems.com  Wed Jul 25 11:04:14 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Wed, 25 Jul 2007 14:04:14 -0400
Subject: [ofa-general] RE: installing 1.2-GA on Redhat EL5
In-Reply-To: <1185382791.5165.665.camel@firewall.xsintricity.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
	<00dc01c7cedc$6c0f2f90$1914a8c0@surioffice>
	<1185382791.5165.665.camel@firewall.xsintricity.com>
Message-ID: <010001c7cee6$396a3490$1914a8c0@surioffice>

It was a space limitation issue, fixed it...thanks.


-Suri

> -----Original Message-----
> From: Doug Ledford [mailto:dledford at redhat.com]
> Sent: Wednesday, July 25, 2007 1:00 PM
> To: Suresh Shelvapille
> Cc: 'OpenIB'
> Subject: Re: installing 1.2-GA on Redhat EL5
> 
> On Wed, 2007-07-25 at 12:54 -0400, Suresh Shelvapille wrote:
> > Doug:
> >
> > I had ofed-1.2-rc1 installed on a server running redhat el5. I am trying to upgrade the
> > release to ofed-1.2-GA. The build finished and the install gives me this error:
> >
> >
> > --------------
> >
> > Installing OFED software into /usr/local/ofed_1.2
> >
> > Running /bin/rpm -ihv --force --nodeps
> > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
> > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
> > |
> > ERROR: Failed executing "/bin/rpm -ihv --force --nodeps
> > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm
> > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm
> > "
> >
> > -------------------
> >
> > Any ideas...
> 
> Not from this output.  I would need the actual rpm error messages to
> know what's wrong.  Try running the above rpm command by hand and
> copy-n-pasting the errors.
> 
> >
> > Thanks,
> > Suri
> >
> >
> > > -----Original Message-----
> > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf
> > > Of Sean Hefty
> > > Sent: Monday, July 23, 2007 8:23 PM
> > > To: Yevgeny Kliteynik
> > > Cc: OpenIB
> > > Subject: Re: [ofa-general] QoS RFC
> > >
> > > > 2.5. ULPs that use CM interface (like SRP) should have their own
> > > > pre-assigned Service-ID and use it while obtaining PR/MPR for
> > > > establishing connections. The SA receiving the PR/MPR should match it
> > > > against the policy and return the appropriate PR/MPR including SL,
> > > > MTU and RATE.
> > >
> > > We need to ensure that this can work without pre-assigned service IDs,
> > > or at least service IDs that are assigned within a fairly wide range,
> > > such as locally assigned IDs.
> > >
> > > > 2.6. ULPs and programs using CMA to establish RC connection should
> > > > provide the CMA the target IP and Service-ID. Some of the ULPs might
> > > > also provide QoS-Class (E.g. for SDP sockets that are provided the
> > > > TOS socket option). The CMA should then use the provided Service-ID
> > > > and optional QoS-Class and pass them in the PR/MPR request. The
> > > > resulting PR/MPR should be used for configuring the connection QP.
> > >
> > > The interface to the CMA needs to remain as transport independent as
> > > possible, and I am unsure of the transport independence of tying QoS to
> > > the destination port number.  (I'm not disagreeing; I'm just not sure at
> > > the moment it's the right approach.)
> > >
> > > > PathRecord and MultiPathRecord enhancement for QoS: As mentioned
> > > > above the PathRecord and MultiPathRecord attributes should be
> > > > enhanced to carry the Service-ID which is a 64bit value, which has
> > > > been standardized by the IBTA. A new field QoS-Class is also
> > > > provided. A new capability bit should describe the SM QoS support in
> > > > the SA class port info. This approach provides an easy migration path
> > > > for existing access layer and ULPs by not introducing new set of
> > > > PR/MPR attribute.
> > >
> > > Has any thought been given to how to make this scale?
> > >
> > > > 5. CMA features ----------------
> > > >
> > > > The CMA interface supports Service-ID through the notion of port
> > > > space as a prefixes to the port_num which is part of the sockaddr
> > > > provided to rdma_resolve_addr(). What is missing is the explicit
> > > > request for a QoS-Class that should allow the ULP (like SDP) to
> > > > propagate a specific request for a class of service. A mechanism for
> > > > providing the QoS-Class is available in the IPv6 address, so we could
> > > > use that address field. Another option is to implement a special
> > > > connection options API for CMA.
> > > >
> > > > Missing functionality by CMA is the usage of the provided QoS-Class
> > > > and Service-ID in the sent PR/MPR. When a response is obtained it is
> > > > an existing requirement for the CMA to use the PR/MPR from the
> > > > response in setting up the QP address vector.
> > >
> > > The most natural function to specify additional QoS parameters would be
> > > rdma_resolve_route.
> > >
> > > - Sean
> > > _______________________________________________
> > > general mailing list
> > > general at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >
> > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> --
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
> 
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband


From ardavis at ichips.intel.com  Wed Jul 25 11:39:44 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 25 Jul 2007 11:39:44 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46968448.2000401@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>
	<46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
Message-ID: <46A798F0.5070902@ichips.intel.com>


> I would like to propose adding project directories under 
> http://www.openfabrics.org/downloads/  where appropriate and give 
> maintainers access. For example:
>
Jeff, please add the following directories with maintainer access as
follows (or grant access at a maintainer group level):

http://www.openfabrics.org/downloads/verbs (rdreier)
http://www.openfabrics.org/downloads/rdmacm (shefty)
http://www.openfabrics.org/downloads/dapl (ardavis)
http://www.openfabrics.org/downloads/sdp (eitan)
http://www.openfabrics.org/downloads/utils (eitan)
http://www.openfabrics.org/downloads/management (sashak)
http://www.openfabrics.org/downloads/OFED (vlad)
http://www.openfabrics.org/downloads/archives (vlad)
http://www.openfabrics.org/downloads/WinOF (ssmith)   (Stan Smith will 
need an account)
http://www.openfabrics.org/downloads/hw/mthca (rdreier)
http://www.openfabrics.org/downloads/hw/mlx4 (rdreier)
http://www.openfabrics.org/downloads/hw/ehca (raisch)
http://www.openfabrics.org/downloads/hw/ipath (ralphc)
http://www.openfabrics.org/downloads/hw/cxgb3 (ralphc)
http://www.openfabrics.org/downloads/mpi/mvapich (pasha)
http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland)
http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres)

Let us know when these directories are created and the maintainers, who 
want to expose their packages via the webpage, will create a README that 
details the contents of the directory along with WEB_README that 
provides a short description for the webpage.

Will this format allow you to auto configure the download webpage 
sufficiently? The idea is to only add links/descriptions to those 
project sub-directories with WEB_README files present.

Please advise if something on the list is wrong or we missed a project.

Thanks,

-arlin



From michaelc at cs.wisc.edu  Wed Jul 25 11:52:48 2007
From: michaelc at cs.wisc.edu (Mike Christie)
Date: Wed, 25 Jul 2007 13:52:48 -0500
Subject: [ofa-general] Re: [PATCH trivial] include linux/mutex.h from
	scsi_transport_iscsi.h
In-Reply-To: <20070725110907.GF3826@mellanox.co.il>
References: <20070725110907.GF3826@mellanox.co.il>
Message-ID: <46A79C00.200@cs.wisc.edu>

Michael S. Tsirkin wrote:
> scsi/scsi_transport_iscsi.h uses struct mutex, so while
> linux/mutex.h seems to be pulled in indirectly
> by one of the headers it includes, the right thing
> is to include linux/mutex.h directly.
> 

Is that part about always including the header directly right? If so 
then were you going to include list.h too, and were you going to fix up 
some of the other iscsi code?


From xma at us.ibm.com  Wed Jul 25 11:55:06 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 25 Jul 2007 11:55:06 -0700
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <f0e08f230707251057k22f94c44q4c8f34400b21957f@mail.gmail.com>
Message-ID: <OFF8404BF1.1DFC7231-ON87257323.0067529E-88257323.003BF88F@us.ibm.com>


Hal,

      Thanks for your prompt reply. I am asking how openSM handles
different link MTUs in the SA MCMemberRecord MTU. For example, suppose
some links have an MTU of 2K and others an MTU of 1K. When IPoIB is
enabled, how does the SM decide the MCMemberRecord MTU of the IPoIB
broadcast group? When an IB multicast group is first created from a 2K
MTU node, which PMTU value is attached to that group's MCMemberRecord MTU?

Thanks
Shirley Ma


"Hal Rosenstock" <hal.rosenstock at gmail.com>
07/25/07 10:57 AM

To: Shirley Ma/Beaverton/IBM at IBMUS
cc: general at lists.openfabrics.org
Subject: Re: openSM: Different IB MTUs

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
  Hello Hal,

  How does openSM handle CAs with different MTUs in the same subnet? For
  example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM
  pick up the smallest MTU in the subnet?


Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
MCMemberRecord MTU, or all of these ?

-- Hal

  Thanks
  Shirley Ma




From hal.rosenstock at gmail.com  Wed Jul 25 12:01:09 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 15:01:09 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <OFF8404BF1.1DFC7231-ON87257323.0067529E-88257323.003BF88F@us.ibm.com>
References: <f0e08f230707251057k22f94c44q4c8f34400b21957f@mail.gmail.com>
	<OFF8404BF1.1DFC7231-ON87257323.0067529E-88257323.003BF88F@us.ibm.com>
Message-ID: <f0e08f230707251201t89c3f99n6ed22bbf0053a9f@mail.gmail.com>

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>  Hal,
>
> Thanks for your prompt reply. I am asking for how openSM handle different
> link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU
> as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide
> IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast
> group from a 2K MTU node first, which PMTU value is attaching to this IB
> multicast group MCMemberRecord MTU?
>

MCMemberRecord MTU gets the group MTU, which is fixed when the group is
created. It comes either from the first joiner that supplies sufficient
components to create the group, or from the preconfigured value (the MTU
can be set in the config). If a later joiner has insufficient MTU for the
group, its join is denied.

-- Hal


>  Thanks
> Shirley Ma
>
> "Hal Rosenstock" <hal.rosenstock at gmail.com>
> 07/25/07 10:57 AM
> To: Shirley Ma/Beaverton/IBM at IBMUS
> cc: general at lists.openfabrics.org
> Subject: Re: openSM: Different IB MTUs
>
> Shirley,
>
> On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>    Hello Hal,
>
>    How does openSM handle CAs with different MTUs in the same subnet?
>    For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM
>    pick up the smallest MTU in the subnet?
>
>
>
> Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
> MCMemberRecord MTU, or all of these ?
>
> -- Hal
>
>    Thanks
>    Shirley Ma
>
>
>
>

From mst at dev.mellanox.co.il  Wed Jul 25 12:12:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 22:12:57 +0300
Subject: [ofa-general] Re: [PATCH trivial] include linux/mutex.h from
	scsi_transport_iscsi.h
In-Reply-To: <46A79C00.200@cs.wisc.edu>
References: <20070725110907.GF3826@mellanox.co.il> <46A79C00.200@cs.wisc.edu>
Message-ID: <20070725191257.GA2311@mellanox.co.il>

> Quoting Mike Christie <michaelc at cs.wisc.edu>:
> Subject: Re: [PATCH trivial] include linux/mutex.h from scsi_transport_iscsi.h
> 
> Michael S. Tsirkin wrote:
> >scsi/scsi_transport_iscsi.h uses struct mutex, so while
> >linux/mutex.h seems to be pulled in indirectly
> >by one of the headers it includes, the right thing
> >is to include linux/mutex.h directly.
> >
> 
> Is that part about always including the header directly right?

Think so. Analogous patches by me have been accepted in various
subsystems. See e.g. f8916c11a4dc4cb2367e9bee1788f4e0f1b4eabc.

> If so 
> then were you going to include list.h too,

Makes sense. I'll repost.

> and were you going to fix up 
> some of the other iscsi code?

Not at the moment.
I noticed this only because I'm working on another project.
I'll post patches for other files if/when I notice any issues.

-- 
MST


From mst at dev.mellanox.co.il  Wed Jul 25 12:16:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 22:16:00 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for
	rhel-4.5 - mmap functonality
In-Reply-To: <200707251827.57095.hnguyen@linux.vnet.ibm.com>
References: <200707251827.57095.hnguyen@linux.vnet.ibm.com>
Message-ID: <20070725191600.GA29664@mellanox.co.il>

> Quoting Hoang-Nam Nguyen <hnguyen at linux.vnet.ibm.com>:
> Subject: Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality
> 
> Hi Michael,
> Below is the version without conflicts. And it should compile.

Seems to apply fine. I pushed it out. Vlad, can you take it pls?

> As soon as the build scripts are ready, I'll test the whole backport.

What kind of scripts are you waiting for?

-- 
MST


From eaburns at iol.unh.edu  Wed Jul 25 12:22:30 2007
From: eaburns at iol.unh.edu (Ethan Burns)
Date: Wed, 25 Jul 2007 15:22:30 -0400
Subject: [ofa-general] iSER header
In-Reply-To: <46933130.6040100@voltaire.com>
References: <20070709144702.GB24125@postal.iol.unh.edu>
	<46933130.6040100@voltaire.com>
Message-ID: <20070725192230.GA13579@postal.iol.unh.edu>

On Tue, Jul 10, 2007 at 10:11:44AM +0300, Erez Zilber wrote:

[...]

> The iSER header issue was discussed in the open-iscsi list:
>
> http://groups.google.com/group/open-iscsi/browse_thread/thread/23ee18054e8412e6/fd4182f0b141c2da?lnk=gst&q=iSER%2FiWARP+Support+in+version+2.6.20&rnum=1#fd4182f0b141c2da
>
> For some reason, another answer given by Mike Ko does not appear in this
> thread. Here it is:
>
> For Infiniband, if both the initiator and the target support Zero-Based
> Virtual Address, then the iSER header as defined in the IETF draft will
> be used. (Zero-based Virtual Address is used in iWARP but optional to
> implement in Infiniband.) However, if either the initiator or the target
> in an Infiniband environment does not support Zero-Based Virtual
> Address, then the expanded iSER header as defined in the Infiniband
> annex is used. This expanded iSER header is only used in Infiniband.
> There is no intention to provide a link in the IETF draft since this is
> purely an Infiniband issue.
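
The rule quoted above can be written down as a small decision function.
This is only a sketch: the enum and function names below are made up for
illustration and are not taken from the iSER initiator code.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum transport { TRANSPORT_IWARP, TRANSPORT_INFINIBAND };
enum iser_hdr  { ISER_HDR_IETF, ISER_HDR_IB_EXPANDED };

/*
 * Pick the iSER header format as described in the quoted answer:
 * the expanded header is used only on InfiniBand, and only when at
 * least one side lacks Zero-Based Virtual Address (ZBVA) support.
 * On iWARP the ZBVA flags are irrelevant and the IETF header is used.
 */
static enum iser_hdr iser_pick_header(enum transport t,
				      bool initiator_zbva,
				      bool target_zbva)
{
	if (t == TRANSPORT_INFINIBAND && !(initiator_zbva && target_zbva))
		return ISER_HDR_IB_EXPANDED;
	return ISER_HDR_IETF;
}
```

With this reading, iser_pick_header(TRANSPORT_IWARP, ...) always yields
the IETF header, whatever the ZBVA flags are.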

Ok, so this isn't something that I will need to worry a lot about if I am
planning on using iWARP?

> I hope this helps.

It does, thank you.

> BTW - do you plan to use the current iSER initiator
> code for iWARP?

Yes, we are working on an iSER-assisted initiator and target using this code
and the UNH iSCSI implementation.

Thanks again,
Ethan Burns


From mshefty at ichips.intel.com  Wed Jul 25 12:23:39 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jul 2007 12:23:39 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A798F0.5070902@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>	<46956FF9.50102@ichips.intel.com>	<46968448.2000401@ichips.intel.com>
	<46A798F0.5070902@ichips.intel.com>
Message-ID: <46A7A33B.4080201@ichips.intel.com>

> http://www.openfabrics.org/downloads/mpi/mvapich (pasha)
> http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland)
> http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres)

Are all of these MPI versions distributed by OFA?  If they have other 
official sites, should we instead direct users to that site?  Or will 
this be automated enough that people can provide their own links?

- Sean


From eitan at mellanox.co.il  Wed Jul 25 12:25:56 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 25 Jul 2007 22:25:56 +0300
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <f0e08f230707251201t89c3f99n6ed22bbf0053a9f@mail.gmail.com>
References: <f0e08f230707251057k22f94c44q4c8f34400b21957f@mail.gmail.com><OFF8404BF1.1DFC7231-ON87257323.0067529E-88257323.003BF88F@us.ibm.com>
	<f0e08f230707251201t89c3f99n6ed22bbf0053a9f@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>

Hi Shirley,
 
I think I understand where your question comes from...
Many have issues with heterogeneous fabrics where not all nodes have the
same MTU or speed,
especially since IPoIB relies on all nodes joining the broadcast group.
 
The term "join" for multicast groups is a little overloaded.
If a node joins an existing MC group, it has to have a rate (speed *
width) >= MCG.rate and support MTU >= MCG.MTU; otherwise it is denied.
If the join is actually a "create", the node has to provide the rate and
MTU, which then define the MCG values.
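
As a rough illustration of the create/join distinction, here is a
minimal sketch. The struct and function names are hypothetical, and it
assumes "at least" semantics (a port whose rate and MTU equal the
group's is accepted). Real MCMemberRecords carry encoded MTU/rate values
plus selectors, which this sketch ignores.

```c
#include <stdbool.h>

/* Hypothetical, simplified parameters in plain units. */
struct mcg_params {
	unsigned mtu_bytes;	/* e.g. 1024, 2048 */
	unsigned rate_gbps;	/* link speed * width, e.g. 10, 20 */
};

/* "Create": the first joiner's values define the group. */
static struct mcg_params mcg_create(struct mcg_params first_joiner)
{
	return first_joiner;
}

/*
 * "Join" to an existing group: the port must sustain at least the
 * group's rate and MTU, otherwise the request is denied.
 */
static bool mcg_join_allowed(struct mcg_params group,
			     struct mcg_params port)
{
	return port.rate_gbps >= group.rate_gbps &&
	       port.mtu_bytes >= group.mtu_bytes;
}
```

So a group created by a 2K MTU node would refuse a later 1K MTU joiner,
while a group created with 1K MTU would accept both.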
 
To allow the administrator to control the MTU and rate of the IPoIB MCGs,
OpenSM provides the means to set these values per partition. See
doc/partition-config.doc.
Still, the administrator should know the lowest MTU and rate among the
nodes expected to join the IPoIB subnet.
The tradeoff is in the hands of the administrator, who can either set a
value that prevents slow nodes from joining the group, or assign a low
value that fits all nodes but slows down communication ...
 
EZ

Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 

 
________________________________

	From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal
Rosenstock
	Sent: Wednesday, July 25, 2007 10:01 PM
	To: Shirley Ma
	Cc: general at lists.openfabrics.org
	Subject: [ofa-general] Re: openSM: Different IB MTUs
	
	
	Shirley,
	
	
	On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote: 

		Hal,
		
		Thanks for your prompt reply. I am asking for how openSM
handle different link MTUs in SA MCMemberRecord MTU. For example, if we
have some links MTU as 2K, some links MTU as 1K. Then when enabling
IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size?
When creating an IB multicast group from a 2K MTU node first, which PMTU
value is attaching to this IB multicast group MCMemberRecord MTU? 

	 
	MCMemberRecord MTU gets the group MTU (when created). This is
either this first joiner with sufficient components or preconfigured
(and MTU can be set in the config). If a joiner has insufficient MTU for
the group, it is denied. 
	 
	-- Hal
	 

		Thanks
		Shirley Ma
		
		"Hal Rosenstock" <hal.rosenstock at gmail.com>
		07/25/07 10:57 AM
		To: Shirley Ma/Beaverton/IBM at IBMUS
		cc: general at lists.openfabrics.org
		Subject: Re: openSM: Different IB MTUs
		
		Shirley,
		
		On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

			Hello Hal,
			
			How does openSM handle CAs with different MTUs
in the same subnet? For example, IPoIB broadcast group MTU, IB multicast
group PMTU? Does openSM pick up the smallest MTU in the subnet? 


		Are you asking about link MTU, SA
PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ?
		
		-- Hal 
		

			Thanks
			Shirley Ma



From sashak at voltaire.com  Wed Jul 25 12:48:56 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 22:48:56 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
References: <f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
Message-ID: <20070725194856.GB31582@sashak.voltaire.com>

Hi Eitan, Hal,

On 20:44 Wed 25 Jul     , Eitan Zahavi wrote:
> 
> I am not following you.
> Why do a user need to run -y if a simple legal cable connector is
> plugged?

Because the duplicated GUIDs detector can abort OpenSM when a regular port
is reconnected to another location during a hard sweep. This issue is not
related to the loopback plug at all.

> The issue is only if a "loop back" plug connecting a port to itself is
> plugged.

No, not only. There are now two completely separate known issues with
the duplicated GUIDs detector:

1. Port moving
2. Loopback plug

And I think that _both_ should be solved. Just using '-y' could be
suitable for (2), because it is an esoteric (although perfectly legal)
use, but it is not an acceptable solution for (1).

I think we need to improve the GUID duplication detector instead. For
example, we could add a NodeInfo comparison there and report a GUID
duplication error only if the NodeInfo differs. I also think this should
not be a fatal error and should not abort OpenSM; just logging (probably
via syslog too) should be sufficient, since a non-working port is a good
reason to look at the logs. Any other ideas?
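
To make the NodeInfo comparison idea concrete, here is a rough sketch.
Everything in it is hypothetical (the trimmed-down struct, the verdict
names, the function); it is not OpenSM code, only an illustration of
treating "same GUID, same NodeInfo" as a moved port and "same GUID,
different NodeInfo" as a real duplicate that gets logged rather than
aborting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down NodeInfo; the real attribute has more fields. */
struct node_info {
	uint64_t node_guid;
	uint64_t port_guid;
	uint8_t  node_type;
	uint8_t  num_ports;
	uint32_t vendor_id;
	uint16_t device_id;
};

enum guid_verdict {
	GUID_NEW,		/* GUID not seen before */
	GUID_SAME_NODE_MOVED,	/* same GUID, same NodeInfo: port was moved */
	GUID_DUPLICATE		/* same GUID, different NodeInfo: log an error */
};

/* 'known' is the record already held for this port GUID, or NULL. */
static enum guid_verdict check_guid(const struct node_info *known,
				    const struct node_info *seen)
{
	if (known == NULL || known->port_guid != seen->port_guid)
		return GUID_NEW;

	/* Compare field by field; memcmp() could trip over struct padding. */
	if (known->node_guid == seen->node_guid &&
	    known->node_type == seen->node_type &&
	    known->num_ports == seen->num_ports &&
	    known->vendor_id == seen->vendor_id &&
	    known->device_id == seen->device_id)
		return GUID_SAME_NODE_MOVED;

	return GUID_DUPLICATE;
}
```

The caller would log GUID_DUPLICATE and carry on with the sweep instead
of exiting. Note that this comparison alone cannot tell a moved port from
a loopback plug, which would still need separate handling.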

Sasha

> Do users use these plugs? For what sake?
> 
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> > Sent: Wednesday, July 25, 2007 3:19 AM
> > To: Eitan Zahavi
> > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> > Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> > 
> > On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> > > 
> > > 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> > > 
> > > 		Maybe  avoid the log if -y is provided?
> > > 
> > > 	 
> > > 	That avoids the spew but the duplicated GUID is 
> > important to know so 
> > > IMO something in the "middle" is needed where duplicated GUIDs are 
> > > logged but not continually the same ones.
> > > 	[EZ]  
> > > 	OK so in -y mode only we track which ones were reported 
> > and do not 
> > > repeat the log?
> > 
> > And how port moving problem should be solved?
> > 
> > We cannot ask an user to run OpenSM with '-y' if in her/his 
> > plans to reconnect some ports in a future and just decrease logging.
> > 
> > Sasha
> > 


From xma at us.ibm.com  Wed Jul 25 12:45:17 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 25 Jul 2007 12:45:17 -0700
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
Message-ID: <OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>


Hello Eitan, Hal,

      Thanks. It's good that openSM has a configuration option to set these
attributes for the MC group. Would it be a good idea to add the following
to openSM: when no MTU is defined in the configuration file, the SM picks
the smallest link MTU in the fabric by default? MTU is unlike rate, where a
slower rate might indicate a cabling problem. So using the smallest link
MTU in the fabric might not be a bad default for MC groups. The reason for
my request is that when an IP multicast group is created, MTU is not an
attribute of the group. When IP multicast is mapped to IB multicast, the IB
multicast join might fail because of different IB link MTU sizes in the
group, but the IP multicast group creation will succeed without any
knowledge of the failure. If the admin sets the MTU in the configuration
file, the admin would know about this failure. Otherwise, admins and users
could spend too much time debugging their broken multicast applications.
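
A minimal sketch of the default I have in mind is below. The function
names are hypothetical and this is not existing openSM behavior; the
only thing taken as given is the usual IB MTU encoding (1 = 256,
2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes).

```c
#include <stddef.h>

/* Convert an IB MTU code (1..5) to bytes; returns 0 for invalid codes. */
static unsigned ib_mtu_enum_to_bytes(unsigned mtu_enum)
{
	return (mtu_enum >= 1 && mtu_enum <= 5) ? 128u << mtu_enum : 0;
}

/*
 * Hypothetical helper: choose the MTU code for an MC group.
 * configured_mtu == 0 means "not set in the configuration file";
 * in that case fall back to the smallest link MTU seen in the fabric.
 * Returns 0 if nothing is configured and there are no links.
 */
static unsigned default_group_mtu(unsigned configured_mtu,
				  const unsigned *link_mtus, size_t n)
{
	unsigned min = 0;
	size_t i;

	if (configured_mtu)
		return configured_mtu;

	for (i = 0; i < n; i++)
		if (link_mtus[i] && (min == 0 || link_mtus[i] < min))
			min = link_mtus[i];
	return min;
}
```

For a fabric with link MTU codes {4, 3, 4} and nothing configured, this
picks code 3 (1024 bytes), so every node could join the group.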

Thanks
Shirley Ma


"Eitan Zahavi" <eitan at mellanox.co.il>
07/25/07 12:25 PM
To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
cc: <general at lists.openfabrics.org>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hi Shirley,

I think I understand where your question comes from...
Many have issue with heterogonous fabrics where not all nodes have same MTU
or Speed.
Especially when IPoIB relies on all nodes joining the broadcast group.

The term "join" for multicast groups is a little overloaded.
If a node joins an existing MC group it has to have a rate (speed * width)
> MCG.rate and support MTU > MCG.MTU otherwise it is denied.
If the join is actually a "create" the node has to provide the rate and MTU
which define the MCG values.

To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM
provides the means to control these
values per partition. See the doc/partition-config.doc
Still the administrator should know what would be the lowest MTU and rate
the nodes expected to join the IPoIB subnet have.
The tradeoff is in the hands of the administrator who can set a value that
will prevent slow nodes from joining the group,
or assign a low value that will fit all nodes but slow down communication
...

EZ


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


 From: general-bounces at lists.openfabrics.org
 [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
 Sent: Wednesday, July 25, 2007 10:01 PM
 To: Shirley Ma
 Cc: general at lists.openfabrics.org
 Subject: [ofa-general] Re: openSM: Different IB MTUs

 Shirley,

 On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
  Hal,

  Thanks for your prompt reply. I am asking for how openSM handle different
  link MTUs in SA MCMemberRecord MTU. For example, if we have some links
  MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM
  decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB
  multicast group from a 2K MTU node first, which PMTU value is attaching
  to this IB multicast group MCMemberRecord MTU?


 MCMemberRecord MTU gets the group MTU (when created). This is either this
 first joiner with sufficient components or preconfigured (and MTU can be
 set in the config). If a joiner has insufficient MTU for the group, it is
 denied.

 -- Hal


  Thanks
  Shirley Ma

  "Hal Rosenstock" <hal.rosenstock at gmail.com>
  07/25/07 10:57 AM
  To: Shirley Ma/Beaverton/IBM at IBMUS
  cc: general at lists.openfabrics.org
  Subject: Re: openSM: Different IB MTUs

  Shirley,

  On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote:
        Hello Hal,

        How does openSM handle CAs with different MTUs in the same subnet?
        For example, IPoIB broadcast group MTU, IB multicast group PMTU?
        Does openSM pick up the smallest MTU in the subnet?


  Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
  MCMemberRecord MTU, or all of these ?

  -- Hal
        Thanks
        Shirley Ma



From jsquyres at cisco.com  Wed Jul 25 13:11:45 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 25 Jul 2007 16:11:45 -0400
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <46A7A33B.4080201@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>	<46956FF9.50102@ichips.intel.com>	<46968448.2000401@ichips.intel.com>
	<46A798F0.5070902@ichips.intel.com>
	<46A7A33B.4080201@ichips.intel.com>
Message-ID: <8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>

Heh -- I didn't notice these links until Sean moved them up to the  
top of the text.

Yes, we should definitely link to the MPI project home sites; we have  
lots of our own information there, separate downloads, etc.

On Jul 25, 2007, at 3:23 PM, Sean Hefty wrote:

>> http://www.openfabrics.org/downloads/mpi/mvapich (pasha)
>> http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland)
>> http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres)
>
> Are all of these MPI versions distributed by OFA?  If they have  
> other official sites, should we instead direct users to that site?   
> Or will this be automated enough that people can provide their own  
> links?
>
> - Sean


-- 
Jeff Squyres
Cisco Systems


From sashak at voltaire.com  Wed Jul 25 13:22:09 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 23:22:09 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT
	re-configuration
In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
Message-ID: <20070725202209.GC31582@sashak.voltaire.com>

On 20:04 Tue 24 Jul     , Sasha Khapyorsky wrote:
> On 07:56 Tue 24 Jul     , Eitan Zahavi wrote:
> > > On 20:59 Mon 23 Jul     , Eitan Zahavi wrote:
> > > > Hi Sasha, Hal,
> > > >  
> > > > I think I have an idea:
> > > >  
> > > > Since this is a specific switch that reported ChangeBit or Trap why 
> > > > can't we just qualify that there was no change in the switch setup?
> > > 
> > > The ChangeBit seems to be good start point - then OpenSM will 
> > > query all switch ports PortInfo anyway and if for all ports 
> > > PortState is <= INIT (and at least for one port it is = 
> > > INIT), it means that this switch was rebooted/reinitialized.
> > > 
> > > And for single port PortState drop to = INIT should indicate 
> > > reinitialization.
> > > 
> > > Seems correct?
> > Yes.
> > > 
> > > > We could send PortInfo, SwitchInfo,
> > > 
> > > SwitchInfo is queried at each light sweep, PortInfo's if 
> > > ChangeBit is set. Guess we are ok with it even now.
> > I will double check that...
> > Well - even setting one port state to INIT did not cause the switch to
> > be reconfigured.
> > Seems the code does not enforce this condition yet.
> > > 
> > > > LFT, MFT, SL2VL, VLArb, PKey queries
> > > > and make sure no change from previous state. Or we could simply 
> > > > enforce last state by sending it over again ...
> > > 
> > > I think we could want to re-read PKey tables in order to 
> > > preserve existing PKey indices and just to flush (overwrite 
> > > with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable?
> > Correct.
> 
> Ok, I will prepare patches. I think about separate patches for switches
> and ports. Also likely MFT should be handled separately, since we don't
> do incremental update there yet.

There is another case where data could be modified externally: when the
master SM loses mastership and after some time gets it back. Then the data
should be flushed too. A patch will follow shortly.

Sasha


From sashak at voltaire.com  Wed Jul 25 13:24:18 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 23:24:18 +0300
Subject: [ofa-general] pkey.sim.tcl (was: [PATCH] opensm: detect port
	external reset and flush cached tables)
In-Reply-To: <20070724215441.GA25264@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
Message-ID: <20070725202418.GD31582@sashak.voltaire.com>

Hi Eitan, Yevgeny,


On 00:54 Wed 25 Jul     , Sasha Khapyorsky wrote:
> 
> This detects port external reset by validating PortState == INIT, and
> when detected flushes cached port related tables - re-reads pkey table
> and drops (overwrites) SL2VL and VLArb tables.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

[snip...]
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
> index 6fe2d1d..0528e38 100644
> --- a/opensm/opensm/osm_port_info_rcv.c
> +++ b/opensm/opensm/osm_port_info_rcv.c
> @@ -801,6 +801,12 @@ osm_pi_rcv_process(
>        p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
>      }
>  
> +    /* if port just inited or reached INIT state (external reset)
> +       request update for port related tables */
> +    p_physp->need_update =
> +      (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
> +       p_physp->need_update > 1 ) ? 1 : 0;
> +
>      switch( osm_node_get_type( p_node ) )
>      {
>      case IB_NODE_TYPE_CA:
> @@ -824,7 +830,8 @@ osm_pi_rcv_process(
>      /*
>        Get the tables on the physp.
>      */
> -    __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
> +    if (p_physp->need_update)
> +      __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
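
For readers following the need_update logic in the hunk above, here is a
stand-alone sketch of how the flag behaves. The types and constants are
simplified stand-ins, not the OpenSM ones, and it assumes (as the "port
just inited" comment suggests, but the snipped part of the patch would
have to confirm) that a freshly created port object starts with a value
greater than 1.

```c
/* IB PortState values: 1 = Down, 2 = Initialize, 3 = Armed, 4 = Active. */
enum { LINK_DOWN = 1, LINK_INIT = 2, LINK_ARMED = 3, LINK_ACTIVE = 4 };

/* Simplified stand-in for the physical port object. */
struct physp {
	unsigned need_update;	/* assumed to start at 2 for a new port */
};

/*
 * Mirrors the expression from the patch: request a table refresh if the
 * port is in INIT (external reset) or was only just created; otherwise
 * clear the flag. Returns nonzero when the PKey/SL2VL/VLArb tables
 * should be fetched again.
 */
static int port_info_received(struct physp *p, int port_state)
{
	p->need_update =
	    (port_state == LINK_INIT || p->need_update > 1) ? 1 : 0;
	return p->need_update;
}
```

In this sketch a new port triggers one refresh on its first PortInfo,
none on later sweeps while it stays ACTIVE, and one again whenever it is
seen back in INIT.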

When testing this patch, I tried it with ibmgtsim and test failed:

  RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl

The failure results from the port pkey table modifications performed in
pkey.sim.tcl. Why do we do this? Is it a legal scenario for pkey tables
to be modified externally, without the Partition Manager?

Sasha


From sashak at voltaire.com  Wed Jul 25 13:37:31 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 23:37:31 +0300
Subject: [ofa-general] [PATCH] opensm: handle port and switch tables update
	over handover
In-Reply-To: <20070725202209.GC31582@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070725202209.GC31582@sashak.voltaire.com>
Message-ID: <20070725203731.GE31582@sashak.voltaire.com>


This takes care not to use cached port and switch related table data
(PKey, SL2VL, VLArb, LFT) after SM handover.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_subnet.h |    5 +++++
 opensm/opensm/osm_port_info_rcv.c  |    2 +-
 opensm/opensm/osm_qos.c            |   30 +++++++++++++++++++++++-------
 opensm/opensm/osm_state_mgr.c      |    5 +++++
 4 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index fce6b52..60dc2ff 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -579,6 +579,7 @@ typedef struct _osm_subn
   boolean_t                moved_to_master_state;
   boolean_t                first_time_master_sweep;
   boolean_t                coming_out_of_standby;
+  unsigned                 need_update;
 } osm_subn_t;
 /*
 * FIELDS
@@ -717,6 +718,10 @@ typedef struct _osm_subn
 *     The flag is set true if the SM state was standby and now changed to MASTER
 *     it is reset at the end of the sweep.
 *
+*  need_update
+*     This flag should be on during the first non-master heavy sweep
+*     (including the pre-master discovery stage)
+*
 * SEE ALSO
 *	Subnet object
 *********/
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 0528e38..3965b88 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -830,7 +830,7 @@ osm_pi_rcv_process(
     /*
       Get the tables on the physp.
     */
-    if (p_physp->need_update)
+    if (p_physp->need_update || p_rcv->p_subn->need_update)
       __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
 
   }
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index 596b6d4..c9ca9d8 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -70,6 +70,7 @@ static void qos_build_config(struct qos_config * cfg,
 static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req,
 						osm_physp_t * p,
 						uint8_t port_num,
+						unsigned force_update,
 						const ib_vl_arb_table_t *table_block,
 						unsigned block_length,
 						unsigned block_num)
@@ -87,7 +88,7 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req,
 	for (i = 0; i < block_length; i++)
 		block.vl_entry[i].vl &= vl_mask;
 
-	if (!p->need_update &&
+	if (!force_update &&
 	    !memcmp(&p->vl_arb[block_num], &block,
 		    block_length * sizeof(block.vl_entry[0])))
 		return IB_SUCCESS;
@@ -106,6 +107,7 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req,
 
 static ib_api_status_t vlarb_update(osm_req_t * p_req,
 				    osm_physp_t * p, uint8_t port_num,
+				    unsigned force_update,
 				    const struct qos_config *qcfg)
 {
 	ib_api_status_t status = IB_SUCCESS;
@@ -116,6 +118,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 		len = p_pi->vl_arb_low_cap < IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK ?
 		    p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
+						       force_update,
 						       &qcfg->vlarb_low[0],
 						       len, 0)) != IB_SUCCESS)
 			return status;
@@ -123,6 +126,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 	if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
 		len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
+						       force_update,
 						       &qcfg->vlarb_low[1],
 						       len, 1)) != IB_SUCCESS)
 			return status;
@@ -131,6 +135,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 		len = p_pi->vl_arb_high_cap < IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK ?
 		    p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
+						       force_update,
 						       &qcfg->vlarb_high[0],
 						       len, 2)) != IB_SUCCESS)
 			return status;
@@ -138,6 +143,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 	if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) {
 		len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK;
 		if ((status = vlarb_update_table_block(p_req, p, port_num,
+						       force_update,
 						       &qcfg->vlarb_high[1],
 						       len, 3)) != IB_SUCCESS)
 			return status;
@@ -149,6 +155,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req,
 static ib_api_status_t sl2vl_update_table(osm_req_t * p_req,
 					  osm_physp_t * p, uint8_t in_port,
 					  uint8_t out_port,
+					  unsigned force_update,
 					  const ib_slvl_table_t * sl2vl_table)
 {
 	osm_madw_context_t context;
@@ -171,7 +178,7 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req,
 		tbl.raw_vl_by_sl[i] = (vl1 << 4 ) | vl2 ;
 	}
 
-	if (!p->need_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) &&
+	if (!force_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) &&
 	    !memcmp(p_tbl, &tbl, sizeof(tbl)))
 		return IB_SUCCESS;
 
@@ -187,6 +194,7 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req,
 
 static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 				    osm_physp_t * p, uint8_t port_num,
+				    unsigned force_update,
 				    const struct qos_config *qcfg)
 {
 	ib_api_status_t status;
@@ -209,7 +217,8 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 
 	for (i = 0; i < num_ports; i++) {
 		status =
-		    sl2vl_update_table(p_req, p, i, port_num, &qcfg->sl2vl);
+		    sl2vl_update_table(p_req, p, i, port_num,
+				       force_update, &qcfg->sl2vl);
 		if (status != IB_SUCCESS)
 			return status;
 	}
@@ -220,6 +229,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port,
 static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
 				       osm_port_t * p_port, osm_physp_t * p,
 				       uint8_t port_num,
+				       unsigned force_update,
 				       const struct qos_config *qcfg)
 {
 	ib_api_status_t status;
@@ -230,7 +240,7 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
 	p->vl_high_limit = qcfg->vl_high_limit;
 
 	/* setup VLArbitration */
-	status = vlarb_update(p_req, p, port_num, qcfg);
+	status = vlarb_update(p_req, p, port_num, force_update, qcfg);
 	if (status != IB_SUCCESS) {
 		osm_log(p_log, OSM_LOG_ERROR,
 			"qos_physp_setup: ERR 6202 : "
@@ -241,7 +251,7 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req,
 	}
 
 	/* setup SL2VL tables */
-	status = sl2vl_update(p_req, p_port, p, port_num, qcfg);
+	status = sl2vl_update(p_req, p_port, p, port_num, force_update, qcfg);
 	if (status != IB_SUCCESS) {
 		osm_log(p_log, OSM_LOG_ERROR,
 			"qos_physp_setup: ERR 6203 : "
@@ -265,6 +275,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 	osm_physp_t *p_physp;
 	osm_node_t *p_node;
 	ib_api_status_t status;
+	unsigned force_update;
 	uint8_t i;
 
 	if (p_osm->subn.opt.no_qos)
@@ -296,9 +307,12 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 				p_physp = osm_node_get_physp_ptr(p_node, i);
 				if (!osm_physp_is_valid(p_physp))
 					continue;
+				force_update = p_physp->need_update ||
+					p_osm->subn.need_update;
 				status =
 				    qos_physp_setup(&p_osm->log, &p_osm->sm.req,
-						    p_port, p_physp, i, &swe_config);
+						    p_port, p_physp, i,
+						    force_update, &swe_config);
 			}
 			/* skip base port 0 */
 			if (!ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info))
@@ -314,8 +328,10 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm)
 		if (!osm_physp_is_valid(p_physp))
 			continue;
 
+		force_update = p_physp->need_update || p_osm->subn.need_update;
 		status = qos_physp_setup(&p_osm->log, &p_osm->sm.req,
-					 p_port, p_physp, 0, cfg);
+					 p_port, p_physp, 0,
+					 force_update, cfg);
 	}
 
 	cl_plock_release(&p_osm->lock);
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 7efbe2a..a15f3b4 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1938,6 +1938,10 @@ osm_state_mgr_process(
                          "osm_state_mgr_process: ERR 331A: "
                          "osm_subn_rescan_conf_file failed\n" );
                }
+
+	       if (p_mgr->p_subn->sm_state != IB_SMINFO_STATE_MASTER)
+	          p_mgr->p_subn->need_update = 1;
+
                status = __osm_state_mgr_sweep_hop_0( p_mgr );
                if( status == IB_SUCCESS )
                {
@@ -2742,6 +2746,7 @@ Idle:
                {
                   p_mgr->p_subn->first_time_master_sweep = FALSE;
                }
+               p_mgr->p_subn->need_update = 0;
 
                __osm_topology_file_create( p_mgr );
                __osm_state_mgr_report( p_mgr );
-- 
1.5.3.rc2.29.gc4640f
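
For readers following the diff, here is a minimal sketch of the decision the patch introduces. An update is forced when either the per-port flag or the new subnet-wide need_update flag is set; only otherwise is the cached table compared against the wanted one. The struct and function names below are simplified, hypothetical stand-ins and not the real osm_subn_t / osm_physp_t types.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for osm_subn_t and osm_physp_t. */
struct subnet_sketch {
	unsigned need_update;		/* set during non-master heavy sweeps */
};

struct physp_sketch {
	unsigned need_update;		/* per-port "cache is stale" flag */
	unsigned char sl2vl[8];		/* cached copy of the port's table */
};

/*
 * Returns 1 if the table has to be (re)sent to the port, 0 if the
 * cached copy already matches and no update is forced.
 */
static int sl2vl_needs_send(const struct subnet_sketch *subn,
			    const struct physp_sketch *p,
			    const unsigned char wanted[8])
{
	unsigned force_update = p->need_update || subn->need_update;

	assert(wanted != NULL);

	/* Trust the cache only when nothing forces an update. */
	if (!force_update && !memcmp(p->sl2vl, wanted, sizeof(p->sl2vl)))
		return 0;
	return 1;
}
```

With both flags clear and an identical cached table the helper returns 0; setting the subnet flag makes it return 1 even though the cache matches, which is the behaviour wanted after a handover.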


From harms at alcf.anl.gov  Wed Jul 25 13:37:29 2007
From: harms at alcf.anl.gov (Kevin Harms)
Date: Wed, 25 Jul 2007 15:37:29 -0500
Subject: [ofa-general] srp_daemon
Message-ID: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov>


	In /var/log/srp_daemon.log I see errors of the following ilk:

5/06/07 15:35:47 : No response to inform info registration
25/06/07 15:35:47 : Fail to register to traps, maybe there is no  
opensm running on fabric
25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running

	opensm is running. Anyone know what to look at to debug this?

thanks,
kevin harms


From ardavis at ichips.intel.com  Wed Jul 25 13:53:30 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 25 Jul 2007 13:53:30 -0700
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>	<46956FF9.50102@ichips.intel.com>	<46968448.2000401@ichips.intel.com>
	<46A798F0.5070902@ichips.intel.com>
	<46A7A33B.4080201@ichips.intel.com>
	<8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>
Message-ID: <46A7B84A.6030305@ichips.intel.com>

Jeff Squyres wrote:

> Heh -- I didn't notice these links until Sean moved them up to the  
> top of the text.
>
> Yes, we should definitely link to the MPI project home sites; we have  
> lots of our own information there, separate downloads, etc.

Are these the links we want?

MVAPICH - http://mvapich.cse.ohio-state.edu/
OpenMPI - http://www.open-mpi.org/


From panda at cse.ohio-state.edu  Wed Jul 25 13:57:42 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed, 25 Jul 2007 16:57:42 -0400 (EDT)
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <46A7B84A.6030305@ichips.intel.com> from "Arlin Davis" at Jul 25,
	2007 01:53:30 PM
Message-ID: <200707252057.l6PKvhHd017075@xi.cse.ohio-state.edu>

> 
> Jeff Squyres wrote:
> 
> > Heh -- I didn't notice these links until Sean moved them up to the  
> > top of the text.
> >
> > Yes, we should definitely link to the MPI project home sites; we have  
> > lots of our own information there, separate downloads, etc.
> 
> Are these the links we want?
> 
> MVAPICH - http://mvapich.cse.ohio-state.edu/

Yes, this link is correct. Please add this link for both MVAPICH and
MVAPICH2.

Thanks, 

DK

> OpenMPI - http://www.open-mpi.org/
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From sashak at voltaire.com  Wed Jul 25 14:04:33 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Jul 2007 00:04:33 +0300
Subject: [ofa-general] srp_daemon
In-Reply-To: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov>
References: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov>
Message-ID: <20070725210433.GG31582@sashak.voltaire.com>

On 15:37 Wed 25 Jul     , Kevin Harms wrote:
> 
>  	in the /var/log/srp_daemon.log i see errors of the following ilk:
> 
>  5/06/07 15:35:47 : No response to inform info registration
>  25/06/07 15:35:47 : Fail to register to traps, maybe there is no opensm 
>  running on fabric
>  25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running
> 
>  	opensm is running. Anyone know what to look at to debug this?

Check by running sminfo on the SRP daemon node. If everything is fine,
look at the OpenSM log for errors.

Sasha


From sashak at voltaire.com  Wed Jul 25 14:10:59 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Jul 2007 00:10:59 +0300
Subject: [ofa-general] Re: osm_physp_calc_link_ops question
In-Reply-To: <f0e08f230707251053k4e2e26deud1fa3cfacd1374f3@mail.gmail.com>
References: <f0e08f230707251053k4e2e26deud1fa3cfacd1374f3@mail.gmail.com>
Message-ID: <20070725211059.GH31582@sashak.voltaire.com>

Hi Hal,

On 13:53 Wed 25 Jul     , Hal Rosenstock wrote:
> 
>  Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and
>  osm_link_mgr.c:__osm_link_mgr_set_physp_pi call
>  osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote end is
>  invalid, the local VLCap is used as the OperationalVLs. When the VLCaps at
>  the two ends of the link do not match, this is not a good thing. It causes
>  trap storms on the flow control watchdog timer expiring. Wouldn't it be
>  better to leave this field as is in this case or would that cause some other
>  problem ?
> 
>  Same thing might also be true for link MTU but not as critical.

Looks like a good idea to me. Would you care to send a patch?
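
A minimal sketch of the proposed behaviour, assuming the usual PortInfo encoding of VLCap/OperationalVLs (1 = VL0, 2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14), so that the smaller code is the narrower capability. The function name and parameters are hypothetical; this is not the existing osm_physp_calc_link_op_vls().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose OperationalVLs for one end of a link.
 * - remote end valid:   use the smaller of the two VLCap codes
 * - remote end invalid: leave the currently programmed value untouched
 *   instead of falling back to the local VLCap
 */
static uint8_t calc_link_op_vls_sketch(uint8_t local_vl_cap,
				       int remote_valid,
				       uint8_t remote_vl_cap,
				       uint8_t current_op_vls)
{
	assert(local_vl_cap >= 1 && local_vl_cap <= 5);

	if (!remote_valid)
		return current_op_vls;

	return local_vl_cap < remote_vl_cap ? local_vl_cap : remote_vl_cap;
}
```

With a local VLCap of 4 and a valid remote VLCap of 2 this yields 2; with an invalid remote end it returns whatever OperationalVLs value the port already had.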

Sasha


From sashak at voltaire.com  Wed Jul 25 15:02:04 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Jul 2007 01:02:04 +0300
Subject: [ofa-general] Lost in-service traps during Open SM migration
In-Reply-To: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
References: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
Message-ID: <20070725220204.GI31582@sashak.voltaire.com>

Hi Lan,

On 09:57 Wed 25 Jul     , lbt wrote:
>  Hello,
> 
>  I have been seeing a problem where a subscriber for in-service traps is not
>  getting informed when the port of master openSM is restored (i.e. causing an
>  SM migration).
> 
>  I have an IB subnet with 2 nodes running OpenSM , different priorities of
>  course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet
>  that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC
>  trap events. I've been doing cable pull tests on the IB ports, to check if
>  the in-service handler I have subscribed gets invoked when I restore the
>  cable. I've noticed that everything works as expected ( i.e. my in-service
>  handler is invoked) whenever I restore the cable on the lower priority SM IB
>  port without ever touching the master SM port. But if I cause an SM
>  migration, by restoring the port of the higher priority SM, the in-service
>  trap does not get generated as expected on a cable restore.
> 
>  Steps to Reproduce:
>  1) Start with port to higher priority SM disconnected.
>  2) restore port cable on the higher priority SM
>  --> This causes an SM Migration as expected, SM's migration happens okay
>  --> I expected the restoration of the higher priority SM port to also
>  trigger an in-service trap and notify subscribers, but that doesn't
>  occur
> 
>  I have collected debug messages log for both open SM's, and it appears that
>  the reason is because:
>  1) in-service traps are generated based on what ports are added on the
>  Master SM's new_ports_list, but these traps are generated only after LID
>  assignment
>  2) when the higher priority SM port is restored, the restored port gets
>  added to the lower priority SM's new_ports_list (since it's still the Master
>  SM at that point in time)
>  3) the handover of Master SM from the lower priority SM to the higher
>  priority SM occurs before LID assignment, and thus before traps have a
>  chance to be generated for the ports on new_ports_list
>  4) the higher priority SM is now Master SM, but it has an empty
>  new_ports_list, so no trap generated either
> 
>  Does this look like a legitimate Open SM bug? Any feedback would be much
>  appreciated, and if I can help further in any way please let me know .

As far as I know, when OpenSM (even one as old as 2.0.5) becomes master it
requests clients to reregister their SA-related registrations (by setting
the client reregistration bit in PortInfo).

Probably your port doesn't support this (you could verify by checking
PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
maybe your host stack doesn't do reregistration?

Anyway, you could check in the OpenSM code, in osm_lid_mgr.c
__osm_lid_mgr_set_physp_pi(), whether the client reregistration bit is set
(with ib_port_info_set_client_rereg()) or not. Then we will know more
about this problem.
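
As a rough illustration of the CapabilityMask check: the sketch below tests the client reregistration capability in a mask given in host byte order (for example, the hex value printed by smpquery). The bit position (bit 25, i.e. 0x02000000) is an assumption taken from the IB_PORT_CLIENT_REG_SUP definition in the Linux kernel headers; the macro and function names here are made up for the example.

```c
#include <stdint.h>

/* Assumed position of IsClientReregistrationSupported: bit 25. */
#define CAP_CLIENT_REREG_SKETCH (UINT32_C(1) << 25)

/*
 * Hypothetical helper: returns 1 if the port advertises support for
 * client reregistration, 0 otherwise. The mask must be in host order.
 */
static int port_supports_client_rereg(uint32_t capability_mask)
{
	return (capability_mask & CAP_CLIENT_REREG_SKETCH) != 0;
}
```

If the bit is clear on the client port, the SM has no way to ask that port to reregister, and the subscription would have to be renewed by other means.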

Sasha

> 
> 
>  Subset of logs from lower priority SM during the cable restore of higher
>  priority SM port:
>  ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request:
>  Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A
>  TID:0x00000016000012e1
>  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received
>  signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
>  ### 14:31:56 ******************** INITIATING HEAVY SWEEP
>  **********************
>  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received
>  signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
>  OSM_SM_STATE_SWEEP_HEAVY_SELF
>  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port
>  GUID:0x00504501483e0000 to new_ports_list
>  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal
>  OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
>  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal
>  OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
>  14:31:56 ********************* HEAVY SWEEP COMPLETE ***********************
>  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received
>  signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER###
>  14:31:56 ******************** ENTERING SM STANDBY STATE *******************
> 
>  Subset of logs from higher priority SM during the cable restore of higher
>  priority SM port:
> 
>  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
>  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received
>  signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
>  IB_SMINFO_STATE_DISCOVERING
>  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
>  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
>  ******************** ENTERING SM MASTER STATE ********************
>  Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg:
>  **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
>  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
>  ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
>  Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [
>  ----> no in-service traps are generated and notices forwarded because there
>  are no ports on this list
>  Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ]
> 
> 
>  Thanks!
>  Lan



From becker at nas.nasa.gov  Wed Jul 25 15:02:25 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Wed, 25 Jul 2007 15:02:25 -0700
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
	<A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
Message-ID: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>

Hi all. I think I fixed this. If you go to http://git.openfabrics.org,
it redirects to http://www.openfabrics.org/git. I also fixed the link
on the developer resources page. This is my first experience with
apache redirects so if you see anything wrong, or have any
suggestions, don't hesitate to send me mail. Thanks.

-jeff

On 7/11/07, Jeff Squyres <jsquyres at cisco.com> wrote:
> Just a ping again to make sure that this request doesn't get lost...
>
> On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote:
>
> > I notice that http://git.openfabrics.org/ shows the main OFA web
> > site, but http://git.openfabrics.org/git/ shows all the git
> > repositories.
> >
> > Can a redirect be installed such that http://git.openfabrics.org/
> > is automatically sent to http://git.openfabrics.org/git/?
> >
> > I think that would be a little more intuitive.
> >
> > Thanks!
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>


From hal.rosenstock at gmail.com  Wed Jul 25 15:26:47 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 18:26:47 -0400
Subject: [ofa-general] srp_daemon
In-Reply-To: <20070725210433.GG31582@sashak.voltaire.com>
References: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov>
	<20070725210433.GG31582@sashak.voltaire.com>
Message-ID: <f0e08f230707251526m7e8e4d14na41aae80ea7ccda0@mail.gmail.com>

On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 15:37 Wed 25 Jul     , Kevin Harms wrote:
> >
> >       in the /var/log/srp_daemon.log i see errors of the following ilk:
> >
> >  5/06/07 15:35:47 : No response to inform info registration
> >  25/06/07 15:35:47 : Fail to register to traps, maybe there is no opensm
> >  running on fabric
> >  25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running
> >
> >       opensm is running. Anyone know what to look at to debug this?
>
> Check by running sminfo on SRP daemon node. If everything is fine, look
> at OpenSM log for errors.


Also, is local LID non 0 ? ibstatus or ibstat.

-- Hal

> Sasha

From sashak at voltaire.com  Wed Jul 25 15:34:02 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 26 Jul 2007 01:34:02 +0300
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
	<A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
	<795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>
Message-ID: <20070725223402.GK31582@sashak.voltaire.com>

On 15:02 Wed 25 Jul     , Jeff Becker wrote:
>  Hi all. I think I fixed this. If you go to http://git.openfabrics.org,
>  it redirects to http://www.openfabrics.org/git. I also fixed the link
>  on the developer resources page. This is my first experience with
>  apache redirects so if you see anything wrong, or have any
>  suggestions, don't hesitate to send me mail. Thanks.

Seems URL http://git.openfabrics.org/git is redirected to 
http://www.openfabrics.org/git//git now. Probably something like

  RewriteCond %{REQUEST_URI} !^/git

can help.

Also double slash ('//') doesn't look good, but it is minor.

Sasha


From becker at nas.nasa.gov  Wed Jul 25 15:40:59 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Wed, 25 Jul 2007 15:40:59 -0700
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <20070725223402.GK31582@sashak.voltaire.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
	<A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
	<795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>
	<20070725223402.GK31582@sashak.voltaire.com>
Message-ID: <795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com>

Thanks. Fixed (including the slash).

-jeff

On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 15:02 Wed 25 Jul     , Jeff Becker wrote:
> >  Hi all. I think I fixed this. If you go to http://git.openfabrics.org,
> >  it redirects to http://www.openfabrics.org/git. I also fixed the link
> >  on the developer resources page. This is my first experience with
> >  apache redirects so if you see anything wrong, or have any
> >  suggestions, don't hesitate to send me mail. Thanks.
>
> Seems URL http://git.openfabrics.org/git is redirected to
> http://www.openfabrics.org/git//git now. Probably something like
>
>   RewriteCond %{REQUEST_URI} !^/git
>
> can help.
>
> Also double slash ('//') doesn't look good, but it is minor.
>
> Sasha
>


From jeff.c.becker at gmail.com  Wed Jul 25 15:46:40 2007
From: jeff.c.becker at gmail.com (Jeff Becker)
Date: Wed, 25 Jul 2007 15:46:40 -0700
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com>
References: <A85B03FF-E321-4037-939C-917B9B483DED@cisco.com>
	<A4A93717-5A04-41F8-9BF9-A3D37A3F7530@cisco.com>
	<795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>
	<20070725223402.GK31582@sashak.voltaire.com>
	<795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com>
Message-ID: <795c49870707251546k7df63d0bqf056cc3ce35a959b@mail.gmail.com>

Whoops! Now git.openfabrics.org doesn't work right. I'll test it more
thoroughly after I think I've fixed it. Sorry.

-jeff

On 7/25/07, Jeff Becker <becker at nas.nasa.gov> wrote:
> Thanks. Fixed (including the slash).
>
> -jeff
>
> On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > On 15:02 Wed 25 Jul     , Jeff Becker wrote:
> > >  Hi all. I think I fixed this. If you go to http://git.openfabrics.org,
> > >  it redirects to http://www.openfabrics.org/git. I also fixed the link
> > >  on the developer resources page. This is my first experience with
> > >  apache redirects so if you see anything wrong, or have any
> > >  suggestions, don't hesitate to send me mail. Thanks.
> >
> > Seems URL http://git.openfabrics.org/git is redirected to
> > http://www.openfabrics.org/git//git now. Probably something like
> >
> >   RewriteCond %{REQUEST_URI} !^/git
> >
> > can help.
> >
> > Also double slash ('//') doesn't look good, but it is minor.
> >
> > Sasha
> >
>


From hal.rosenstock at gmail.com  Wed Jul 25 15:52:52 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 18:52:52 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
	<OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
Message-ID: <f0e08f230707251552q6394d8fy660785d37e0959f3@mail.gmail.com>

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>  Hello Eitan, Hal,
>
> Thanks. It's good that openSM has a configuration option to set up these
> attributes for MC. Would it be a good idea to add the following to openSM:
> when there is no MTU defined in the configuration file, the SM picks the
> smallest link MTU in the fabric by default?
>

Issue is that today's lowest MTU might not be tomorrow's but this could be
an additional policy added to OpenSM. IMO it should not be the default
policy. I think that's as far as we got on this last time.

-- Hal
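
To make the proposed policy concrete, here is a small sketch under stated assumptions: MTU values use the IB MTU codes (1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes), a joiner is accepted when its MTU is at least the group MTU, and all function names are hypothetical rather than OpenSM APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Convert an IB MTU code (1..5) to bytes: 256, 512, 1024, 2048, 4096. */
static unsigned ib_mtu_code_to_bytes(unsigned code)
{
	assert(code >= 1 && code <= 5);
	return 128u << code;
}

/*
 * Hypothetical policy helper: smallest MTU code among the given links.
 * Returns 0 for an empty list.
 */
static unsigned smallest_link_mtu(const unsigned *codes, size_t n)
{
	unsigned min = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (min == 0 || codes[i] < min)
			min = codes[i];
	return min;
}

/* Join check: a joiner with insufficient MTU for the group is denied. */
static int join_allowed(unsigned joiner_mtu_code, unsigned group_mtu_code)
{
	return joiner_mtu_code >= group_mtu_code;
}
```

The sketch also shows the concern above: a group created today with the smallest MTU seen (say 2048) will deny a 1024-byte node that is attached tomorrow, so the computed minimum is only as good as the fabric it was computed on.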


>  MTU is unlike rate; a slower rate might indicate a cabling problem. So
> using the smallest link MTU in the fabric might not be a bad choice for MC
> by default. The reason I ask is that when creating an IP multicast group,
> MTU is not an attribute of the group. When mapping IP multicast to IB
> multicast, IB multicast might fail because of different IB link MTU sizes
> in the group, but the IP multicast group will be created successfully
> without knowing about the failure. If the admin sets the MTU in the
> configuration file, the admin would know about this failure. Otherwise,
> admins/users could spend too much time debugging their broken multicast
> applications.
>
> Thanks
> Shirley Ma
>
> "Eitan Zahavi" <eitan at mellanox.co.il> wrote on 07/25/07 12:25 PM:
>
> To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley
> Ma/Beaverton/IBM at IBMUS
> cc: <general at lists.openfabrics.org>
> Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
>
> Hi Shirley,
>
> I think I understand where your question comes from...
> Many have issues with heterogeneous fabrics where not all nodes have the
> same MTU or speed, especially when IPoIB relies on all nodes joining the
> broadcast group.
>
> The term "join" for multicast groups is a little overloaded.
> If a node joins an existing MC group, it has to have a rate (speed x
> width) >= MCG.rate and support MTU >= MCG.MTU; otherwise it is denied.
> If the join is actually a "create", the node has to provide the rate and
> MTU, which define the MCG values.
>
> To allow the administrator to control the MTU and rate of the IPoIB MCGs,
> OpenSM provides the means to control these values per partition. See
> doc/partition-config.doc.
> Still, the administrator should know the lowest MTU and rate of the nodes
> expected to join the IPoIB subnet.
> The tradeoff is in the hands of the administrator, who can set a value
> that will prevent slow nodes from joining the group, or assign a low value
> that will fit all nodes but slow down communication...
>
> EZ
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>
> ------------------------------
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
> Sent: Wednesday, July 25, 2007 10:01 PM
> To: Shirley Ma
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] Re: openSM: Different IB MTUs
>
> Shirley,
>
> On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>    Hal,
>
>    Thanks for your prompt reply. I am asking how openSM handles
>    different link MTUs in the SA MCMemberRecord MTU. For example, suppose
>    some links have a 2K MTU and some have a 1K MTU. When enabling IPoIB,
>    how does the SM decide the IPoIB broadcast group MCMemberRecord MTU
>    size? When an IB multicast group is first created from a 2K MTU node,
>    which PMTU value is attached to this group's MCMemberRecord MTU?
>
>
>
> MCMemberRecord MTU gets the group MTU (set when the group is created).
> This comes either from the first joiner with sufficient components or is
> preconfigured (the MTU can be set in the config). If a joiner has
> insufficient MTU for the group, it is denied.
>
> -- Hal
>
>
>    Thanks
>    Shirley Ma
>
>    "Hal Rosenstock" <hal.rosenstock at gmail.com> wrote on
>    07/25/07 10:57 AM:
>
>    To: Shirley Ma/Beaverton/IBM at IBMUS
>    cc: general at lists.openfabrics.org
>    Subject: Re: openSM: Different IB MTUs
>
>    Shirley,
>
>    On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>       Hello Hal,
>
>          How does openSM handle CAs with different MTUs in the
>          same subnet? For example, IPoIB broadcast group MTU, IB multicast group
>          PMTU? Does openSM pick up the smallest MTU in the subnet?
>
>
>    Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
>    MCMemberRecord MTU, or all of these ?
>
>    -- Hal
>       Thanks
>          Shirley Ma
>
>
>
>
>

From swise at opengridcomputing.com  Wed Jul 25 16:11:22 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 25 Jul 2007 18:11:22 -0500
Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A69225.9090502@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
	<46A69225.9090502@ichips.intel.com>
Message-ID: <46A7D89A.8080501@opengridcomputing.com>

Sorry guys, I haven't had time to catch up on this thread yet...

I'll try and answer you by EOB tomorrow.

Steve.


Sean Hefty wrote:
> Steve,
> 
> Do you have any input with respect to how the RDMA CM selects and maps 
> QoS (priority, traffic class, VLAN, flow label, etc.)?  (See below)
> 
> Hide the QoS selection under the current interface?  Use the IPv6 
> flowinfo field?  Rely on destination port?  Input QoS through existing 
> or new call?  Handle IPv4 and IPv6 addresses differently?  ???
> 
> - Sean
> 
>>> 2.6. ULPs and programs using CMA to establish RC connection should 
>>> provide the CMA the target IP and Service-ID. Some of the ULPs might
>>> also provide QoS-Class (E.g. for SDP sockets that are provided the
>>> TOS socket option). The CMA should then use the provided Service-ID
>>> and optional QoS-Class and pass them in the PR/MPR request. The
>>> resulting PR/MPR should be used for configuring the connection QP.
>>
>> The interface to the CMA needs to remain as transport independent as 
>> possible, and I am unsure of the transport independence of tying QoS 
>> to the destination port number.  (I'm not disagreeing; I'm just not 
>> sure at the moment it's the right approach.)
>>
>>> 5. CMA features
>>> ----------------
>>>
>>> The CMA interface supports Service-ID through the notion of port
>>> space as a prefix to the port_num which is part of the sockaddr
>>> provided to rdma_resolve_addr(). What is missing is the explicit
>>> request for a QoS-Class that should allow the ULP (like SDP) to
>>> propagate a specific request for a class of service. A mechanism for
>>> providing the QoS-Class is available in the IPv6 address, so we could
>>> use that address field. Another option is to implement a special 
>>> connection options API for CMA.
>>>
>>> What the CMA currently lacks is the use of the provided QoS-Class
>>> and Service-ID in the PR/MPR query it sends. When a response is
>>> obtained, it is an existing requirement for the CMA to use the PR/MPR
>>> from the response in setting up the QP address vector.
>>
>> The most natural function to specify additional QoS parameters would 
>> be rdma_resolve_route.
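
One of the options raised above is to carry the QoS-Class in the IPv6
flowinfo field of the sockaddr. A minimal userspace sketch of that idea
follows. It assumes the QoS-Class is placed in the 8 traffic-class bits of
sin6_flowinfo; that mapping, and the helper names set_qos_class() and
get_qos_class(), are hypothetical and are not part of the RDMA CM API.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * sin6_flowinfo is stored in network byte order. In host order its low
 * 28 bits are an 8-bit traffic class followed by a 20-bit flow label.
 */
#define FLOWINFO_TCLASS_SHIFT 20
#define FLOWINFO_LABEL_MASK   0x000fffffu

/* Hypothetical helper: store a QoS class in the traffic-class bits. */
static void set_qos_class(struct sockaddr_in6 *sa, uint8_t qos_class,
                          uint32_t flow_label)
{
    uint32_t info = ((uint32_t)qos_class << FLOWINFO_TCLASS_SHIFT) |
                    (flow_label & FLOWINFO_LABEL_MASK);

    sa->sin6_flowinfo = htonl(info);
}

/* Hypothetical helper: read the QoS class back out of the sockaddr. */
static uint8_t get_qos_class(const struct sockaddr_in6 *sa)
{
    return (uint8_t)((ntohl(sa->sin6_flowinfo) >> FLOWINFO_TCLASS_SHIFT) &
                     0xffu);
}
```

With this layout an IPv4 sockaddr has nowhere to carry the class, which is
one reason the question of handling IPv4 and IPv6 addresses differently
comes up at all.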


From mshefty at ichips.intel.com  Wed Jul 25 16:38:32 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jul 2007 16:38:32 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A7DEF8.7040608@ichips.intel.com>

> QoS Policy file syntax
> 
> * Empty lines are ignored
> * Leading and trailing blanks are ignored, so the indentation in the
>   example is just for better readability
> * Comments start with the pound sign (#) and are terminated by EOL
> * Comments may appear only on a separate line
> * Keywords that denote the start of a section/subsection have matching
>   closing keywords
> * Any keyword should be the first non-blank token on its line
> 
> QoS Policy file example
> 
>     # Port Groups define sets of ports to be used later in the settings
>     port-groups
>         # using port GUIDs
>         port-group
>             name: Storage
>             # "use" is just a description that is used for logging.
>             #  Other than that, it is just a commentary
>             use: our SRP storage targets
>             port-guid: 0x1000000000000001
>             port-guid: 0x1000000000000002
>         end-port-group
> 
>         port-group
>             name: Virtual Servers
>             use: node desc and IB port num
>             # The syntax of the port name is as follows:
>             # "hostname/CA-num/Pnum".
>             # "hostname" and "CA-num" are compared to the first 2 words of
>             # NodeDescription, and "Pnum" is a port number on that node.
>             port-name: vs1/HCA-1/P1
>             port-name: vs3/HCA-1/P1
>             port-name: vs3/HCA-2/P2
>         end-port-group
> 
>         # using partitions defined in the partition policy
>         port-group
>             name: Group for Partition 1
>             use: default settings
>             partition: Part1
>         end-port-group
> 
>         # using node types CA|ROUTER|SWITCH
>         port-group
>             name: Routers
>             use: all routers
>             node-type: ROUTER
>         end-port-group
> 
>     end-port-groups
> 
>     qos-setup
> 
>         # define all types of VLArb tables. The length of each table should
>         # match the table length physically supported by its target ports
>         vlarb-tables
>             # scope defines the exact ports the VLArb tables apply to
>             vlarb-scope
>                 # defining VLArb tables on all the ports that belong to
>                 # port group 'Storage', and on all the ports connected
>                 # to ports of port group 'Storage'
>                 group: Storage
>                 # "across" means all the ports that are connected to ports
>                 # that belong to the specified port group
>                 across: Storage
>                 # VLArb table holds VL and weight pairs
>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                 vl-high-limit: 10
>             end-vlarb-scope
>             # There can be several scopes
>         end-vlarb-tables
> 
>         sl2vl-tables
>             # Scope defines the exact devices and in/out ports the tables
>             # apply to.
>             # Note: if the same port matches several rules, the *FIRST*
>             # one applies.
>             sl2vl-scope
>                 # SL2VL tables are organized as SL2VL(in-port,out-port)
>                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                 #
>                 # The following example specifies that all the SL2VL table
>                 # entries should be defined for all the ports of group Part1:
>                 group: Part1
>                 from: *
>                 to: *
>                 # An SL2VL table has at most 16 values, one for each SL.
>                 # If the user specifies fewer than 16 values, all the
>                 # missing VL values are implicitly set to 0
>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>             end-sl2vl-scope
> 
>             sl2vl-scope
>                 # "across-to" is a combination of the "across" keyword
>                 # (defined in the VLArb tables section) and the "to" keyword.
>                 # "across: PortGroupName" refers to all the ports that are
>                 # connected to ports that belong to PortGroupName.
>                 #
>                 # Example of "across-to" usage:
>                 #   A user has a set of 'special' nodes (e.g. storage
>                 #   nodes), and all the traffic to these nodes has to get
>                 #   a specific VL. The solution is to define a port group
>                 #   (e.g. "Storage") that includes all the ports of these
>                 #   nodes, and then to configure SL2VL tables on all the
>                 #   switch ports that are connected to the Storage port
>                 #   group by specifying "across-to: Storage".
>                 #
>                 across-to: Storage2
>                 # Similar to "across-to", "across-from" is a combination
>                 # of the "across" and "from" keywords
>                 across-from: Storage1
>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>             end-sl2vl-scope
>         end-sl2vl-tables
> 
>     end-qos-setup
> 
> 
>     qos-levels
> 
>         # the first one is just setting SL
>         qos-level
>             use: for the lowest priority communication
>             sl: 15
>             packet-life: 16
>         end-qos-level
>         # the second sets SL and QoS Class
>         qos-level
>             use: low latency best bandwidth
>             sl: 0
>         end-qos-level
>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime,
>         # Path Bits
>         qos-level
>             use: just an example
>             sl: 0
>             mtu-limit: 1
>             rate-limit: 1
>             packet-life: 12
>             # Path Bits can be used e.g. to provide different routes
>             # through the subnet to a particular port
>             path-bits: 2,4,8-32
>         end-qos-level
> 
>     end-qos-levels
> 
> 
>     # Match rules are scanned in a first-fit manner (like a firewall
>     # rules table)
>     qos-match-rules
> 
>         # matching by a single criterion: class (list of values and ranges)
>         qos-match-rule
>             # just a description
>             use: low latency by class 7-9 or 11
>             qos-class: 7-9,11
>             # number of qos-level to apply to the matching PR/MPR
>             qos-level-sn: 1
>         end-qos-match-rule
>         # show matching by destination group AND service-ids
>         qos-match-rule
>             use: Storage targets connection
>             destination: Storage
>             service-id: 22,4719-5000
>             qos-level-sn: 2
>         end-qos-match-rule
>         # show matching by source group only
>         qos-match-rule
>             use: bla bla
>             source: Storage
>             qos-level-sn: 3
>         end-qos-match-rule
> 
>     end-qos-match-rules

What creates this file?  If we expect an administrator to create this 
manually, then I think we need something much, much simpler.

- Sean
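
To make the syntax rules quoted above concrete, here is a minimal sketch of
a per-line classifier for the policy file. It is not taken from OpenSM. It
assumes, based only on the example, that "keyword: value" lines always
contain a colon, that section keywords never do, and that every closing
keyword starts with "end-". The names line_kind and classify_line() are
hypothetical.

```c
#include <ctype.h>
#include <string.h>

enum line_kind {
    LINE_BLANK,          /* empty or whitespace only: ignored */
    LINE_COMMENT,        /* first non-blank character is '#' */
    LINE_SECTION_START,  /* e.g. "port-group" */
    LINE_SECTION_END,    /* e.g. "end-port-group" */
    LINE_KEY_VALUE       /* e.g. "name: Storage" */
};

/* Classify one line of the policy file (hypothetical helper). */
static enum line_kind classify_line(const char *line)
{
    /* Leading blanks are ignored, so indentation carries no meaning. */
    while (*line && isspace((unsigned char)*line))
        line++;

    if (*line == '\0')
        return LINE_BLANK;
    /* Comments occupy a whole line, so only the first character matters. */
    if (*line == '#')
        return LINE_COMMENT;
    /* Assumption: only "keyword: value" lines contain a colon. */
    if (strchr(line, ':'))
        return LINE_KEY_VALUE;
    /* Assumption: closing keywords are the opening keyword plus "end-". */
    if (strncmp(line, "end-", 4) == 0)
        return LINE_SECTION_END;
    return LINE_SECTION_START;
}
```

A full parser would additionally keep a stack of open sections and check
that each "end-" keyword matches the most recently opened one.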


From kliteyn at dev.mellanox.co.il  Wed Jul 25 17:06:50 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 26 Jul 2007 03:06:50 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A7DEF8.7040608@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A7DEF8.7040608@ichips.intel.com>
Message-ID: <46A7E59A.5070801@dev.mellanox.co.il>

Sean Hefty wrote:
> 
> What creates this file?  If we expect an administrator to create this 
> manually, then I think we need something much, much simpler.

This file has *all* the possible keywords.
The administrator really doesn't have to use them all.
For instance, there are three different ways to define port groups:
  - by guid list
  - by node type
  - by port names
You could stick with the guids only - this gives you all the functionality
you need, but by doing so you lose some flexibility.

-- Yevgeny

> - Sean
> 
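
The policy file discussed in this thread states that match rules are
scanned in a first-fit manner and that criteria such as qos-class and
service-id take lists of values and ranges (for example "7-9,11"). The
sketch below illustrates that behaviour under simplifying assumptions: it
models only the qos-class and service-id criteria, leaves out source and
destination port groups, and all names in it (list_contains, match_rule,
first_fit) are hypothetical rather than OpenSM code.

```c
#include <stdlib.h>

/* Return 1 if value is covered by a list such as "7-9,11", else 0. */
static int list_contains(const char *list, unsigned long value)
{
    const char *p = list;

    while (*p) {
        char *end;
        unsigned long lo = strtoul(p, &end, 0);
        unsigned long hi = lo;

        if (end == p)
            return 0;               /* malformed entry */
        if (*end == '-') {          /* "lo-hi" range */
            const char *q = end + 1;

            hi = strtoul(q, &end, 0);
            if (end == q)
                return 0;           /* malformed range */
        }
        if (value >= lo && value <= hi)
            return 1;
        if (*end != ',')
            break;                  /* end of list */
        p = end + 1;
    }
    return 0;
}

struct match_rule {
    const char *qos_class;   /* NULL: criterion not used by this rule */
    const char *service_id;  /* NULL: criterion not used by this rule */
    int qos_level_sn;        /* QoS level applied when the rule matches */
};

/* Scan rules in file order; the first rule whose criteria all hold wins. */
static int first_fit(const struct match_rule *rules, size_t n,
                     unsigned long qos_class, unsigned long service_id)
{
    for (size_t i = 0; i < n; i++) {
        if (rules[i].qos_class &&
            !list_contains(rules[i].qos_class, qos_class))
            continue;
        if (rules[i].service_id &&
            !list_contains(rules[i].service_id, service_id))
            continue;
        return rules[i].qos_level_sn;
    }
    return -1;                      /* no rule matched */
}
```

Because the scan stops at the first hit, the order of rules in the file is
significant: a request that satisfies two rules gets the QoS level of
whichever rule appears first.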


From jgunthorpe at obsidianresearch.com  Wed Jul 25 17:16:16 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 25 Jul 2007 18:16:16 -0600
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A7E59A.5070801@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A7DEF8.7040608@ichips.intel.com>
	<46A7E59A.5070801@dev.mellanox.co.il>
Message-ID: <20070726001616.GN19768@obsidianresearch.com>

On Thu, Jul 26, 2007 at 03:06:50AM +0300, Yevgeny Kliteynik wrote:

> This file has *all* the possible keywords.
> The administrator really doesn't have to use them all.
> For instance, there are three different ways to define port groups:
>  - by guid list
>  - by node type
>  - by port names
> You could stick with the guids only - this gives you all the functionality
> you need, but by doing so you lose some flexibility.

As a general quibble, this configuration language is unlike any I have
ever seen. Is it really necessary to invent something new for this?
Can't one of the common styles (ISC bind/dhcp, Windows INI, XML)
work?

XML with a DTD is becoming very popular for this kind of rich data.

Jason


From akepner at sgi.com  Wed Jul 25 18:49:31 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 25 Jul 2007 18:49:31 -0700
Subject: [ofa-general] [RFC/PATCH] mthca: ensure alignment of doorbell writes
Message-ID: <20070726014931.GL10235@sgi.com>


On ia64 we sometimes get "kernel unaligned access"
exceptions when doing doorbell writes. How about
something like the following to fix things up?

Tested on ia64 with a Mellanox MT23108 HCA.

 mthca_cq.c       |   33 +++++++++++++++------------------
 mthca_doorbell.h |   15 ++++++++++-----
 mthca_eq.c       |   28 ++++++++++++----------------
 mthca_qp.c       |   41 ++++++++++++++++++-----------------------
 mthca_srq.c      |   16 +++++++---------
 5 files changed, 62 insertions(+), 71 deletions(-)

Signed-off-by: Arthur Kepner <akepner at sgi.com>

--

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c	2007-07-25 17:25:16.697025633 -0700
@@ -203,17 +203,16 @@ static void dump_cqe(struct mthca_dev *d
 static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
 				     int incr)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
 	if (mthca_is_memfree(dev)) {
 		*cq->set_ci_db = cpu_to_be32(cq->cons_index);
 		wmb();
 	} else {
-		doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
-		doorbell[1] = cpu_to_be32(incr - 1);
+		db.val32[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
+		db.val32[1] = cpu_to_be32(incr - 1);
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_CQ_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_CQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
 		 * Make sure doorbells don't leak out of CQ spinlock
@@ -728,16 +727,15 @@ repoll:
 
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
+	db.val32[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT)      |
 				  to_mcq(cq)->cqn);
-	doorbell[1] = (__force __be32) 0xffffffff;
+	db.val32[1] = (__force __be32) 0xffffffff;
 
-	mthca_write64(doorbell,
-		      to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
+	mthca_ring_db(db, to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock));
 
 	return 0;
@@ -746,18 +744,18 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	u32 sn;
 	__be32 ci;
 
 	sn = cq->arm_sn & 3;
 	ci = cpu_to_be32(cq->cons_index);
 
-	doorbell[0] = ci;
-	doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
+	db.val32[0] = ci;
+	db.val32[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
 				  (notify == IB_CQ_SOLICITED ? 1 : 2));
 
-	mthca_write_db_rec(doorbell, cq->arm_db);
+	mthca_write_db_rec(db.val32, cq->arm_db);
 
 	/*
 	 * Make sure that the doorbell record in host memory is
@@ -765,15 +763,14 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc
 	 */
 	wmb();
 
-	doorbell[0] = cpu_to_be32((sn << 28)                       |
+	db.val32[0] = cpu_to_be32((sn << 28)                       |
 				  (notify == IB_CQ_SOLICITED ?
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT)      |
 				  cq->cqn);
-	doorbell[1] = ci;
+	db.val32[1] = ci;
 
-	mthca_write64(doorbell,
-		      to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
+	mthca_ring_db(db, to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock));
 
 	return 0;
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_doorbell.h
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_doorbell.h	2007-07-25 18:03:02.088946003 -0700
@@ -42,6 +42,11 @@
 #define MTHCA_CQ_DOORBELL      0x20
 #define MTHCA_EQ_DOORBELL      0x28
 
+union mthca_doorbell {
+	__be64 val64;
+	__be32 val32[2];
+} __attribute__ ((aligned (sizeof(__be64))));
+
 #if BITS_PER_LONG == 64
 /*
  * Assume that we can just write a 64-bit doorbell atomically.  s390
@@ -58,10 +63,10 @@ static inline void mthca_write64_raw(__b
 	__raw_writeq((__force u64) val, dest);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
-	__raw_writeq(*(u64 *) val, dest);
+	__raw_writeq((u64)db.val64, dest);
 }
 
 static inline void mthca_write_db_rec(__be32 val[2], __be32 *db)
@@ -87,14 +92,14 @@ static inline void mthca_write64_raw(__b
 	__raw_writel(((__force u32 *) &val)[1], dest + 4);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(doorbell_lock, flags);
-	__raw_writel((__force u32) val[0], dest);
-	__raw_writel((__force u32) val[1], dest + 4);
+	__raw_writel((__force u32) db.val32[0], dest);
+	__raw_writel((__force u32) db.val32[1], dest + 4);
 	spin_unlock_irqrestore(doorbell_lock, flags);
 }
 
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-25 17:25:34.397279816 -0700
@@ -173,10 +173,10 @@ static inline u64 async_mask(struct mthc
 
 static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
-	doorbell[1] = cpu_to_be32(ci & (eq->nent - 1));
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
+	db.val32[1] = cpu_to_be32(ci & (eq->nent - 1));
 
 	/*
 	 * This barrier makes sure that all updates to ownership bits
@@ -187,8 +187,7 @@ static inline void tavor_set_eq_ci(struc
 	 * having set_eqe_hw() overwrite the owner field.
 	 */
 	wmb();
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -212,13 +211,11 @@ static inline void set_eq_ci(struct mthc
 
 static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
-	doorbell[1] = 0;
-
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
+	db.val32[1] = 0;
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -230,13 +227,12 @@ static inline void arbel_eq_req_not(stru
 static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn)
 {
 	if (!mthca_is_memfree(dev)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
-		doorbell[1] = cpu_to_be32(cqn);
+		db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
+		db.val32[1] = cpu_to_be32(cqn);
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_EQ_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 }
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-25 17:25:58.057619693 -0700
@@ -1730,16 +1730,15 @@ int mthca_tavor_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
+		db.val32[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
 					   qp->send_wqe_offset) | f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
 		 * Make sure doorbells don't leak out of SQ spinlock
@@ -1760,7 +1759,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1836,13 +1835,12 @@ int mthca_tavor_post_receive(struct ib_q
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-			doorbell[1] = cpu_to_be32(qp->qpn << 8);
+			db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			db.val32[1] = cpu_to_be32(qp->qpn << 8);
 
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
@@ -1852,13 +1850,12 @@ int mthca_tavor_post_receive(struct ib_q
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
+		db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | nreq);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
@@ -1880,7 +1877,7 @@ int mthca_arbel_post_send(struct ib_qp *
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	void *wqe;
 	void *prev_wqe;
 	unsigned long flags;
@@ -1903,10 +1900,10 @@ int mthca_arbel_post_send(struct ib_qp *
 		if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
+			db.val32[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
 						  ((qp->sq.head & 0xffff) << 8) |
 						  f0 | op0);
-			doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+			db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 			qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB;
 			size0 = 0;
@@ -1923,8 +1920,7 @@ int mthca_arbel_post_send(struct ib_qp *
 			 * write MMIO send doorbell.
 			 */
 			wmb();
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_SEND_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		}
 
@@ -2108,10 +2104,10 @@ int mthca_arbel_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((nreq << 24)                  |
+		db.val32[0] = cpu_to_be32((nreq << 24)                  |
 					  ((qp->sq.head & 0xffff) << 8) |
 					  f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		qp->sq.head += nreq;
 
@@ -2127,8 +2123,7 @@ out:
 		 * write MMIO send doorbell.
 		 */
 		wmb();
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-20 14:42:52.862494291 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-25 17:26:07.925761483 -0700
@@ -485,7 +485,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -565,8 +565,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-			doorbell[1] = cpu_to_be32(srq->srqn << 8);
+			db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			db.val32[1] = cpu_to_be32(srq->srqn << 8);
 
 			/*
 			 * Make sure that descriptors are written
@@ -574,8 +574,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 			 */
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			first_ind = srq->first_free;
@@ -583,8 +582,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}
 
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
+		db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+		db.val32[1] = cpu_to_be32((srq->srqn << 8) | nreq);
 
 		/*
 		 * Make sure that descriptors are written before
@@ -592,8 +591,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 		 */
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
-- 
Arthur


From mshefty at ichips.intel.com  Wed Jul 25 20:00:07 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jul 2007 20:00:07 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A7E59A.5070801@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A7DEF8.7040608@ichips.intel.com>
	<46A7E59A.5070801@dev.mellanox.co.il>
Message-ID: <46A80E37.5080304@ichips.intel.com>

> This file has *all* the possible keywords.
> The administrator really doesn't have to use them all.
> For instance, there are three different ways to define port groups:
>  - by guid list
>  - by node type
>  - by port names
> You could stick with the guids only - this gives you all the functionality
> you need, but by doing so you lose some flexibility.

Beyond referring to port GUIDs, I'm also referring to items like:

sl: 0
mtu-limit: 1
rate-limit: 1
packet-life: 12
path-bits: 2,4,8-32

vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
vl-high-limit: 10

sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0

sl: 0
mtu-limit: 1
rate-limit: 1
packet-life: 12
path-bits: 2,4,8-32

This is really low level data, akin to the administrator manually 
programming the switch tables.  My take is that we should drop tons of 
this flexibility in favor of something much simpler for the administrator.

- Sean


From mst at dev.mellanox.co.il  Wed Jul 25 20:39:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 06:39:46 +0300
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell
	writes
In-Reply-To: <20070726014931.GL10235@sgi.com>
References: <20070726014931.GL10235@sgi.com>
Message-ID: <20070726033946.GA31524@mellanox.co.il>

> @@ -58,10 +63,10 @@ static inline void mthca_write64_raw(__b
>  	__raw_writeq((__force u64) val, dest);
>  }
>  
> -static inline void mthca_write64(__be32 val[2], void __iomem *dest,
> +static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
>  				 spinlock_t *doorbell_lock)
>  {
> -	__raw_writeq(*(u64 *) val, dest);
> +	__raw_writeq((u64)db.val64, dest);
>  }
>  
>  static inline void mthca_write_db_rec(__be32 val[2], __be32 *db)
> @@ -87,14 +92,14 @@ static inline void mthca_write64_raw(__b
>  	__raw_writel(((__force u32 *) &val)[1], dest + 4);
>  }
>  
> -static inline void mthca_write64(__be32 val[2], void __iomem *dest,
> +static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
>  				 spinlock_t *doorbell_lock)
>  {
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(doorbell_lock, flags);
> -	__raw_writel((__force u32) val[0], dest);
> -	__raw_writel((__force u32) val[1], dest + 4);
> +	__raw_writel((__force u32) db.val32[0], dest);
> +	__raw_writel((__force u32) db.val32[1], dest + 4);
>  	spin_unlock_irqrestore(doorbell_lock, flags);
>  }

These should be getting 'union mthca_doorbell *db' I think.
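
A minimal userspace sketch of the pointer-passing variant suggested above. The definition of union mthca_doorbell is not shown in this excerpt, so the layout below (two 32-bit halves overlaid on one 64-bit value) is an assumption, and `union doorbell` and `ring_db` are hypothetical stand-ins rather than the driver's code. A real driver would write through an __iomem pointer with __raw_writeq or __raw_writel instead of a plain store.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout: overlaying a 64-bit member on the two 32-bit halves
 * gives the object the natural alignment of uint64_t (8 bytes on typical
 * 64-bit ABIs), which a bare 32-bit array of two elements does not get.
 */
union doorbell {
	uint64_t val64;
	uint32_t val32[2];
};

/*
 * Hypothetical stand-in for mthca_ring_db(): the doorbell is taken by
 * pointer, so the caller's already-aligned object is read in place
 * instead of being copied into a by-value argument.
 */
static inline void ring_db(const union doorbell *db, volatile uint64_t *dest)
{
	*dest = db->val64;	/* single 64-bit store */
}
```

A caller would fill db.val32[0] and db.val32[1] as in the patch and then pass &db. Whether by-pointer or by-value generates better code is compiler dependent; the sketch only shows the calling convention.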

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-25 17:25:34.397279816 -0700
@@ -173,10 +173,10 @@ static inline u64 async_mask(struct mthc
 
 static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
-	doorbell[1] = cpu_to_be32(ci & (eq->nent - 1));
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
+	db.val32[1] = cpu_to_be32(ci & (eq->nent - 1));
 
 	/*
 	 * This barrier makes sure that all updates to ownership bits
@@ -187,8 +187,7 @@ static inline void tavor_set_eq_ci(struc
 	 * having set_eqe_hw() overwrite the owner field.
 	 */
 	wmb();
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -212,13 +211,11 @@ static inline void set_eq_ci(struct mthc
 
 static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
-	doorbell[1] = 0;
-
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
+	db.val32[1] = 0;
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -230,13 +227,12 @@ static inline void arbel_eq_req_not(stru
 static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn)
 {
 	if (!mthca_is_memfree(dev)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
-		doorbell[1] = cpu_to_be32(cqn);
+		db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
+		db.val32[1] = cpu_to_be32(cqn);
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_EQ_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 }
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-25 17:25:58.057619693 -0700
@@ -1730,16 +1730,15 @@ int mthca_tavor_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
+		db.val32[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
 					   qp->send_wqe_offset) | f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
 		 * Make sure doorbells don't leak out of SQ spinlock
@@ -1760,7 +1759,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1836,13 +1835,12 @@ int mthca_tavor_post_receive(struct ib_q
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-			doorbell[1] = cpu_to_be32(qp->qpn << 8);
+			db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			db.val32[1] = cpu_to_be32(qp->qpn << 8);
 
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
@@ -1852,13 +1850,12 @@ int mthca_tavor_post_receive(struct ib_q
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
+		db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | nreq);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
@@ -1880,7 +1877,7 @@ int mthca_arbel_post_send(struct ib_qp *
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	void *wqe;
 	void *prev_wqe;
 	unsigned long flags;
@@ -1903,10 +1900,10 @@ int mthca_arbel_post_send(struct ib_qp *
 		if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
+			db.val32[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
 						  ((qp->sq.head & 0xffff) << 8) |
 						  f0 | op0);
-			doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+			db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 			qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB;
 			size0 = 0;
@@ -1923,8 +1920,7 @@ int mthca_arbel_post_send(struct ib_qp *
 			 * write MMIO send doorbell.
 			 */
 			wmb();
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_SEND_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		}
 
@@ -2108,10 +2104,10 @@ int mthca_arbel_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((nreq << 24)                  |
+		db.val32[0] = cpu_to_be32((nreq << 24)                  |
 					  ((qp->sq.head & 0xffff) << 8) |
 					  f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		qp->sq.head += nreq;
 
@@ -2127,8 +2123,7 @@ out:
 		 * write MMIO send doorbell.
 		 */
 		wmb();
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-20 14:42:52.862494291 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-25 17:26:07.925761483 -0700
@@ -485,7 +485,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -565,8 +565,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-			doorbell[1] = cpu_to_be32(srq->srqn << 8);
+			db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			db.val32[1] = cpu_to_be32(srq->srqn << 8);
 
 			/*
 			 * Make sure that descriptors are written
@@ -574,8 +574,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 			 */
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			first_ind = srq->first_free;
@@ -583,8 +582,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}
 
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
+		db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+		db.val32[1] = cpu_to_be32((srq->srqn << 8) | nreq);
 
 		/*
 		 * Make sure that descriptors are written before
@@ -592,8 +591,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 		 */
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
-- 
Arthur

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- 
MST


From weiny2 at llnl.gov  Wed Jul 25 21:04:51 2007
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 25 Jul 2007 21:04:51 -0700
Subject: [ofa-general] Re: ANNOUNCE: ofed kernel build updates
In-Reply-To: <20070725141141.GG19872@mellanox.co.il>
References: <20070725141141.GG19872@mellanox.co.il>
Message-ID: <20070725210451.7014d3fc.weiny2@llnl.gov>

Michael,

I only got a chance to try the ofed_makedist.sh and compile (not actually run).
However the build worked very well!  So initial feedback is this works much
better.

Thanks,
Ira

On Wed, 25 Jul 2007 17:11:41 +0300
"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote:

> Hi!
> I'd like to announce a couple of updates that were recently made
> to the build scripts on the ofed_kernel branch.
> This is an attempt to answer repeated requests, aired at Sonoma,
> to simplify access to kernel sources.
> 
> The idea is that a user of a supported kernel will just be able
> to download an appropriate tarball and run with it without need for patching.
> 
> These changes are available from ofed_kernel git tree maintained by Vlad:
> git://git.openfabrics.org/~vlad/ofed_kernel.git ofed_kernel
> 
> The code is mine, but the ideas mostly come from criticism
> and code sent by Ira Weiny. Thanks, Ira!
> 
> Note that the changes were made in a backwards-compatible way,
> so that existing scripts using configure/make will continue working.
> 
> What's new:
> 
> 1. New script ofed_scripts/ofed_patch.sh
>    This will apply fixes and backport patches for a specific
>    kernel to the current tree.
>    Usage:
>    ./ofed_scripts/ofed_patch.sh --with-backport=VERSION
> 
>    This makes it possible for distro vendors to generate
>    a tarball pre-patched for a specific kernel.
> 
> 2. New script ofed_scripts/ofed_makedist.sh
>    This script repeatedly clones the current repository,
>    runs ofed_scripts/ofed_patch.sh,
>    and then builds tarballs of ofed kernel source pre-patched
>    for supported kernel versions.
> 
>    I plan to work with Vlad to run this script as part of
>    nightly builds, so that prepatched tarballs will become
>    available for download.
> 
> 3. configure script made re-entrant
>    configure script does not apply patches anymore:
>    all it does is create configure.mk.kernel and autoconf.h files.
> 
>    This finally makes it possible to change
>    configuration parameters just by re-running configure.
> 
>    For backwards-compatibility, if configure detects
>    that ofed_scripts/ofed_patch.sh was not run yet,
>    it prints a warning and runs it automatically.
> 
> Feedback welcome.
> 
> -- 
> MST


From kliteyn at mellanox.co.il  Wed Jul 25 21:07:51 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 26 Jul 2007 07:07:51 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-26:normal completion
Message-ID: <MTLEXCH01L9BGTuAyHs000007bd@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=520  Pass=520  Fail=0
 
 
Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From erezz at voltaire.com  Wed Jul 25 21:58:18 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 26 Jul 2007 07:58:18 +0300
Subject: [ofa-general] iSER header
References: <20070709144702.GB24125@postal.iol.unh.edu>
	<46933130.6040100@voltaire.com>
	<20070725192230.GA13579@postal.iol.unh.edu>
Message-ID: <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>

> Ok, so this isn't something that I will need to worry a lot about if I am
> planning on using iWARP?

 
You will need to use the iSER header as defined in the iSER spec.


>> I hope this helps.
>
> It does, thank you.
> 
>> BTW - do you plan to use the current iSER initiator
>> code for iWARP?
> 
> Yes, we are working on an iSER-assisted initiator and target using this code
> and the UNH iSCSI implementation.

I guess that you meant that you're using only the UNH iSCSI target (because the iSER initiator should be used with open-iscsi). Will you send patches for iSER soon? I'd like to test it, and make sure that iSER over IB is not damaged.
 
Erez


From eitan at mellanox.co.il  Wed Jul 25 22:45:56 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 08:45:56 +0300
Subject: [ofa-general] RE: osm_physp_calc_link_ops question
In-Reply-To: <20070725211059.GH31582@sashak.voltaire.com>
References: <f0e08f230707251053k4e2e26deud1fa3cfacd1374f3@mail.gmail.com>
	<20070725211059.GH31582@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7563C@mtlexch01.mtl.com>

Hi Hal,

Good idea !
 

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Thursday, July 26, 2007 12:11 AM
> To: Hal Rosenstock
> Cc: OpenFabrics General; Eitan Zahavi; Yevgeny Kliteynik
> Subject: Re: osm_physp_calc_link_ops question
> 
> Hi Hal,
> 
> On 13:53 Wed 25 Jul     , Hal Rosenstock wrote:
> > 
> >  Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and
> >  osm_link_mgr.c:__osm_link_mgr_set_physp_pi call
> >  osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote
> >  end is invalid, the local VLCap is used as the OperationalVLs. When
> >  the VLCaps at the two ends of the link do not match, this is not a
> >  good thing. It causes trap storms on the flow control watchdog timer
> >  expiring. Wouldn't it be better to leave this field as is in this
> >  case or would that cause some other problem ?
> > 
> >  Same thing might also be true for link MTU but not as critical.
> 
> Looks like good idea for me. Would you care about patch?
> 
> Sasha
> 

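The behaviour Hal suggests in the quoted message could be sketched roughly as follows. This is not the OpenSM implementation: `calc_link_op_vls` and its parameters are hypothetical names, and the sketch assumes the PortInfo VLCap encoding, where a larger code means more data VLs (1 = VL0, 2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14), so the smaller code is the common subset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose OperationalVLs for a link.
 * When the remote end is not valid, keep the currently programmed value
 * instead of falling back to the local VLCap.
 */
static inline uint8_t calc_link_op_vls(uint8_t local_vl_cap,
				       bool remote_valid,
				       uint8_t remote_vl_cap,
				       uint8_t current_op_vls)
{
	if (!remote_valid)
		return current_op_vls;	/* leave the field as is */

	/* Both ends known: use what both sides support. */
	return local_vl_cap < remote_vl_cap ? local_vl_cap : remote_vl_cap;
}
```

Whether leaving the field untouched causes other problems is exactly the open question in the thread; the sketch only makes the proposed rule concrete.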

From eitan at mellanox.co.il  Wed Jul 25 23:00:50 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 09:00:50 +0300
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
	<OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>

I propose that when there is no MTU in the partition policy file, OpenSM
use a configurable default from /etc/cache/opensm/opensm.opt.
Something like:
# The default MTU to be used for IPoIB and other MCGs when the
# partition-policy does not provide an exact value. The default is the
# lowest possible MTU.
mcg_default_mtu 1
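
The fallback could look roughly like the sketch below. It assumes the IB MTU codes (1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes) and treats 0 as "no MTU given in the partition policy". `struct sm_opts`, `mcg_effective_mtu` and `ib_mtu_code_to_bytes` are hypothetical names for illustration, not OpenSM's API.

```c
#include <assert.h>
#include <stdint.h>

/* IB MTU codes: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
enum { IB_MTU_CODE_MIN = 1, IB_MTU_CODE_MAX = 5 };

/* Hypothetical options holder mirroring the proposed mcg_default_mtu key. */
struct sm_opts {
	uint8_t mcg_default_mtu;
};

/* partition_mtu == 0 means the partition policy did not specify an MTU. */
static inline uint8_t mcg_effective_mtu(uint8_t partition_mtu,
					const struct sm_opts *opts)
{
	uint8_t mtu = partition_mtu ? partition_mtu : opts->mcg_default_mtu;

	/* Out-of-range values fall back to the lowest possible MTU. */
	if (mtu < IB_MTU_CODE_MIN || mtu > IB_MTU_CODE_MAX)
		mtu = IB_MTU_CODE_MIN;
	return mtu;
}

/* Convert an MTU code to bytes: 1 -> 256, ..., 5 -> 4096. */
static inline unsigned int ib_mtu_code_to_bytes(uint8_t code)
{
	return 128u << code;
}
```

With mcg_default_mtu set to 1, a partition without an explicit MTU would get a 256-byte group MTU, which every node can join.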
 
Eitan Zahavi 
Senior Engineering Director, Software Architect 
Mellanox Technologies LTD 
Tel:+972-4-9097208
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 
 

________________________________

	From: Shirley Ma [mailto:xma at us.ibm.com] 
	Sent: Wednesday, July 25, 2007 10:45 PM
	To: Eitan Zahavi
	Cc: general at lists.openfabrics.org; Hal Rosenstock
	Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
	
	
	Hello Eitan, Hal,
	
	Thanks. It's good that openSM has a configuration option to set up
	these attributes in MC. Would it be a good idea to add the following to
	openSM: when there is no MTU defined in the configuration file, the SM
	picks the smallest link MTU in the fabric by default? MTU is unlike
	rate: a slower rate might indicate a cabling problem, so using the
	smallest link MTU in the fabric might not be a bad default for MC. The
	reason I ask is that when creating an IP multicast group, MTU is not an
	attribute of the group. When mapping IP multicast to IB multicast, the
	IB multicast join might fail because of different IB link MTU sizes in
	the group, but the IP multicast group will be created successfully
	without knowing about the failure. If the admin sets the MTU in the
	configuration file, the admin would know about this failure. Otherwise,
	admins/users could spend too much time debugging their broken
	multicast applications.
	
	Thanks
	Shirley Ma
	
	"Eitan Zahavi" <eitan at mellanox.co.il>
	07/25/07 12:25 PM
	To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
	cc: <general at lists.openfabrics.org>
	Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

	Hi Shirley,
	
	I think I understand where your question comes from...
	Many have issues with heterogeneous fabrics where not all nodes have
	the same MTU or speed, especially when IPoIB relies on all nodes
	joining the broadcast group.
	
	The term "join" for multicast groups is a little overloaded.
	If a node joins an existing MC group it has to have a rate
	(speed * width) > MCG.rate and support MTU > MCG.MTU, otherwise it is
	denied.
	If the join is actually a "create", the node has to provide the rate
	and MTU which define the MCG values.
	
	To allow the administrator to control the IPoIB MCGs' MTU and rate,
	OpenSM provides the means to control these values per partition. See
	doc/partition-config.doc.
	Still, the administrator should know the lowest MTU and rate of the
	nodes expected to join the IPoIB subnet.
	The tradeoff is in the hands of the administrator, who can set a value
	that will prevent slow nodes from joining the group, or assign a low
	value that will fit all nodes but slow down communication ...
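
The join rule described above can be sketched as a simple comparison. This is an illustration only: `struct mcg_params` and `mcg_join_allowed` are hypothetical names, rate and MTU are expressed in plain units rather than IB enum codes, and treating a value equal to the group's as sufficient is an assumption (the text above writes ">").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a multicast group or of a joining port. */
struct mcg_params {
	uint32_t rate_mbps;	/* link speed * width, in Mb/s */
	uint32_t mtu_bytes;	/* MTU in bytes */
};

/*
 * A port may join an existing group only if it can sustain the group's
 * rate and MTU; otherwise the join is denied.
 */
static inline bool mcg_join_allowed(const struct mcg_params *group,
				    const struct mcg_params *port)
{
	return port->rate_mbps >= group->rate_mbps &&
	       port->mtu_bytes >= group->mtu_bytes;
}
```

A group created with a 2048-byte MTU therefore shuts out a port limited to 1024 bytes, which is the tradeoff the administrator has to weigh.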
	
	EZ 

	Eitan Zahavi 
	Senior Engineering Director, Software Architect 
	Mellanox Technologies LTD 
	Tel:+972-4-9097208
	Fax:+972-4-9593245 
	P.O. Box 586 Yokneam 20692 ISRAEL 

	
________________________________

	From: general-bounces at lists.openfabrics.org
	[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
	Sent: Wednesday, July 25, 2007 10:01 PM
	To: Shirley Ma
	Cc: general at lists.openfabrics.org
	Subject: [ofa-general] Re: openSM: Different IB MTUs
	
	Shirley,
	
	On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

		Hal,
		
		Thanks for your prompt reply. I am asking how openSM handles
		different link MTUs in the SA MCMemberRecord MTU. For example,
		suppose some links have a 2K MTU and some a 1K MTU. When IPoIB
		is enabled, how does the SM decide the IPoIB broadcast group
		MCMemberRecord MTU size? When an IB multicast group is first
		created from a 2K MTU node, which PMTU value is attached to
		this group's MCMemberRecord MTU?


	MCMemberRecord MTU gets the group MTU (when created). This comes either
	from the first joiner with sufficient components or from
	preconfiguration (the MTU can be set in the config). If a joiner has
	insufficient MTU for the group, it is denied.
	
	-- Hal
	
	
		Thanks
		Shirley Ma
		
		"Hal Rosenstock" <hal.rosenstock at gmail.com>
		07/25/07 10:57 AM
		To: Shirley Ma/Beaverton/IBM at IBMUS
		cc: general at lists.openfabrics.org
		Subject: Re: openSM: Different IB MTUs
		
		Shirley,
		
		On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

				Hello Hal,
				
				How does openSM handle CAs with different MTUs
				in the same subnet? For example, IPoIB broadcast
				group MTU, IB multicast group PMTU? Does openSM
				pick the smallest MTU in the subnet?

		
		Are you asking about link MTU, SA PathRecord/MultiPathRecord
		MTU, SA MCMemberRecord MTU, or all of these ?
		
		-- Hal 

				Thanks
				Shirley Ma



From xma at us.ibm.com  Wed Jul 25 23:10:03 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 25 Jul 2007 23:10:03 -0700
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID: <OFCE28C865.D7AF7B0E-ON87257324.0021A99A-88257323.007957ED@us.ibm.com>


Eitan,

      That's a good approach to address the issue.

thanks
Shirley Ma


"Eitan Zahavi" <eitan at mellanox.co.il>
07/25/07 11:00 PM
To: Shirley Ma/Beaverton/IBM at IBMUS
cc: <general at lists.openfabrics.org>, "Hal Rosenstock" <hal.rosenstock at gmail.com>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

I propose that when there is no MTU in the partition policy file, OpenSM
use a configurable default from /etc/cache/opensm/opensm.opt.
Something like:
# The default MTU to be used for IPoIB and other MCGs when the
# partition-policy does not provide an exact value. The default is the
# lowest possible MTU.
mcg_default_mtu 1

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


 From: Shirley Ma [mailto:xma at us.ibm.com]
 Sent: Wednesday, July 25, 2007 10:45 PM
 To: Eitan Zahavi
 Cc: general at lists.openfabrics.org; Hal Rosenstock
 Subject: RE: [ofa-general] Re: openSM: Different IB MTUs


 Hello Eitan, Hal,

 Thanks. It's good that openSM has a configuration option to set up these
 attributes in MC. Would it be a good idea to add the following to openSM:
 when there is no MTU defined in the configuration file, the SM picks the
 smallest link MTU in the fabric by default? MTU is unlike rate: a slower
 rate might indicate a cabling problem, so using the smallest link MTU in
 the fabric might not be a bad default for MC. The reason I ask is that
 when creating an IP multicast group, MTU is not an attribute of the
 group. When mapping IP multicast to IB multicast, the IB multicast join
 might fail because of different IB link MTU sizes in the group, but the
 IP multicast group will be created successfully without knowing about the
 failure. If the admin sets the MTU in the configuration file, the admin
 would know about this failure. Otherwise, admins/users could spend too
 much time debugging their broken multicast applications.

 Thanks
 Shirley Ma

 "Eitan Zahavi" <eitan at mellanox.co.il>
 07/25/07 12:25 PM
 To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
 cc: <general at lists.openfabrics.org>
 Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

 Hi Shirley,

 I think I understand where your question comes from...
 Many have issues with heterogeneous fabrics where not all nodes have the
 same MTU or speed, especially when IPoIB relies on all nodes joining the
 broadcast group.

 The term "join" for multicast groups is a little overloaded.
 If a node joins an existing MC group it has to have a rate
 (speed * width) > MCG.rate and support MTU > MCG.MTU, otherwise it is
 denied.
 If the join is actually a "create", the node has to provide the rate and
 MTU which define the MCG values.

 To allow the administrator to control the IPoIB MCGs' MTU and rate,
 OpenSM provides the means to control these values per partition. See
 doc/partition-config.doc.
 Still, the administrator should know the lowest MTU and rate of the
 nodes expected to join the IPoIB subnet.
 The tradeoff is in the hands of the administrator, who can set a value
 that will prevent slow nodes from joining the group, or assign a low
 value that will fit all nodes but slow down communication ...

 EZ


 Eitan Zahavi
 Senior Engineering Director, Software Architect
 Mellanox Technologies LTD
 Tel:+972-4-9097208
 Fax:+972-4-9593245
 P.O. Box 586 Yokneam 20692 ISRAEL


 From: general-bounces at lists.openfabrics.org [
 mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
 Sent: Wednesday, July 25, 2007 10:01 PM
 To: Shirley Ma
 Cc: general at lists.openfabrics.org
 Subject: [ofa-general] Re: openSM: Different IB MTUs

 Shirley,

 On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
       Hal,

       Thanks for your prompt reply. I am asking for how openSM handle
       different link MTUs in SA MCMemberRecord MTU. For example, if we
       have some links MTU as 2K, some links MTU as 1K. Then when enabling
       IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU
       size? When creating an IB multicast group from a 2K MTU node first,
       which PMTU value is attaching to this IB multicast group
       MCMemberRecord MTU?


 MCMemberRecord MTU gets the group MTU (when created). This is either taken
 from the first joiner with sufficient components or preconfigured (the MTU
 can be set in the config). If a joiner has insufficient MTU for the group,
 it is denied.

 -- Hal

       Thanks
       Shirley Ma

       "Hal Rosenstock" <hal.rosenstock at gmail.com>
       07/25/07 10:57 AM
       To: Shirley Ma/Beaverton/IBM at IBMUS
       cc: general at lists.openfabrics.org
       Subject: Re: openSM: Different IB MTUs

       Shirley,

       On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote:
                   Hello Hal,

                   How does openSM handle CAs with different MTUs in the
                   same subnet? For example, IPoIB broadcast group MTU, IB
                   multicast group PMTU? Does openSM pick up the smallest
                   MTU in the subnet?


       Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
       MCMemberRecord MTU, or all of these ?

       -- Hal
                   Thanks
                   Shirley Ma


From eitan at mellanox.co.il  Wed Jul 25 23:25:13 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 09:25:13 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <20070725194856.GB31582@sashak.voltaire.com>
References: <f0e08f230707240803nf0f2297x44df64a802b080a1@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
	<f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>

> Hi Eitan, Hal,
> 
> On 20:44 Wed 25 Jul     , Eitan Zahavi wrote:
> > 
> > I am not following you.
> > Why do a user need to run -y if a simple legal cable connector is 
> > plugged?
> 
> Because duplicated GUIDs detector can aborts OpenSM when 
> regular port is reconnected to another location during hard 
> sweep. This issue is not related to loopback plug at all.
I think we should handle the "migrated port" case in a more general way:
if a port "moved" during the sweep, we have to do a new sweep anyway.
Maybe we could delay the abort until the second sweep.

So practically I propose:
1. Add a "was duplicated" state flag on the port, saying it was reported as
a duplicate GUID.
2. Set the variable that forces a second sweep (similar to the one used if
we got a Set error).
3. Repeat the sweep; if we find a port that is a duplicate and already has
the "was duplicated" flag set, abort.

A refinement for the user who is making many changes continuously might
be to keep a counter
and have the abort happen only after the Nth iteration.
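
As a rough sketch of the three steps plus the counter refinement (this is
illustration only, not OpenSM code; the class name, the string results and
the max_sweeps parameter are all made up for the example):

```python
from typing import Dict, Iterable


class DuplicateGuidTracker:
    """Hypothetical helper: counts consecutive sweeps in which a GUID was duplicated."""

    def __init__(self, max_sweeps: int = 2) -> None:
        # max_sweeps=2 corresponds to the simple "was duplicated" flag;
        # a larger value is the "abort after the Nth iteration" refinement.
        self.max_sweeps = max_sweeps
        self.seen: Dict[int, int] = {}  # guid -> consecutive sweeps reported

    def end_of_sweep(self, duplicated_guids: Iterable[int]) -> str:
        dups = set(duplicated_guids)
        # A GUID that is no longer duplicated (e.g. the cable move has
        # completed) loses its mark.
        self.seen = {g: n for g, n in self.seen.items() if g in dups}
        for guid in dups:
            self.seen[guid] = self.seen.get(guid, 0) + 1
        if any(n >= self.max_sweeps for n in self.seen.values()):
            return "abort"
        # Any duplicate seen for the first time only forces another sweep.
        return "resweep" if dups else "ok"
```

A port that merely moved shows up as a duplicate in one sweep and is clean in
the next, so it never reaches the abort; a real duplicate persists and does.
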
> 
> > The issue is only if a "loop back" plug connecting a port 
> to itself is 
> > plugged.
> 
> No, not only. Now there are two completely separate known 
> issues with duplicated GUIDs detector:
> 
> 1. Port moving
> 2. Loopback plug
> 
> And I think that _both_ should be solved. And if just using 
> '-y' could be suitable for (2) because it is esoteric 
> (although perfectly legal) use, it is not acceptable solution for (1).
> 
> I think we need to improve GUIDs duplication detector 
> instead. For example we could add NodeInfo comparison there, 
> and only in case if it is different drop GUIDs duplication 
> error. Also I think this should not be fatal error and should 
> not abort OpenSM, just logging (probably via syslog too) 
> should be sufficient - non-working port is good reason to 
> look at logs. Another ideas?
The problem with only logging is that the SM will sort of figure out the
network but will then create completely bogus routing etc.

> 
> Sasha
> 
> > Do users use these plugs? For what sake?
> > 
> > 
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect Mellanox 
> Technologies 
> > LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > Sent: Wednesday, July 25, 2007 3:19 AM
> > > To: Eitan Zahavi
> > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> > > 
> > > On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> > > > 
> > > > 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> > > > 
> > > > 		Maybe  avoid the log if -y is provided?
> > > > 
> > > > 	 
> > > > 	That avoids the spew but the duplicated GUID is
> > > important to know so
> > > > IMO something in the "middle" is needed where 
> duplicated GUIDs are 
> > > > logged but not continually the same ones.
> > > > 	[EZ]  
> > > > 	OK so in -y mode only we track which ones were reported
> > > and do not
> > > > repeat the log?
> > > 
> > > And how port moving problem should be solved?
> > > 
> > > We cannot ask an user to run OpenSM with '-y' if in 
> her/his plans to 
> > > reconnect some ports in a future and just decrease logging.
> > > 
> > > Sasha
> > > 
> 


From eitan at mellanox.co.il  Wed Jul 25 23:26:27 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 09:26:27 +0300
Subject: [ofa-general] RE: pkey.sim.tcl (was: [PATCH] opensm: detect port
	external reset andflush cached tables)
In-Reply-To: <20070725202418.GD31582@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
	<20070722102209.GR16597@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>

Hi Sasha,

I am happy you are actually using the simulator.
Please provide more info regarding the failure. You should tar and compress
the /tmp/ibmgtsim.XXXX of your run.

The following flow is performed by this test:
1. Three partitions are created with random PKeys. The first 2 have
full members. The 3rd has only partial members.
2. Each port is assigned to group 1, group 2, or a combination of groups
(1 and 3) or (2 and 3).
3. The PKey table of each port is filled, at random indexes, with the port's
"real" PKeys and other random PKeys. The table length is also random.
4. opensm is invoked with a matching partition-policy file; wait for
SUBNET UP.
5. An osmtest full inventory, including path records, is run from 5 random
ports.
   The code validates that each port's inventory reports only ports it
shares a PKey with.
6. The default PKey is removed from ALL the port PKey tables.
7. All PKey tables are validated against the initial setup to see that the
indexes of the assigned "real" PKeys were not altered by the SM.
8. A single switch is selected and its Change Bit is raised.
9. Wait for SUBNET UP.
10. Validate that all ports got their default PKey back.
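
For step 5, the "shares a PKey" check can be sketched as below. This is not
the code in pkey.check.tcl; the function names are made up, and it only
assumes the usual IB rule that two P_Keys match when their low 15 bits are
equal and at least one of them has the full-membership bit (bit 15) set.

```python
from typing import Iterable

FULL_MEMBER_BIT = 0x8000
PKEY_BASE_MASK = 0x7FFF


def pkeys_match(a: int, b: int) -> bool:
    """True if two P_Keys allow communication (same base, at least one full member)."""
    base_a = a & PKEY_BASE_MASK
    base_b = b & PKEY_BASE_MASK
    if base_a == 0 or base_a != base_b:
        # Base value 0 is the invalid P_Key and never matches anything.
        return False
    return bool((a | b) & FULL_MEMBER_BIT)


def ports_share_pkey(table_a: Iterable[int], table_b: Iterable[int]) -> bool:
    """True if any entry of one port's PKey table matches any entry of the other's."""
    entries_b = list(table_b)
    return any(pkeys_match(a, b) for a in table_a for b in entries_b)
```

Note that under this rule two ports that are both only partial members of the
3rd partition do not see each other through it.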

I suspect from our thread about not setting LFT that stage 10 failed for
you.

Eitan


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Wednesday, July 25, 2007 11:24 PM
> To: Eitan Zahavi; Yevgeny Kliteynik
> Cc: Hal Rosenstock; general at lists.openfabrics.org
> Subject: pkey.sim.tcl (was: [PATCH] opensm: detect port 
> external reset andflush cached tables)
> 
> Hi Eitan, Yevgeny,
> 
> 
> On 00:54 Wed 25 Jul     , Sasha Khapyorsky wrote:
> > 
> > This detects port external reset by validating PortState == 
> INIT, and 
> > when detected flushes cached port related tables - re-reads 
> pkey table 
> > and drops (overwrites) SL2VL and VLArb tables.
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> [snip...]
> > diff --git a/opensm/opensm/osm_port_info_rcv.c 
> > b/opensm/opensm/osm_port_info_rcv.c
> > index 6fe2d1d..0528e38 100644
> > --- a/opensm/opensm/osm_port_info_rcv.c
> > +++ b/opensm/opensm/osm_port_info_rcv.c
> > @@ -801,6 +801,12 @@ osm_pi_rcv_process(
> >        p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
> >      }
> >  
> > +    /* if port just inited or reached INIT state (external reset)
> > +       request update for port related tables */
> > +    p_physp->need_update =
> > +      (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
> > +       p_physp->need_update > 1 ) ? 1 : 0;
> > +
> >      switch( osm_node_get_type( p_node ) )
> >      {
> >      case IB_NODE_TYPE_CA:
> > @@ -824,7 +830,8 @@ osm_pi_rcv_process(
> >      /*
> >        Get the tables on the physp.
> >      */
> > -    __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, 
> p_physp );
> > +    if (p_physp->need_update)
> > +      __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, 
> p_node, p_physp 
> > + );
> 
> When testing this patch, I tried it with ibmgtsim and test failed:
> 
>   RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo 
> -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl
> 
> The failure is resulted by port pkey tables modifications 
> which is performed in pkey.sim.tcl. Why should we do this? Is 
> this legal scenario when pkey tables are modified externally 
> without Partition Manager?
> 
> Sasha
> 


From ogerlitz at voltaire.com  Thu Jul 26 00:02:20 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 26 Jul 2007 10:02:20 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A78146.1090304@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
Message-ID: <46A846FC.5040704@voltaire.com>

Sean Hefty wrote:
>> I am willing to go with the local sa coming to serve large MPI jobs, 
>> so you load as a prerequisite to spawning large all-to-all job.

>> But, I think the default for IPoIB needs to be usage of non cached PR.

> I think this ties together two things that aren't directly related.  We 
> have two network stacks running on top of each other here.  Their 
> policies should be separate.

The rationale behind my argument is that, with IPoIB being an L2 packet 
service for the network stack, when the network stack decides to renew 
its L2 info for a neighbour (e.g. because it does not reply to direct probes), 
IPoIB using cached IB info means it is doing something against what it was 
asked to do.

> As an example, let's reverse this.  Imagine instead that you implement 
> IB over IP.  Should an IB path refresh policy dictate that IP update its 
> ARP tables? 

In this setting (IB above IP), yes.

> Or, looking at it differently, do you prevent IP from 
> updating the ARP table unless the IB stack asks for it?

No. If the lower stack wants to update its L2 info, it's perfectly fine.

For example, the current IPoIB implementation flushes all its IB L2 
info (address handles and PRs) when it gets an IB event on the port 
(up/down/lid-change/sm-lid-change/client-re-register/etc.); this is very 
much the correct design.

> The policy for local PR caching should be set by an administrator.  Now, 
> we could provide a policy setting that ties it to the ARP cache, which 
> sounds like a good idea.  This will be less efficient in some use 
> models, more efficient in others.  But not all PRs belong to IPoIB, so 
> we need a way to handle this.  However, I don't believe that we have to 
> always enforce such a policy, especially since the current stack doesn't 
> have this behavior today.

I think we are making progress and starting to converge.

My suggestion is that if you put the PR caching code within the ib_sa 
module, you add a parameter to ib_sa_path_rec_get() with which the caller 
specifies whether it is willing to get a cached PR or not. I also suggest that 
rdma_resolve_route() be enhanced with a similar parameter, 
so that even native IB based ULPs can ask for non-cached info if they 
want to.
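
To make the idea concrete, here is a small user-space model of the behaviour
I have in mind. It is not kernel code and not the existing ib_sa interface:
the class, the allow_cached parameter name and the query_sa callback are all
hypothetical.

```python
from typing import Callable, Dict, Tuple

PathKey = Tuple[str, str]  # (source GID, destination GID)


class PathRecordResolver:
    """Hypothetical wrapper; query_sa stands in for a real SA PathRecord query."""

    def __init__(self, query_sa: Callable[[str, str], dict]) -> None:
        self._query_sa = query_sa
        self._cache: Dict[PathKey, dict] = {}

    def get(self, sgid: str, dgid: str, allow_cached: bool = True) -> dict:
        key = (sgid, dgid)
        if allow_cached and key in self._cache:
            return self._cache[key]
        # Either nothing is cached or the caller insisted on fresh data:
        # ask the SA and refresh the cache with the answer.
        record = self._query_sa(sgid, dgid)
        self._cache[key] = record
        return record

    def flush(self) -> None:
        """Drop all cached records, e.g. on a port event such as a LID change."""
        self._cache.clear()
```

IPoIB and the long-lived storage ULPs would pass allow_cached=False, while a
large all-to-all MPI job could keep the default and avoid hitting the SA.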

For example, I think it would be correct for IB block and file I/O ULPs 
(iSER, SRP, Lustre, rNFS, etc.) to request non-cached PRs, as their 
connection model is not all-to-all but rather n-to-m (n clients to m 
servers with m << n), the connections are long-lived (hours, days, 
weeks, more), and a connection failure caused by PR caching does not seem 
acceptable.

Or.


From mst at dev.mellanox.co.il  Thu Jul 26 00:22:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 10:22:45 +0300
Subject: [ofa-general] Re: Re: openSM: Different IB MTUs
In-Reply-To: <OFCE28C865.D7AF7B0E-ON87257324.0021A99A-88257323.007957ED@us.ibm.com>
References: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
	<OFCE28C865.D7AF7B0E-ON87257324.0021A99A-88257323.007957ED@us.ibm.com>
Message-ID: <20070726072245.GC13258@mellanox.co.il>

What does "1" mean? Surely not 1 byte MTU :)
IMO a good format would be the MTU value in bytes.
E.g. 512, 1024, 2048, 4096.
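
For reference, the IB MTU field is an enumeration (1 = 256 bytes up to
5 = 4096 bytes), so the "1" quoted below presumably means 256 bytes. A sketch
of what accepting bytes in the option could look like; parse_mtu_option is a
hypothetical name, not an existing OpenSM function:

```python
# InfiniBand encodes MTU as a small enumeration rather than a byte count.
IB_MTU_ENUM_TO_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
IB_MTU_BYTES_TO_ENUM = {size: code for code, size in IB_MTU_ENUM_TO_BYTES.items()}


def parse_mtu_option(value_in_bytes: int) -> int:
    """Hypothetical helper: turn a byte count from a config file into the IB enum."""
    if value_in_bytes not in IB_MTU_BYTES_TO_ENUM:
        valid = sorted(IB_MTU_BYTES_TO_ENUM)
        raise ValueError(f"MTU must be one of {valid}, got {value_in_bytes}")
    return IB_MTU_BYTES_TO_ENUM[value_in_bytes]
```

Rejecting anything that is not a legal byte value also catches the case where
somebody writes the enum ("1") by habit.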

Quoting Shirley Ma <xma at us.ibm.com>:
Subject: RE: Re: openSM: Different IB MTUs

Eitan,

That's a good approach to address the issue.

thanks
Shirley Ma

"Eitan Zahavi" <eitan at mellanox.co.il>
07/25/07 11:00 PM
To: Shirley Ma/Beaverton/IBM at IBMUS
cc: <general at lists.openfabrics.org>, "Hal Rosenstock" <hal.rosenstock at gmail.com>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

I propose that when there is no MTU in the partition policy file OpenSM use a
configurable default from: /etc/cache/opensm/opensm.opt.
Something like:
# The default MTU to be used for IPoIB and other MCGs when the partition-policy
# does not provide exact value. The default is the lowest possible MTU
mcg_default_mtu 1

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
From: Shirley Ma [mailto:xma at us.ibm.com]
Sent: Wednesday, July 25, 2007 10:45 PM
To: Eitan Zahavi
Cc: general at lists.openfabrics.org; Hal Rosenstock
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hello Eitan, Hal,

Thanks. It's good that openSM has a configuration option to set up these
attributes for MC groups. Would it be a good idea to add the following to
openSM: when there is no MTU defined in the configuration file, the SM picks
the smallest link MTU in the fabric by default? MTU is unlike rate: a slower
rate might indicate a cabling problem. So using the smallest link MTU in the
fabric might not be a bad default for MC groups. The reason for my request is
that when creating an IP multicast group, MTU is not an attribute of the
group. When mapping IP multicast to IB multicast, the IB multicast join might
fail because of different IB link MTU sizes in the group, but the IP multicast
group will be created successfully without knowing about the failure. If the
admin sets the MTU in the configuration file, the admin would know about this
failure. Otherwise, admins/users could spend too much time debugging their
broken multicast applications.
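
A minimal sketch of the default I am asking about, assuming MTUs are given as
IB enum values (3 = 1K, 4 = 2K); the function and parameter names are
hypothetical and this is not how openSM behaves today:

```python
from typing import Iterable, Optional


def default_group_mtu(configured_mtu: Optional[int], link_mtus: Iterable[int]) -> int:
    """Return the MTU enum to use when a multicast group is created."""
    if configured_mtu is not None:
        # An explicit value from the configuration file always wins.
        return configured_mtu
    # Otherwise fall back to the smallest link MTU seen in the fabric,
    # so that every node is able to join.
    return min(link_mtus)
```

With some links at 2K and some at 1K and nothing configured, this would give
1K, and no node would be denied when it joins.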

Thanks
Shirley Ma

"Eitan Zahavi" <eitan at mellanox.co.il>
07/25/07 12:25 PM
To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
cc: <general at lists.openfabrics.org>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hi Shirley,

I think I understand where your question comes from...
Many people have issues with heterogeneous fabrics where not all nodes have the
same MTU or speed.
Especially when IPoIB relies on all nodes joining the broadcast group.

The term "join" for multicast groups is a little overloaded.
If a node joins an existing MC group, its rate (speed * width) must be >=
MCG.rate and its supported MTU must be >= MCG.MTU; otherwise it is denied.
If the join is actually a "create" the node has to provide the rate and MTU
which define the MCG values.

To let the administrator control the MTU and rate of the IPoIB MCGs, OpenSM
provides the means to set these values per partition. See the
doc/partition-config.doc
The administrator should still know the lowest MTU and rate among the nodes
that are expected to join the IPoIB subnet.
The tradeoff is in the hands of the administrator, who can set a high value
that will prevent slow nodes from joining the group,
or a low value that will fit all nodes but slow down communication ...

EZ

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
From: general-bounces at lists.openfabrics.org [
mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
Sent: Wednesday, July 25, 2007 10:01 PM
To: Shirley Ma
Cc: general at lists.openfabrics.org
Subject: [ofa-general] Re: openSM: Different IB MTUs

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

        Hal,

        Thanks for your prompt reply. I am asking for how openSM handle
        different link MTUs in SA MCMemberRecord MTU. For example, if we have
        some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB,
        how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When
        creating an IB multicast group from a 2K MTU node first, which PMTU
        value is attaching to this IB multicast group MCMemberRecord MTU?


MCMemberRecord MTU gets the group MTU (when created). This is either taken from
the first joiner with sufficient components or preconfigured (the MTU can be set
in the config). If a joiner has insufficient MTU for the group, it is denied.

-- Hal

        Thanks
        Shirley Ma

        "Hal Rosenstock" <hal.rosenstock at gmail.com>
        07/25/07 10:57 AM
        To: Shirley Ma/Beaverton/IBM at IBMUS
        cc: general at lists.openfabrics.org
        Subject: Re: openSM: Different IB MTUs

        Shirley,

        On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote:
                        Hello Hal,

                        How does openSM handle CAs with different MTUs in the
                        same subnet? For example, IPoIB broadcast group MTU, IB
                        multicast group PMTU? Does openSM pick up the smallest
                        MTU in the subnet?


        Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
        MCMemberRecord MTU, or all of these ?

        -- Hal
                        Thanks
                        Shirley Ma



-- 
MST


From amar.mudrankit at gmail.com  Thu Jul 26 00:36:32 2007
From: amar.mudrankit at gmail.com (Amar Mudrankit)
Date: Thu, 26 Jul 2007 13:06:32 +0530
Subject: [ofa-general] ARP in IPoIB
Message-ID: <c8028d330707260036u1a1bffd4p1c4f70dec1debc1c@mail.gmail.com>

Hello all,

    I am new to this group, so the following questions may sound a bit basic,
but I would be very happy if somebody could help me out with them.

1] Does the current implementation of IPoIB support IP hosts spanning different
IB subnets? If so, how does it resolve IP addresses (to QPN+GID etc.)? How
would the broadcast IP (multicast IB) request reach a different IB subnet,
given that the router would need multicast routing capabilities? Or does it
mean that the current IPoIB implementation supports only hosts within a
single IB subnet and not beyond that?

2] What is the all-router multicast group (RFC 4391, section 10)? How does it
help in routing a packet to an IP host on a different IB subnet?

Thanks,

Regards,
Amar

From mst at dev.mellanox.co.il  Thu Jul 26 00:56:01 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 10:56:01 +0300
Subject: [ofa-general] Re: ARP in IPoIB
In-Reply-To: <c8028d330707260036u1a1bffd4p1c4f70dec1debc1c@mail.gmail.com>
References: <c8028d330707260036u1a1bffd4p1c4f70dec1debc1c@mail.gmail.com>
Message-ID: <20070726075601.GG13258@mellanox.co.il>

> Quoting Amar Mudrankit <amar.mudrankit at gmail.com>:
> Subject: ARP in IPoIB
> 
> Hello all,
> 
>     Being new to this group, following questions may sound a bit basic level
> but I would be really very happy if somebody could help me out in those.
> 
> 1] Does the current implementation of IPoIB support IP hosts spanning different
> IB subnets? If so, how does it resolve IP addresses (to QPN+GID etc.)? How
> would the broadcast IP (multicast IB) request reach a different IB subnet,
> given that the router would need multicast routing capabilities? Or does it
> mean that the current IPoIB implementation supports only hosts within a
> single IB subnet and not beyond that?
> 
> 2] What is the all-router multicast group (RFC 4391, section 10)? How does it
> help in routing a packet to an IP host on a different IB subnet?

Please refer to this document: RFC 4392, IP over InfiniBand (IPoIB) Architecture,
Chapter 3.3, "IP Subnets Across IB Subnets", which has some answers to your questions.

Here's a link:
http://tools.ietf.org/html/rfc4392

-- 
MST


From jackm at dev.mellanox.co.il  Thu Jul 26 01:16:58 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Jul 2007 11:16:58 +0300
Subject: [ofa-general] [PATCH] mlx4: fix double-kfree in mlx4_mr_alloc error
	flow
Message-ID: <200707261116.58679.jackm@dev.mellanox.co.il>

Eliminate double-kfree in mlx4_mr_alloc error flow.

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

Index: connectx/drivers/net/mlx4/mr.c
===================================================================
--- connectx.orig/drivers/net/mlx4/mr.c	2007-07-26 10:04:57.000000000 +0300
+++ connectx/drivers/net/mlx4/mr.c	2007-07-26 10:08:14.070595000 +0300
@@ -255,10 +255,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, 
 	int err;
 
 	index = mlx4_bitmap_alloc(&priv->mr_table.mpt_bitmap);
-	if (index == -1) {
-		err = -ENOMEM;
-		goto err;
-	}
+	if (index == -1)
+		return -ENOMEM;
 
 	mr->iova       = iova;
 	mr->size       = size;
@@ -275,9 +273,6 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, 
 
 err_index:
 	mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
-
-err:
-	kfree(mr);
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_alloc);


From umaxx at oleco.net  Thu Jul 26 01:25:53 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Thu, 26 Jul 2007 10:25:53 +0200
Subject: [ofa-general] ibv_modify_qp() return value 22
Message-ID: <20070726102553.5b02caea@marvin.local>

Hi,

ibv_modify_qp() fails with return value 22 when I try to open a new CM
connection under load (already ~3000 RDMA connections opened). I tried
to figure out what return value 22 means but could not find it in the
mthca kernel driver.

Any hints? What does return value 22 mean?
I use OFED-1.1 under Debian with vanilla kernel 2.6.18.
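
For what it is worth, ibv_modify_qp() is documented to return 0 on success
and an errno value on failure, so the number can at least be decoded with the
standard errno tables. A small sketch (describe_errno is just a helper name
made up here):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Map a numeric errno value to its symbolic name and message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


print(describe_errno(22))
```

On Linux, 22 decodes to EINVAL, which still does not say which argument or
QP attribute was rejected.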

Regards,

Joerg


From vlad at lists.openfabrics.org  Thu Jul 26 01:39:41 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 26 Jul 2007 01:39:41 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070726-0100 daily build status
Message-ID: <20070726083941.5CEE4E60897@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From mst at dev.mellanox.co.il  Thu Jul 26 01:42:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 11:42:11 +0300
Subject: [ofa-general] [PATCH trivial v2] add includes to
	scsi_transport_iscsi.h
In-Reply-To: <20070725110907.GF3826@mellanox.co.il>
References: <20070725110907.GF3826@mellanox.co.il>
Message-ID: <20070726084211.GB22557@mellanox.co.il>

scsi/scsi_transport_iscsi.h uses struct mutex and struct list_head,
so while linux/mutex.h and linux/list.h seem to be pulled in indirectly
by one of the headers it includes, the right thing
is to include linux/mutex.h and linux/list.h directly.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Changelog:

Mike Christie pointed out that linux/list.h is missing too.

diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h
index 706c0cd..7ff6199 100644
--- a/include/scsi/scsi_transport_iscsi.h
+++ b/include/scsi/scsi_transport_iscsi.h
@@ -24,6 +24,8 @@
 #define SCSI_TRANSPORT_ISCSI_H
 
 #include <linux/device.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
 #include <scsi/iscsi_if.h>
 
 struct scsi_transport_template;

-- 
MST


From amar.mudrankit at gmail.com  Thu Jul 26 02:36:02 2007
From: amar.mudrankit at gmail.com (Amar Mudrankit)
Date: Thu, 26 Jul 2007 15:06:02 +0530
Subject: [ofa-general] Re: ARP in IPoIB
In-Reply-To: <20070726075601.GG13258@mellanox.co.il>
References: <c8028d330707260036u1a1bffd4p1c4f70dec1debc1c@mail.gmail.com>
	<20070726075601.GG13258@mellanox.co.il>
Message-ID: <c8028d330707260236w60b41487ufe2e84dba138a453@mail.gmail.com>

Michael,
thanks for your reply. But it raises a couple of questions.

1] If a multicast routing protocol for IB routers has not yet been specified
by IBTA or IETF, then the current implementation restricts an IP subnet to a
single IB subnet. According to RFC 4391, section 9.1.1, the link-layer address
is formed from the combination of GID + QPN. If we are not spanning IB
subnets, what is the use of the GID, given that we still need to get the LID
from the GID? In that case an ARP reply carrying the LID, Q_Key and other path
information would probably be more helpful, since it resolves the path in one
step rather than two as with the GID (first to resolve the GID and then to get
the LID).

2] Looking at the code, dev->dev_addr is still made up of GID+QPN. What is the
purpose of implementing it this way if, per point 1 above, dev->dev_addr could
be made up of LID, Q_Key, etc.?
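
For reference, the 20-octet link-layer address from RFC 4391, section 9.1.1
(one reserved/flags octet, a 24-bit QPN, then the 16-byte GID) can be sketched
as below. This is only an illustration of the layout, not the kernel's code;
the function names and the sample QPN/GID values are made up for the example.

```python
import struct


def pack_ipoib_hwaddr(qpn, gid, flags=0):
    """Build the 20-byte address: 1 reserved/flags octet, 24-bit QPN, 16-byte GID."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one octet")
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be exactly 16 bytes")
    # The first 32-bit word holds the flags octet followed by the QPN.
    return struct.pack("!I", (flags << 24) | qpn) + bytes(gid)


def unpack_ipoib_hwaddr(addr):
    """Split a 20-byte address back into (flags, qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("IPoIB link-layer address must be 20 bytes")
    (word,) = struct.unpack("!I", addr[:4])
    return word >> 24, word & 0xFFFFFF, bytes(addr[4:])


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the layout.
    gid = bytes.fromhex("fe800000000000000002c90200000001")
    addr = pack_ipoib_hwaddr(0x000404, gid)
    print(addr.hex(":"))
```

Note that the address carries no LID, which is why a path lookup is still
needed after ARP resolution.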

On 7/26/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Amar Mudrankit <amar.mudrankit at gmail.com>:
> > Subject: ARP in IPoIB
> >
> > Hello all,
> >
> >     Being new to this group, following questions may sound a bit basic level
> > but I would be really very happy if somebody could help me out in those.
> >
> > 1] Does current implementation of IPoIB supports IP hosts spanning different IB
> > subnets? If so, how does it resolves the IP addresses(QPN+GID etc) because how
> > would the broadcast IP ( multicast IB) request would reach different IB subnet
> > as the router should have multicast routing capabilities? Does it mean that the
> > current IPoIB implementation is supported for hosts within single IB subnet and
> > not beyond that?
> >
> > 2] What is all-router multicast group(RFC-4391 section 10)? How does it help in
> > routing packet to a IP host present on different IB-subnet?
>
> Pls refer to this document: RFC 4392, IP over InfiniBand (IPoIB) Architecture,
> chapter 3.3, "IP Subnets Across IB Subnets", which has some answers to your
> questions.
>
> Here's a link:
> http://tools.ietf.org/html/rfc4392
>
> --
> MST
>

From vlad at lists.openfabrics.org  Thu Jul 26 02:49:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Thu, 26 Jul 2007 02:49:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070726-0200 daily build status
Message-ID: <20070726094947.25B50E60874@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From dotanb at dev.mellanox.co.il  Thu Jul 26 03:00:01 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 26 Jul 2007 13:00:01 +0300
Subject: [ofa-general] ibv_modify_qp() return value 22
In-Reply-To: <20070726102553.5b02caea@marvin.local>
References: <20070726102553.5b02caea@marvin.local>
Message-ID: <46A870A1.5090401@dev.mellanox.co.il>

Hi.
Joerg Zinke wrote:
> Hi,
>
> ibv_modify_qp() fails with return value 22 when I try to open a new CM
> connection under load (already ~3000 RDMA connections opened). I tried
> to figure out what return value 22 means but could not find it in the
> mthca kernel driver.
>
> Any hints? What does return value 22 mean?
>   
The value 22 from ibv_modify_qp() is EINVAL: it means that an invalid
parameter was passed when calling this verb.
If you call ibv_modify_qp() without any load (only several QPs),
do you still get this error?
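
As a quick way to decode such values: ibv_modify_qp() returns 0 on success or
an errno value on failure, so the number can be looked up in the system errno
table. A minimal sketch (the helper name describe_errno is just for this
example):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno-style return value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_errno(22)
    print(name, "-", message)
```

The lookup only names the error; it does not say which attribute or mask bit
was rejected.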

thanks
Dotan


From kliteyn at dev.mellanox.co.il  Thu Jul 26 03:36:40 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 26 Jul 2007 13:36:40 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A80E37.5080304@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A7DEF8.7040608@ichips.intel.com>
	<46A7E59A.5070801@dev.mellanox.co.il>
	<46A80E37.5080304@ichips.intel.com>
Message-ID: <46A87938.6040305@dev.mellanox.co.il>

Sean Hefty wrote:
>> This file has *all* the possible keywords.
>> The administrator really doesn't have to use them all.
>> For instance, there are three different ways to define port groups:
>>  - by guid list
>>  - by node type
>>  - by port names
>> You could stick with the guids only - this gives you all the functionality
>> you need, but by doing so you lose some flexibility.
> 
> Beyond referring to port GUIDs, I'm also referring to items like:
> 
> sl: 0
> mtu-limit: 1
> rate-limit: 1
> packet-life: 12
> path-bits: 2,4,8-32
> 
> vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
> vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
> vl-high-limit: 10
> 
> sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
> 
> sl: 0
> mtu-limit: 1
> rate-limit: 1
> packet-life: 12
> path-bits: 2,4,8-32
> 
> This is really low level data, akin to the administrator manually 
> programming the switch tables.  My take is that we should drop tons of 
> this flexibility in favor of something much simpler for the administrator.

But again, the administrator doesn't *have* to use all these.
He can simply define sl2vl-tables, and then match service-id
(in qos-match-rules) to a certain sl (in qos-levels).
That's it.
No MTU, rate, packet lifetime or any other low level data.
Does the following file look better?

     port-groups
         port-group
             name: Part1
             port-guid: 0x1000000000000001
             port-guid: 0x1000000000000002
         end-port-group

         port-group
             name: Part2
             port-guid: 0x1000000000000005
             port-guid: 0x1000000000000006
         end-port-group
     end-port-groups

     qos-setup
         sl2vl-tables
             sl2vl-scope
                 group: Part1
                 from: *
                 to: *
                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
             end-sl2vl-scope
             sl2vl-scope
                 group: Part2
                 from: *
                 to: *
                 sl2vl-table: 0,1,2,3,4,5,6,7,8,0,1,2,3,4,0
             end-sl2vl-scope
         end-sl2vl-tables
     end-qos-setup

     qos-levels
         qos-level
             sl: 2
         end-qos-level
         qos-level
             sl: 5
         end-qos-level
     end-qos-levels

     qos-match-rules
         qos-match-rule
             service-id: 4001-5000
             qos-level-sn: 1
         end-qos-match-rule
         qos-match-rule
             service-id: 5001-6000
             qos-level-sn: 2
         end-qos-match-rule
     end-qos-match-rules
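
To show that a file like this is simple to process, here is a rough sketch of
a reader for the section/end-section syntax. It is not the OpenSM parser; the
function names and the tree layout are assumptions made for the example. It
also pads an sl2vl-table with zeros up to 16 entries, as the RFC describes for
tables with fewer than 16 values.

```python
def parse_policy(text):
    """Parse 'keyword ... end-keyword' sections with 'key: value' fields into a tree."""
    root = {"name": "root", "fields": [], "children": []}
    stack = [root]
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        if ":" in line:
            key, value = line.split(":", 1)
            stack[-1]["fields"].append((key.strip(), value.strip()))
        elif line.startswith("end-"):
            # A closing keyword must match the innermost open section.
            if len(stack) == 1 or stack[-1]["name"] != line[len("end-"):]:
                raise ValueError(f"line {lineno}: unexpected {line!r}")
            stack.pop()
        else:
            node = {"name": line, "fields": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed section {stack[-1]['name']!r}")
    return root


def pad_sl2vl(value):
    """Turn '0,1,2,...' into 16 VL numbers; missing entries become 0."""
    vls = [int(v) for v in value.split(",")]
    if len(vls) > 16:
        raise ValueError("sl2vl-table takes at most 16 values")
    return vls + [0] * (16 - len(vls))


if __name__ == "__main__":
    sample = """
    qos-levels
        qos-level
            sl: 2
        end-qos-level
    end-qos-levels
    """
    tree = parse_policy(sample)
    print(tree["children"][0]["children"][0]["fields"])
```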

-- Yevgeny


> - Sean
> 


From eaburns at iol.unh.edu  Thu Jul 26 04:11:06 2007
From: eaburns at iol.unh.edu (Ethan Burns)
Date: Thu, 26 Jul 2007 07:11:06 -0400
Subject: [ofa-general] iSER header
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>
References: <20070709144702.GB24125@postal.iol.unh.edu>
	<46933130.6040100@voltaire.com>
	<20070725192230.GA13579@postal.iol.unh.edu>
	<39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>
Message-ID: <20070726111106.GA14180@postal.iol.unh.edu>

On Thu, Jul 26, 2007 at 07:58:18AM +0300, Erez Zilber wrote:

[...]

> I guess that you meant that you're using only the UNH iSCSI target
> (because the iSER initiator should be used with open-iscsi).

Well, we are actually using both initiator and target.  We grabbed an
older version of the datamover implementation from the open-iser-target
project and are fitting it to work with both the UNH-iSCSI target and
initiator.  After we get this working, we would like to try and interop
with the open-iscsi implementation.  This is why I was concerned about
the header.

> Will you send patches for iSER soon? I'd like to test it, and make sure
> that iSER over IB is not damaged.

Our patches may not interest you since we are using an older version of
the iSER code.  However, we will also be exploring the use of IB with our
implementations.  Will this require us to use the same non-standard iSER
header in some cases?

Thanks for your help,
Ethan


From kliteyn at dev.mellanox.co.il  Thu Jul 26 05:39:36 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 26 Jul 2007 15:39:36 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <20070723002010.GU27878@sashak.voltaire.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<20070723002010.GU27878@sashak.voltaire.com>
Message-ID: <46A89608.9010709@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> Some initial comments.
> 
> On 01:07 Sun 22 Jul     , Yevgeny Kliteynik wrote:
>>  Hi All
>>
>>  Please find the attached RFC describing how QoS policy support could be 
>>  implemented in the OpenFabrics stack.
>>  Your comments are welcome.
>>
>>  -- Yevgeny
>>
>>                RFC: OpenFabrics Enhancements for QoS Support
>>               ===============================================
>>
>>  Authors: . Eitan Zahavi <eitan at mellanox.co.il>
>>  Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
>>  Date: .... Jul 2007.
>>  Revision:  0.2
>>
>>  Table of contents:
>>  1. Overview
>>  2. Architecture
>>  3. Supported Policy
>>  4. CMA functionality
>>  5. IPoIB functionality
>>  6. SDP functionality
>>  7. SRP functionality
>>  8. iSER functionality
>>  9. OpenSM functionality
>>
>>  1. Overview
>>  ------------
>>  Quality of Service requirements stem from the realization of I/O consolidation
>>  over IB network: As multiple applications and ULPs share the same fabric, means
>>  to control their use of the network resources are becoming a must. The basic
>>  need is to differentiate the service levels provided to different traffic flows,
>>  such that a policy could be enforced and control each flow utilization of the
>>  fabric resources.
>>
>>  IBTA specification defined several hardware features and management interfaces
>>  to support QoS:
>>  * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
>>  * Arbitration between traffic of different VLs is performed by a 2 priority
>>    levels weighted round robin arbiter. The arbiter is programmable with
>>    a sequence of (VL, weight) pairs and maximal number of high priority credits
>>    to be processed before low priority is served
>>  * Packets carry class of service marking in the range 0 to 15 in their
>>    header SL field
>>  * Each switch can map the incoming packet by its SL to a particular output
>>    VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
>>  * The Subnet Administrator controls each communication flow parameters
>>    by providing them as a response to Path Record (PR) or MultiPathRecord (MPR)
>>    queries
>>
>>  The IB QoS features provide the means to implement a DiffServ like architecture.
>>  DiffServ architecture (IETF RFC2474 2475) is widely used today in highly dynamic
>>  fabrics.
>>
>>  This proposal provides the detailed functional definition for the various
>>  software elements that are required to enable a DiffServ like architecture over
>>  the OpenFabrics software stack.
>>
>>
>>
>>  2. Architecture
>>  ----------------
>>  This proposal split the QoS functionality between the SM/SA, CMA and the various
>>  ULPS. We take the "chronology approach" to describe how the overall system
>>  works:
>>
>>  2.1. The network manager (human) provides a set of rules (policy) that defines
>>  how the network is being configured and how its resources are split to different
>>  QoS-Levels. The policy also define how to decide which QoS-Level each
>>  application or ULP or service use.
>>
>>  2.2. The SM analyzes the provided policy to see if it is realizable and performs
>>  the necessary fabric setup. The SM may continuously monitor the policy and adapt
>>  to changes in it. Part of this policy defines the default QoS-Level of each
>>  partition. The SA is being enhanced to match the requested Source, Destination,
>>  QoS-Class, Service-ID (and optionally SL and priority) against the policy. So
>>  clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also
>>  enhanced to support setting up partitions with appropriate IPoIB broadcast
>>  group. This broadcast group carries its QoS attributes: SL, MTU and
>>  RATE.
>>
>>  2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the
>>  multicast group which forms the broadcast group of this partition.
>>
>>  2.4. MPI which provides non IB based connection management should be configured
>>  to run using hard coded SLs. It uses these SLs for every QP being opened.
>>
>>  2.5. ULPs that use CM interface (like SRP) should have their own pre-assigned
>>  Service-ID and use it while obtaining PR/MPR for establishing connections.
>>  The SA receiving the PR/MPR should match it against the policy and return
>>  the appropriate PR/MPR including SL, MTU and RATE.
>>
>>  2.6. ULPs and programs using CMA to establish RC connection should provide the
>>  CMA the target IP and Service-ID. Some of the ULPs might also provide QoS-Class
>>  (E.g. for SDP sockets that are provided the TOS socket option). The CMA should
>>  then use the provided Service-ID and optional QoS-Class and pass them in the
>>  PR/MPR request. The resulting PR/MPR should be used for configuring the
>>  connection QP.
>>
>>  PathRecord and MultiPathRecord enhancement for QoS:
>>  As mentioned above the PathRecord and MultiPathRecord attributes should be
>>  enhanced to carry the Service-ID which is a 64bit value, which has been
>>  standardized by the IBTA. A new field QoS-Class is also provided.
>>  A new capability bit should describe the SM QoS support in the SA class port
>>  info. This approach provides an easy migration path for existing access layer
>>  and ULPs by not introducing new set of PR/MPR attribute.
>>
>>
>>  3. Supported Policy
>>  --------------------
>>
>>  The QoS policy supported by this proposal is divided into 4 sub sections:
>>
>>  I) Port Group: a set of CAs, Routers or Switches that share the same settings.
>>  A port group might be a partition defined by the partition manager policy in
>>  terms of GUIDs. Future implementations might provide support for NodeDescription
>>  based definition of port groups.
> 
> Isn't it better to have port group definitions in separate file? So
> groups could be shared with other OpenSM components (as discussed). Even
> if such group sharing is not high priority functionality this should
> save us from redoing things later.
> 
>>  II) Fabric Setup:
>>  Defines how the SL2VL and VLArb tables should be setup. This policy definition
>>  assumes the computation of overall end to end network behavior should be performed
>>  outside of OpenSM.
>>
>>  III) QoS-Levels Definition:
>>  This section defines the possible sets of parameters for QoS that a client
>>  might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
>>  Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
>>
>>  IV) Matching Rules:
>>  A list of rules that match an incoming PR/MPR request to a QoS-Level. The
>>  rules are processed in order such as the first match is applied. Each rule is
>>  built out of a set of match expressions which should all match for the rule to
>>  apply. The matching expressions are defined for the following fields
>>  ** SRC and DST to lists of port groups
>>  ** Service-ID to a list of Service-ID or Service-ID ranges
>>  ** QoS-Class to a list of QoS-Class values or ranges
>>
>>  QoS Policy file syntax
>>
>>  * Empty lines are ignored
>>  * Leading and trailing blanks, as well as empty lines, are ignored, so the
>>    indentation in the example is just for better readability
>>  * Comments are started with the pound sign (#) and terminated by EOL
>>  * Comments may appear only in a separate line
> 
> Why? What is wrong with:
> 
> 	port-name: vs1/HCA-1/P1   # my best port

I can allow this too, but then the pound sign, wherever it appears,
would start a comment. There would be no \# or similar escape
to use it anywhere else - I don't want to complicate the
syntax. Sounds OK?
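
With that rule, comment handling stays a one-liner. A sketch of what I mean
(the function name is only for illustration):

```python
def strip_comment(line):
    """Drop everything from the first '#' to end of line, then trim blanks."""
    return line.split("#", 1)[0].strip()


if __name__ == "__main__":
    print(strip_comment("port-name: vs1/HCA-1/P1   # my best port"))
```

The price is that no value (for example a free-text "use:" description) can
contain a pound sign.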


>>  * Keywords that denote section/subsection start have matching closing keywords
>>  * Any keyword should be the first non-blank in the line
>>
>>  QoS Policy file example
>>
>>      # Port Groups define sets of ports to be used later in the settings
>>      port-groups
>>          # using port GUIDs
>>          port-group
>>              name: Storage
>>              # "use" is just a description that is used for logging.
>>              #  Other than that, it is just a commentary
>>              use: our SRP storage targets
>>              port-guid: 0x1000000000000001
>>              port-guid: 0x1000000000000002
>>          end-port-group
>>
>>          port-group
>>              name: Virtual Servers
>>              use: node desc and IB port num
>>              # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
>>              # "hostname" and "CA-num" are compared to the first 2 words of
>>              # NodeDescription, and "Pnum" is a port number on that node.
>>              port-name: vs1/HCA-1/P1
>>              port-name: vs3/HCA-1/P1
>>              port-name: vs3/HCA-2/P2
> 
> What about wild carding here, like vs1/*/* or just vs1?

Good idea.
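
A possible way to do the matching, as a sketch only: compare the pattern
component by component, and treat a shorter pattern such as "vs1" as if the
missing components were "*". That padding rule and the function names are my
assumptions here, nothing is decided yet.

```python
from fnmatch import fnmatchcase


def expand_pattern(pattern):
    """Pad 'vs1' or 'vs1/HCA-1' with '*' up to hostname/CA-num/Pnum."""
    parts = pattern.split("/")
    if len(parts) > 3:
        raise ValueError("port-name pattern has at most 3 components")
    return parts + ["*"] * (3 - len(parts))


def port_name_matches(pattern, port_name):
    """True if every component of port_name matches the padded pattern."""
    name_parts = port_name.split("/")
    if len(name_parts) != 3:
        return False
    return all(
        fnmatchcase(name, pat)
        for name, pat in zip(name_parts, expand_pattern(pattern))
    )


if __name__ == "__main__":
    for name in ("vs1/HCA-1/P1", "vs3/HCA-1/P1"):
        print(name, port_name_matches("vs1/*/*", name))
```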

>>          end-port-group
>>
>>          # using partitions defined in the partition policy
>>          port-group
>>              name: Group for Partition 1
>>              use: default settings
>>              partition: Part1
>>          end-port-group
>>
>>          # using node types CA|ROUTER|SWITCH
> 
> Probably also ALL (for all ports), SELF (for SM port)?

Agree.

>>          port-group
>>              name: Routers
>>              use: all routers
>>              node-type: ROUTER
>>          end-port-group
>>
>>      end-port-groups
> 
> I agree that proposed syntax has better for human readability than pure
> XML, but isn't stuff like this will be more user-friendly?
> 
> Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ;
> 
> , or
> 
> Storage "Free Text description" { 0x10001, 0x10002, 0x10003 };
> 
> , or
> 
> Storage "Free Text description": ROUTERS, CAS ;

GUID list is a good idea.
Not sure about the other stuff. A certain port group can be defined
both by guids and by node-types. How about this:

           port-group
               name: routers_and_mgt_nodes
               use: all routers and management nodes
               node-type: ROUTER
               port-guid: 0x10001, 0x10002, 0x10003
           end-port-group
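
Reading such a comma-separated "port-guid:" value would be straightforward. A
small sketch, assuming hex GUIDs with or without the 0x prefix (the function
name is made up):

```python
def parse_guid_list(value):
    """Parse '0x10001, 0x10002' into a list of 64-bit integers."""
    guids = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            raise ValueError("empty entry in GUID list")
        guid = int(token, 16)  # accepts an optional 0x prefix
        if not 0 <= guid < (1 << 64):
            raise ValueError(f"GUID out of 64-bit range: {token}")
        guids.append(guid)
    return guids


if __name__ == "__main__":
    print([hex(g) for g in parse_guid_list("0x10001, 0x10002, 0x10003")])
```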

>>      qos-setup
>>
>>          # define all types of VLArb tables. The length of the tables should
>>          # match the physically supported tables by their target ports
>>          vlarb-tables
>>              # scope defines the exact ports the VLArb tables apply to
>>              vlarb-scope
>>                  # defining VLArb tables on all the ports that belong to
>>                  # port group 'Storage', and on all the ports connected
>>                  # to ports of port group 'Storage'
>>                  group: Storage
> 
> So "group" is only for ports that belong to 'Storage'?

Yes, and "across" is for ports that are connected to ports of group 'Storage'

>>                  # "across" means all the ports that are connected to ports
>>                  # that belong to the specified port group
>>                  across: Storage
>>                  # VLArb table holds VL and weight pairs
>>                  vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>>                  vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>>                  vl-high-limit: 10
>>              end-vlarb-scope
>>              # There can be several scopes
>>          end-vlarb-tables
>>
>>          sl2vl-tables
>>              # Scope defines the exact devices and in/out ports tables apply to.
>>              # Note: if the same port is matching several rules the *FIRST* one applies.
>>              sl2vl-scope
>>                  # SL2VL tables are organized as SL2VL(in-port,out-port)
>>                  # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>>                  # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>>                  #
>>                  # The following example specifies that all the SL2VL tables
>>                  # entries should be defined for all the ports of group Part1:
>>                  group: Part1
>>                  from: *
>>                  to: *
>>                  # SL2VL table has to have 16 values at max - one for each SL.
>>                  # If the user specifies less than 16 values, all the missing
>>                  # VL values will be implicitly set to 0
>>                  sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>>              end-sl2vl-scope
>>
>>              sl2vl-scope
>>                  # "across-to" is a combination of "across" keyword (definition can be found
>>                  # in VLArb tables section) and "to" keyword.
>>                  # "across: PortGroupName" refers to all the ports that are connected
>>                  # to ports that belong to PortGroupName.
>>                  #
>>                  # Example of "across-to" usage:
>>                  #   A user has a set of 'special' nodes (e.g. storage nodes), and all
>>                  #   the traffic to these nodes has to get specific VL.
>>                  #   The solution is to define port group (e.g. "Storage") that will
>>                  #   include all the ports of these nodes, and then to configure SL2VL
>>                  #   tables on all the switch ports that are connected to the Storage
>>                  #   port group by specifying "across-to: Storage".
>>                  #
>>                  across-to: Storage2
>>                  # Similar to "across-to", "across-from" is a combination of "across"
>>                  # and "from" keywords
>>                  across-from: Storage1
>>                  sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>>              end-sl2vl-scope
>>          end-sl2vl-tables
>>
>>      end-qos-setup
>>
>>
>>      qos-levels
>>
>>          # the first one is just setting SL
>>          qos-level
>>              use: for the lowest priority communication
>>              sl: 15
>>              packet-life: 16
>>          end-qos-level
>>          # the second sets SL and QoS Class
>>          qos-level
>>              use: low latency best bandwidth
>>              sl: 0
>>          end-qos-level
>>          # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
>>          qos-level
>>              use: just an example
>>              sl: 0
>>              mtu-limit: 1
>>              rate-limit: 1
>>              packet-life: 12
>>              # Path Bits can be used e.g. to provide a different routes through the
>>              # subnet to a particular port
>>              path-bits: 2,4,8-32
>>          end-qos-level
>>
>>      end-qos-levels
>>
>>
>>      # Match rules are scanned in a first-fit manner (like firewall rules table)
>>      qos-match-rules
>>
>>          # matching by single criteria: class (list of values and ranges)
>>          qos-match-rule
>>              # just a description
>>              use: low latency by class 7-9 or 11
>>              qos-class: 7-9,11
>>              # number of qos-level to apply to the matching PR/MPR
>>              qos-level-sn: 1
> 
> Isn't it better and less error prone to match qos_level by name and not
> by sequential number?

A qos-level can have a name, and then a qos-match-rule would refer to that name.
But matching a qos-level by sequential number makes it really easy to locate
the referenced qos-level. Every PR/MPR request goes through this process,
so saving some runtime in this area is important IMHO.
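
One way to get both, sketched under my own assumptions: resolve names to
indices once when the policy is loaded, so the per-request path only walks the
rules first-fit and returns an index. The function names and the level names
"first"/"second" are hypothetical; indices here are 0-based list positions,
unlike the 1-based qos-level-sn in the file.

```python
def parse_ranges(spec):
    """Turn '22,4719-5000' into [(22, 22), (4719, 5000)]."""
    ranges = []
    for token in spec.split(","):
        low, _, high = token.strip().partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def compile_rules(rules, level_names):
    """Resolve each rule's qos-level name to a list index at load time."""
    index_of = {name: i for i, name in enumerate(level_names)}
    return [(parse_ranges(spec), index_of[name]) for spec, name in rules]


def match_level(service_id, compiled_rules):
    """First-fit scan: return the level index of the first matching rule."""
    for ranges, level_index in compiled_rules:
        if any(low <= service_id <= high for low, high in ranges):
            return level_index
    return None


if __name__ == "__main__":
    compiled = compile_rules(
        [("4001-5000", "first"), ("5001-6000", "second")],
        ["first", "second"],
    )
    print(match_level(4500, compiled), match_level(5500, compiled))
```

An unknown name fails with KeyError at load time, not per request.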

>>          end-qos-match-rule
>>          # show matching by destination group AND service-ids
>>          qos-match-rule
>>              use: Storage targets connection
>>              destination: Storage
>>              service-id: 22,4719-5000
>>              qos-level-sn: 2
>>          end-qos-match-rule
>>          # show matching by source group only
>>          qos-match-rule
>>              use: bla bla
>>              source: Storage
>>              qos-level-sn: 3
>>          end-qos-match-rule
>>
>>      end-qos-match-rules
>>
>>
>>  4. IPoIB
>>  ---------
>>
>>  IPoIB already query the SA for its broadcast group information. The additional
>>  functionality required is for IPoIB to provide the broadcast group SL, MTU,
>>  and RATE in every following PathRecord query performed when a new UDAV is
>>  needed by IPoIB.
>>  We could assign a special Service-ID for IPoIB use but since all communication
>>  on the same IPoIB interface shares the same QoS-Level without the ability to
>>  differentiate it by target service we can ignore it for simplicity.
>>
>>  5. CMA features
>>  ----------------
>>
>>  The CMA interface supports Service-ID through the notion of port space as a
>>  prefixes to the port_num which is part of the sockaddr provided to
>>  rdma_resolve_addr(). What is missing is the explicit request for a QoS-Class that
>>  should allow the ULP (like SDP) to propagate a specific request for a class of
>>  service. A mechanism for providing the QoS-Class is available in the IPv6 address,
>>  so we could use that address field. Another option is to implement a special
>>  connection options API for CMA.
>>
>>  Missing functionality by CMA is the usage of the provided QoS-Class and Service-ID
>>  in the sent PR/MPR. When a response is obtained it is an existing requirement for
>>  the CMA to use the PR/MPR from the response in setting up the QP address vector.
>>
>>
>>  6. SDP
>>  -------
>>
>>  SDP uses CMA for building its connections.
>>  The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
>>  holding the remote TCP/IP Port Number to connect to.
>>  SDP might be provided with SO_PRIORITY socket option. In that case the value
>>  provided should be sent to the CMA as the TClass option of that connection.
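
The SDP Service-ID layout quoted above can be sketched as follows. This is
only an illustration of the 0x000000000001PPPP format; the function name
sdp_service_id is made up here and is not taken from the SDP or CMA code.

```c
#include <stdint.h>

/* Build the SDP Service-ID 0x000000000001PPPP from a TCP port number.
 * The function name is hypothetical. */
static uint64_t sdp_service_id(uint16_t tcp_port)
{
    /* Bit 16 is the fixed "1" digit; the low 16 bits carry the port. */
    return UINT64_C(0x0000000000010000) | tcp_port;
}
```

For example, port 80 (0x0050) gives 0x0000000000010050.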
>>
>>  7. SRP
>>  -------
>>
>>  Current SRP implementation uses its own CM callbacks (not CMA). So SRP
>>  should fill in the Service-ID in the PR/MPR by itself and use that
>>  information in setting up the QP. The T10 SRP standard defines the SRP
>>  Service-ID to be defined by the SRP target I/O Controller (but they should
>>  also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
>>  by the I/O Controller in the ServiceEntries DMA attribute and should be used
>>  in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
>>
>>  8. iSER
>>  --------
>>  iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
>>  should be TBD.
>>
>>
>>  9. OpenSM features
>>  -------------------
>>  The QoS related functionality to be provided by OpenSM can be split into two
>>  main parts:
>>
>>  3.1. Fabric Setup
>>  During fabric initialization the SM should parse the policy and apply its
>>  settings to the discovered fabric elements. The following actions should be
>>  performed:
>>  * Parsing of policy
>>  * Node Group identification. Warning should be provided for each node not
>>    specified but found.
>>  * SL2VL settings validation should be checked:
>>    + A warning will be provided if there are no matching targets for the
>>      SL2VL setting statement.
>>    + An error message will be printed to the log file if an invalid setting
>>      is found. A setting is invalid if it refers to:
>>      - Non-existing port numbers of the target devices
>>      - Unsupported VLs for the target device. In the latter case the map to
>>        non-existing VLs should be replaced with VL15, i.e. packets will be
>>        dropped.
> 
> I'm not sure it is optimal. We could have a well-documented or even
> configurable mapping rule instead; then this would not limit devices with
> higher capabilities.

I'm open for suggestions.

>>  * SL2VL setting is to be performed
>>  * VL Arbitration table settings should be validated according to the
>>    following rules:
>>    + A warning will be provided if there are no matching targets for the
>>      setting statement
>>    + An error will be provided if the port number exceeds the target ports
>>    + An error will be generated if the table length exceeds device
>>      capabilities
> 
> Ditto.
> 
>>    + A warning will be generated if the table quote a VL that is not supported
>>      by the target device
> 
> What is "table quote" here?
>>  * VL Arbitration tables will be set on the appropriate targets
>>
>>  3.2. PR/MPR query handling:
>>  OpenSM should be able to enforce the provided policy on client request.
>>  The overall flow for such requests is: first the request is matched against
>>  the defined match rules such that the target QoS-Level definition is found.
>>  Given the QoS-Level, a path(s) search is performed with the given
>>  restrictions imposed by that level. The following two sections describe
>>  these steps.
>>
>>  How Service-ID is carried in the PathRecord and MultiPathRecord attributes
>>  is now standardized by the IBTA.
>>
>>
>>  3.2.1. Matching rule search:
>>  A rule is "matching" a PR/MPR request using the following criteria:
>>  * Matching rules provide values in a list of either single values or ranges
>>    of values. A PR/MPR field is "matching" the rule field if it is explicitly
>>    noted in the list of values or is one of the values covered by a range
>>    included in the field values list.
>>  * Only PR/MPR fields that have their component mask bit set should be
>>    compared.
>>  * For a rule to be "matching" a PR/MPR request all the rule fields should be
>>    "matching" their PR/MPR fields. Thus a PR/MPR request that does not have
>>    a component mask bit set for one of the rule-defined fields cannot match
>>    that rule.
>>  * A PR/MPR request that has a component mask bit set for one of the fields
>>    that is not defined by the rule can match the rule.
> 
> Aren't the last two too restrictive? The SA can just filter out paths in the
> response to match the rest of the rule. No?

Not sure I'm following.
The last bullet is not restrictive at all - it says that if you have a match
rule with some reduced set of fields (e.g. only service id), any PR/MPR with
a matching service id will be matched, even if it also has MTU, rate, etc.
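
A minimal sketch of these matching semantics, under simplifying assumptions:
each rule field holds one inclusive range instead of a list, only two rule
fields are modelled, and all type, flag and function names are invented for
this example (they are not OpenSM or IBTA definitions).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit flags naming the fields a rule or request carries. */
enum { F_SERVICE_ID = 1u << 0, F_TCLASS = 1u << 1, F_MTU = 1u << 2 };

struct range { uint64_t lo, hi; };  /* inclusive; single value: lo == hi */

struct match_rule {
    unsigned fields;                /* which fields this rule defines */
    struct range service_id;
    struct range tclass;
};

struct pr_request {
    unsigned comp_mask;             /* which fields the requester filled in */
    uint64_t service_id;
    uint64_t tclass;
    uint64_t mtu;
};

static bool in_range(struct range r, uint64_t v)
{
    return r.lo <= v && v <= r.hi;
}

static bool rule_matches(const struct match_rule *r, const struct pr_request *q)
{
    /* Every field the rule defines must be present in the request... */
    if ((r->fields & q->comp_mask) != r->fields)
        return false;
    /* ...and its value must fall inside the rule's range. */
    if ((r->fields & F_SERVICE_ID) && !in_range(r->service_id, q->service_id))
        return false;
    if ((r->fields & F_TCLASS) && !in_range(r->tclass, q->tclass))
        return false;
    /* Request fields the rule does not define (e.g. MTU) are ignored. */
    return true;
}
```

With a rule that defines only service-id 4719-5000, a request carrying
service-id 4800 plus an MTU matches, while a request that carries only an
MTU does not.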

>>  The algorithm to be used for searching for a rule match might be as simple
>>  as a sequential search through all rules or enhanced for better performance.
>>  The semantics of every rule field and its matching PR/MPR field are described
>>  below:
>>  * Source: the SGID or SLID should be part of this group
>>  * Destination: the DGID or DLID should be part of this group
>>  * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>>    SM-Key field) is matching any of this rule Service-IDs
>>  * TClass: check if the PR/MPR TClass field is matching
>>
>>  3.2.2 PR/MPR response generation:
>>  The QoS-Level pointed to by the first rule that matches the PR/MPR request
>>  should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
>>  Path-Bits and QoS-Class. A default QoS-Level should be used if no rule
>>  matches the query.
> 
> Where should this default be defined?

OK, I missed that part. Here it is:

  - qos-level sequential numbers are counted from 0
  - qos-level number 0 is mandatory and is treated as the Default Level: it is
    applied to any PR/MPR request that didn't match any match rule
  - the default qos-level can also be referred to explicitly in any match rule
    by specifying "qos-level-sn: 0"
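
A small sketch of that selection logic, assuming the per-rule match result
has already been computed (struct rule and select_qos_level are hypothetical
names used only for this example):

```c
#include <stddef.h>

/* Hypothetical rule: a precomputed match result plus the level it points to. */
struct rule {
    int matches;           /* non-zero if the rule matched the PR/MPR request */
    size_t qos_level_sn;   /* value of "qos-level-sn" in the rule */
};

/* Return the qos-level-sn of the first matching rule, else level 0. */
static size_t select_qos_level(const struct rule *rules, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (rules[i].matches)
            return rules[i].qos_level_sn;
    return 0; /* Default Level: no rule matched */
}
```

Rule order matters here: when several rules match, the earliest one in the
policy decides the level.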

-- Yevgeny

> Sasha
> 
> 
>>  The efficient algorithm for finding paths that meet the QoS-Level criteria
>>  is beyond the scope of this RFC and left for the implementer to provide.
>>  However, the criteria by which the paths match the QoS-Level are described
>>  below:
>>
>>  * SL: The paths found should all use the given SL. For that sake the PR/MPR
>>    algorithm should traverse the path from source to destination only through
>>    ports that carry a valid VL (not VL15) by the SL2VL map (should consider
>>    input and output ports and SL).
>>  * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
>>  * Rate-Limit: The resulting paths RATE should not exceed the given
>>    RATE-Limit (rate limit is given in units of link BW = Width*Speed
>>    according to IBTA Specification Vol-1 table-205 p-901 l-24).
>>  * Path-Bits: define the target LID lowest bits (number of bits defined by
>>    the target port PortInfo.LMC field). The path should traverse the LFT
>>    using the target port LID with the path-bits set.
>>  * QoS-Class: should be returned in the result PR/MPR. When routing is going
>>    to be supported by OpenSM we might use this field in selecting the target
>>    router too in a TBD way.
>>
> 


From erezz at voltaire.com  Thu Jul 26 05:53:39 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 26 Jul 2007 15:53:39 +0300
Subject: [ofa-general] iSER header
In-Reply-To: <20070726111106.GA14180@postal.iol.unh.edu>
References: <20070709144702.GB24125@postal.iol.unh.edu>
	<46933130.6040100@voltaire.com>
	<20070725192230.GA13579@postal.iol.unh.edu>
	<39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>
	<20070726111106.GA14180@postal.iol.unh.edu>
Message-ID: <46A89953.2010505@voltaire.com>


>> Will you send patches for iSER soon? I'd like to test it, and make sure
>> that iSER over IB is not damaged.
>>     
>
> Our patches may not interest you since we are using an older version of
> the iSER code.  However, we will also be exploring the use of IB with our
> implementations.  Will this require us to use the same non-standard iSER
> header in some cases?
>   
Yes, you will need to use the iSER header according to the current
implementation.

Erez


From hal.rosenstock at gmail.com  Thu Jul 26 05:58:48 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 08:58:48 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
	<OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
	<6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID: <f0e08f230707260558y4b62d4d1heec74058013eb12@mail.gmail.com>

On 7/26/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> I propose that when there is no MTU in the partition policy file
> OpenSM use a configurable default from /etc/cache/opensm/opensm.opt.
>

That would make this the default rather than 2K. IMO it should be when some
"special" unused mtu is set in the partition config.

-- Hal

> Something like:
> # The default MTU to be used for IPoIB and other MCGs when the
> # partition-policy does not provide exact value. The default is the
> # lowest possible MTU
> mcg_default_mtu 1
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>  ------------------------------
> From: Shirley Ma [mailto:xma at us.ibm.com]
> Sent: Wednesday, July 25, 2007 10:45 PM
> To: Eitan Zahavi
> Cc: general at lists.openfabrics.org; Hal Rosenstock
> Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
>
>
>
> Hello Eitan, Hal,
>
> Thanks. It's good openSM has the configuration option to set up these
> attributes in MC. Is this a good idea to add below to openSM: When there is
> no MTU defined in the configuration file, SM can pick up the smallest link
> MTU in the fabrics by default? MTU is unlikely rate, slower rate might
> indicate the cablling problem. So using the smallest link MTU in the fabrics
> might not be a bad choice for MC by default. The reason I request here is to
> create IP multicast group, MTU is not an attribute of the group. When
> mapping IP multicast to IB multicast, IB muliticast might fail because of
> different IB link MTU size in the group, but IP multicast group will be
> successful without knowing the failure. If admin sets MTU in configuration
> file, admin would know this failure. Otherwise, admin/users could spend too
> much time on debugging their broken multicasting applications.
>
> Thanks
> Shirley Ma
>
> "Eitan Zahavi" <eitan at mellanox.co.il>
> 07/25/07 12:25 PM
> To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
> cc: <general at lists.openfabrics.org>
> Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
>
> Hi Shirley,
>
> I think I understand where your question comes from...
> Many have issues with heterogeneous fabrics where not all nodes have the
> same MTU or speed, especially when IPoIB relies on all nodes joining the
> broadcast group.
>
> The term "join" for multicast groups is a little overloaded.
> If a node joins an existing MC group it has to have a rate
> (speed * width) > MCG.rate and support MTU > MCG.MTU, otherwise it is denied.
> If the join is actually a "create" the node has to provide the rate and
> MTU which define the MCG values.
>
> To allow the administrator to control the IPoIB MCGs' MTU and rate, OpenSM
> provides the means to control these values per partition. See the
> doc/partition-config.doc
> Still, the administrator should know what the lowest MTU and rate of the
> nodes expected to join the IPoIB subnet would be.
> The tradeoff is in the hands of the administrator, who can set a value
> that will prevent slow nodes from joining the group, or assign a low value
> that will fit all nodes but slow down communication ...
>
> EZ
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>
> ------------------------------
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
> Sent: Wednesday, July 25, 2007 10:01 PM
> To: Shirley Ma
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] Re: openSM: Different IB MTUs
>
> Shirley,
>
> On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
>    Hal,
>
>    Thanks for your prompt reply. I am asking how openSM handles
>    different link MTUs in the SA MCMemberRecord MTU. For example, suppose
>    some links have a 2K MTU and some a 1K MTU. Then when enabling IPoIB, how
>    does the SM decide the IPoIB broadcast group MCMemberRecord MTU size? When
>    an IB multicast group is first created from a 2K MTU node, which PMTU value
>    is attached to this IB multicast group's MCMemberRecord MTU?
>
>
>
> MCMemberRecord MTU gets the group MTU (when created). This comes either from
> the first joiner with sufficient components or is preconfigured (and MTU can
> be set in the config). If a joiner has insufficient MTU for the group, it is
> denied.
>
> -- Hal
>
>
>    Thanks
>    Shirley Ma
>
>    "Hal Rosenstock" <hal.rosenstock at gmail.com>
>    07/25/07 10:57 AM
>    To: Shirley Ma/Beaverton/IBM at IBMUS
>    cc: general at lists.openfabrics.org
>    Subject: Re: openSM: Different IB MTUs
>
>    Shirley,
>
>    On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>       Hello Hal,
>
>          How does openSM handle CAs with different MTUs in the
>          same subnet? For example, IPoIB broadcast group MTU, IB multicast group
>          PMTU? Does openSM pick up the smallest MTU in the subnet?
>
>
>    Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
>    MCMemberRecord MTU, or all of these ?
>
>    -- Hal
>       Thanks
>          Shirley Ma
>
>
>
>

From todd.rimmer at qlogic.com  Thu Jul 26 06:23:52 2007
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Thu, 26 Jul 2007 08:23:52 -0500
Subject: [ofa-general] Re: ARP in IPoIB
In-Reply-To: <c8028d330707260236w60b41487ufe2e84dba138a453@mail.gmail.com>
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061192B637F@EPEXCH2.qlogic.org>

 
________________________________

From: Amar Mudrankit


Michael,
thanks for your reply. But this gives rise to a couple of questions.

1] If such a multicast routing protocol for IB routers is not yet specified
by IBTA or IETF, then current implementations have the IP subnet restricted
to within an IB subnet. According to RFC 4391, section 9.1.1, the link layer
address is formed through a combination of GID + QPN. If we are not
spanning across IB subnets, what is the use of the GID, as we need to get the
LID from the GID? Probably, in that case, an ARP reply with LID, Q_Key and
other path information would be helpful, as it resolves the path in 1 loop
rather than 2 loops in the case of GID (first to resolve the GID and then to
get the LID).

[Todd Rimmer] Basing IPoIB on the GID keeps open the opportunity for
IPoIB to span IB subnets in the future.  Also this permits the SM to
manage the paths and PathRecord parameters appropriately even in
non-routed IB networks.  For example, if multi-pathing is used (LMC!=0,
hence giving multiple LIDs per port), the SM may respond to PathRecord
requests for a given Destination GID with a different LID depending on
Source GID.  Such a mechanism can be used to manage routes in the
fabric, etc.  That is just a simple example, since the PathRecord
includes lots of other information as well (QOS, routing info, MTU,
etc).  The SM can provide different PathRecord values for each Source
GID talking to a given Destination GID.

The IETF needed a "MAC Address" for IPoIB.  GID+QPN gave them a unique
endpoint with the potential to work through routers and still support
the full intentions of IB's QOS and routing options.  LID+QPN would
severely limit those capabilities.

In general it's a bad idea for end nodes to simply exchange LIDs as it
bypasses many of the intentions of the IB spec.  Such applications will
break in fabrics which use the more advanced IB QOS, routing, etc
options.

 
Todd Rimmer

Chief Architect 

QLogic System Interconnect Group

Voice: 610-233-4852     Fax: 610-233-4777

Todd.Rimmer at QLogic.com <mailto:Todd.Rimmer at QLogic.com>   www.QLogic.com

 

From umaxx at oleco.net  Thu Jul 26 06:44:31 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Thu, 26 Jul 2007 15:44:31 +0200
Subject: [ofa-general] ibv_modify_qp() return value 22
In-Reply-To: <46A870A1.5090401@dev.mellanox.co.il>
References: <20070726102553.5b02caea@marvin.local>
	<46A870A1.5090401@dev.mellanox.co.il>
Message-ID: <20070726154431.155d967b@marvin.local>

Hi,

On Thu, 26 Jul 2007 13:00:01 +0300
Dotan Barak <dotanb at dev.mellanox.co.il> wrote:

> Joerg Zinke wrote:

> > ibv_modify_qp() fails with return value 22 when I try to open a new
> > CM connection under load (already ~3000 RDMA connections opened). I
> > tried to figure out what return value 22 means but could not find
> > it in the mthca kernel driver.
> >
> > Any hints? What does return value 22 mean?
> >   
> The value 22 from ibv_modify_qp means that there was an invalid
> parameter when calling this verb.
> If you call ibv_modify_qp without any load (only several
> QPs), do you still get this error?
> 

In short: no, I do not get this error without load. I start with
the no-load situation and then more and more clients connect, up to
~3000.

I have a simple CM server which accepts RDMA connections from
thousands of clients. The code is based on the example/ stuff, with
the same simple handler functions to do the REQ/REP/RTU.
Everything works fine; they all connect in the same manner (same
handler functions) until the point where ~3000 clients are connected and
the request handler on the server fails to modify the QP for the Reply,
with return value 22.
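
For reference, a small sketch of how such a return code can be turned into
readable text. It assumes the verb reports failure as an errno value (22 is
EINVAL on Linux); the helper name verbs_rc_to_str is made up for this example
and is not part of libibverbs.

```c
#include <errno.h>
#include <string.h>

/* Map a verbs-style return code (0 or an errno value) to a message.
 * The helper name is hypothetical. */
static const char *verbs_rc_to_str(int rc)
{
    return rc == 0 ? "success" : strerror(rc);
}
```

Logging verbs_rc_to_str(ret) next to the QP attributes and attribute mask
passed to the failing call would show which request triggers the error.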

By load I meant a lot of connections, not the data transfer.
Not all clients send data through the RDMA connections the whole
time; most of the time only one or two of the clients are
sending little integer pieces, so there is not much load on the lines.

Cheers 

Joerg


From dledford at redhat.com  Thu Jul 26 06:51:44 2007
From: dledford at redhat.com (Doug Ledford)
Date: Thu, 26 Jul 2007 13:51:44 +0000
Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
In-Reply-To: <1185297645.14681.22.camel@trinity.ogc.int>
References: <1185297645.14681.22.camel@trinity.ogc.int>
Message-ID: <1185457905.5165.695.camel@firewall.xsintricity.com>

On Tue, 2007-07-24 at 12:20 -0500, Tom Tucker wrote:
> For those interested in NFS-RDMA, OGC has created an install package
> based on the OFA 1.2 GA release. The package supports both SLES 10 and
> RHEL 5. You can download this package from
> http://www.opengridcomputing.com/nfs-rdma.html.
> 
> Please let me know if you find any problems.

Hi Tom, can you tell me anything about the plans for getting this
upstream?

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

From hal.rosenstock at gmail.com  Thu Jul 26 07:16:58 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 14:16:58 +0000
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Fix
	comment
Message-ID: <f0e08f230707260716j4a824cddqe06acfd2b19983e@mail.gmail.com>

include/iba/ib_types.h: Fix comment

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 5820ee6..54c2250 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -4931,7 +4931,7 @@ ib_port_info_get_mtu_cap(
 *              [in] Pointer to a PortInfo attribute.
 *
 * RETURN VALUES
-*      Returns the LMC value assigned to this port.
+*      Returns the encoded value for the maximum MTU supported by this port.
 *
 * NOTES
 *


From swise at opengridcomputing.com  Thu Jul 26 07:18:22 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 26 Jul 2007 09:18:22 -0500
Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A69225.9090502@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
	<46A69225.9090502@ichips.intel.com>
Message-ID: <46A8AD2E.9000908@opengridcomputing.com>

Sean Hefty wrote:
> Steve,
> 
> Do you have any input with respect to how the RDMA CM selects and maps 
> QoS (priority, traffic class, VLAN, flow label, etc.)?  (See below)
> 
> Hide the QoS selection under the current interface?  Use the IPv6 
> flowinfo field?  Rely on destination port?  Input QoS through existing 
> or new call?  Handle IPv4 and IPv6 addresses differently?  ???
> 
> - Sean
> 
>>> 2.6. ULPs and programs using CMA to establish RC connection should 
>>> provide the CMA the target IP and Service-ID. Some of the ULPs might
>>> also provide QoS-Class (E.g. for SDP sockets that are provided the
>>> TOS socket option). The CMA should then use the provided Service-ID
>>> and optional QoS-Class and pass them in the PR/MPR request. The
>>> resulting PR/MPR should be used for configuring the connection QP.
>>
>> The interface to the CMA needs to remain as transport independent as 
>> possible, and I am unsure of the transport independence of tying QoS 
>> to the destination port number.  (I'm not disagreeing; I'm just not 
>> sure at the moment it's the right approach.)
>>

In the socket API, socket options describe what protocol they are 
intended for.  You can have options that are intended for IP or TCP and 
other protocol layers.

We could do some rdma_setopt() interface, and define both transport 
independent options and transport-specific options.  Then if there are 
features of qos that are only in IB, you can make them 
transport-specific options.  So an option struct may have a 
transport_type field...
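
To make the idea concrete, here is a purely hypothetical sketch of such an
option struct. None of these names (struct rdma_opt, opt_applies, the enum
values) exist in librdmacm; they only illustrate how a transport_type field
could separate generic options from transport-specific ones.

```c
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none of them exist in librdmacm. */
enum opt_transport { OPT_TRANSPORT_ANY, OPT_TRANSPORT_IB, OPT_TRANSPORT_IWARP };

struct rdma_opt {
    enum opt_transport transport_type; /* which transport the option targets */
    int name;                          /* option identifier */
    const void *value;                 /* option payload */
    size_t len;                        /* payload length in bytes */
};

/* An option applies if it is transport independent or targets this transport. */
static bool opt_applies(const struct rdma_opt *opt, enum opt_transport id_transport)
{
    return opt->transport_type == OPT_TRANSPORT_ANY ||
           opt->transport_type == id_transport;
}
```

An IB-only QoS option would then simply be skipped (or rejected) on an iWARP
id, while generic options pass through on both.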

Although I _think_ it would be a good thing to try to map 
transport-specific qos attributes to a universal transport-independent 
attribute.  But I'm not an expert on qos stuff...

>>> 5. CMA features
>>> ----------------
>>>
>>> The CMA interface supports Service-ID through the notion of port
>>> space as a prefix to the port_num which is part of the sockaddr
>>> provided to rdma_resolve_addr(). What is missing is the explicit
>>> request for a QoS-Class that should allow the ULP (like SDP) to
>>> propagate a specific request for a class of service. A mechanism for
>>> providing the QoS-Class is available in the IPv6 address, so we could
>>> use that address field. Another option is to implement a special 
>>> connection options API for CMA.
>>>
>>> Missing functionality by CMA is the usage of the provided QoS-Class
>>> and Service-ID in the sent PR/MPR. When a response is obtained it is
>>> an existing requirement for the CMA to use the PR/MPR from the
>>> response in setting up the QP address vector.
>>
>> The most natural function to specify additional QoS parameters would 
>> be rdma_resolve_route.

Or a more generic rdma_setopt() that can be extensible for future 
options/attributes and not break the API...

My 2 cents.

Stevo.


From jlentini at netapp.com  Thu Jul 26 07:16:51 2007
From: jlentini at netapp.com (James Lentini)
Date: Thu, 26 Jul 2007 10:16:51 -0400 (EDT)
Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
In-Reply-To: <1185457905.5165.695.camel@firewall.xsintricity.com>
References: <1185297645.14681.22.camel@trinity.ogc.int>
	<1185457905.5165.695.camel@firewall.xsintricity.com>
Message-ID: <Pine.LNX.4.64.0707260954590.2834@jlentini-linux.nane.netapp.com>


On Thu, 26 Jul 2007, Doug Ledford wrote:

> On Tue, 2007-07-24 at 12:20 -0500, Tom Tucker wrote:
> > For those interested in NFS-RDMA, OGC has created an install package
> > based on the OFA 1.2 GA release. The package supports both SLES 10 and
> > RHEL 5. You can download this package from
> > http://www.opengridcomputing.com/nfs-rdma.html.
> > 
> > Please let me know if you find any problems.
> 
> Hi Tom, can you tell me anything about the plans for getting this 
> upstream?

The goal is to make this code acceptable for 2.6.24.

The client and server code have been posted for review on the linux 
nfs mailing list, nfs at lists.sourceforge.net. See the posts by Tom 
Talpey on July 11 for the client code

 http://sourceforge.net/mailarchive/forum.php?forum_name=nfs&max_rows=25&style=ultimate&viewmonth=200707&viewday=11

and the post by Tom Tucker on July 10 for the server code

 http://sourceforge.net/mailarchive/forum.php?forum_name=nfs&max_rows=25&style=ultimate&viewmonth=200707&viewday=10

Now that the 2.6.23 merge window is closed and people have time to 
review new code, we are hoping for more comments.


From hal.rosenstock at gmail.com  Thu Jul 26 07:25:30 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 10:25:30 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_port.c: Fix opvls and neighbormtu
	when remote port invalid
Message-ID: <f0e08f230707260725m5779a12dw26dd5af3bc29ff56@mail.gmail.com>

OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index e03e316..b9c52f4 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -387,12 +388,12 @@ osm_physp_calc_link_mtu(

  OSM_LOG_ENTER( p_log, osm_physp_calc_link_mtu );

-  /* use the available MTU */
-  mtu = ib_port_info_get_mtu_cap( &p_physp->port_info );
-
  p_remote_physp = osm_physp_get_remote( p_physp );
  if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
  {
+    /* use the available MTU */
+    mtu = ib_port_info_get_mtu_cap( &p_physp->port_info );
+
    remote_mtu = ib_port_info_get_mtu_cap( &p_remote_physp->port_info );

    if( osm_log_is_active( p_log, OSM_LOG_DEBUG ) )
@@ -427,6 +428,8 @@ osm_physp_calc_link_mtu(
      }
    }
  }
+  else
+    mtu = ib_port_info_get_neighbor_mtu( &p_physp->port_info );

  if( mtu == 0 )
  {
@@ -454,12 +457,12 @@ osm_physp_calc_link_op_vls(

  OSM_LOG_ENTER( p_log, osm_physp_calc_link_op_vls );

-  /* use the available VLCap */
-  op_vls = ib_port_info_get_vl_cap( &p_physp->port_info );
-
  p_remote_physp = osm_physp_get_remote( p_physp );
  if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
  {
+    /* use the available VLCap */
+    op_vls = ib_port_info_get_vl_cap( &p_physp->port_info );
+
    remote_op_vls = ib_port_info_get_vl_cap( &p_remote_physp->port_info );

    if( osm_log_is_active( p_log, OSM_LOG_DEBUG ) )
@@ -496,6 +499,8 @@ osm_physp_calc_link_op_vls(
      }
    }
  }
+  else
+    op_vls = ib_port_info_get_op_vls( &p_physp->port_info );

  /* support user limitation of max_op_vls */
  if (op_vls > p_subn->opt.max_op_vls)


From mst at dev.mellanox.co.il  Thu Jul 26 07:31:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 17:31:39 +0300
Subject: [ofa-general] Re: [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
In-Reply-To: <Pine.LNX.4.64.0707260954590.2834@jlentini-linux.nane.netapp.com>
References: <1185297645.14681.22.camel@trinity.ogc.int>
	<1185457905.5165.695.camel@firewall.xsintricity.com>
	<Pine.LNX.4.64.0707260954590.2834@jlentini-linux.nane.netapp.com>
Message-ID: <20070726143139.GL22557@mellanox.co.il>


> Now that the 2.6.23 merge window is closed and people have time to 
> review new code, we are hoping for more comments.

You might want to send a copy of the patches to openfabrics general and lkml if you do.

-- 
MST


From xma at us.ibm.com  Thu Jul 26 07:58:21 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 26 Jul 2007 07:58:21 -0700
Subject: [ofa-general] Re: Re: openSM: Different IB MTUs
In-Reply-To: <20070726072245.GC13258@mellanox.co.il>
Message-ID: <OF5327C1ED.1533FA78-ON87257324.005240EA-88257324.00264C02@us.ibm.com>


Setting the default to 4 (2K) is more appropriate than 1 (256 bytes in the
IB MTU encoding). All HCAs support at least 2K now.

Thanks
Shirley Ma


"Michael S. Tsirkin" <mst at dev.mellanox.co.il>
07/26/07 12:22 AM
Please respond to "Michael S. Tsirkin" <mst at dev.mellanox.co.il>
To: Shirley Ma/Beaverton/IBM at IBMUS
cc: Eitan Zahavi <eitan at mellanox.co.il>, general at lists.openfabrics.org
Subject: Re: Re: openSM: Different IB MTUs

What does "1" mean? Surely not 1 byte MTU :)
IMO a good format would be the MTU value in bytes.
E.g. 512, 1024, 2048, 4096.
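
For reference, a minimal sketch of how the single-digit values relate to byte sizes. It assumes that mcg_default_mtu uses the InfiniBand MTU encoding (1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096), which is consistent with "4 (2K)" earlier in the thread. Under that assumption "1" would mean 256 bytes rather than 512. The helper names are illustrative and are not part of OpenSM.

```c
#include <stdio.h>

/* InfiniBand MTU encodings (the same values appear in the Linux
 * kernel's enum ib_mtu). */
enum ib_mtu_code {
    MTU_CODE_256  = 1,
    MTU_CODE_512  = 2,
    MTU_CODE_1024 = 3,
    MTU_CODE_2048 = 4,
    MTU_CODE_4096 = 5
};

/* Hypothetical helper: encoded MTU -> bytes, or -1 for a reserved code. */
int mtu_code_to_bytes(int code)
{
    if (code < MTU_CODE_256 || code > MTU_CODE_4096)
        return -1;
    return 128 << code;            /* 1 -> 256, 2 -> 512, ..., 5 -> 4096 */
}

/* Hypothetical helper: bytes -> encoded MTU, or -1 if not a valid IB MTU. */
int mtu_bytes_to_code(int bytes)
{
    for (int code = MTU_CODE_256; code <= MTU_CODE_4096; code++)
        if ((128 << code) == bytes)
            return code;
    return -1;
}

/* Prints the whole table, e.g. for a usage message. */
void print_mtu_table(void)
{
    for (int code = MTU_CODE_256; code <= MTU_CODE_4096; code++)
        printf("mcg_default_mtu %d  ->  %d bytes\n",
               code, mtu_code_to_bytes(code));
}
```

A config parser that accepted byte values, as suggested above, could call mtu_bytes_to_code() and reject anything that returns -1.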

Quoting Shirley Ma <xma at us.ibm.com>:
Subject: RE: Re: openSM: Different IB MTUs

Eitan,

That's a good approach to address the issue.

thanks
Shirley Ma

From: "Eitan Zahavi" <eitan at mellanox.co.il>
Sent: 07/25/07 11:00 PM
To: Shirley Ma/Beaverton/IBM at IBMUS
Cc: <general at lists.openfabrics.org>, "Hal Rosenstock" <hal.rosenstock at gmail.com>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

I propose that when there is no MTU in the partition policy file OpenSM use a
configurable default from: /etc/cache/opensm/opensm.opt.
Something like:
# The default MTU to be used for IPoIB and other MCGs when the partition-policy
# does not provide exact value. The default is the lowest possible MTU
mcg_default_mtu 1

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

From: Shirley Ma [mailto:xma at us.ibm.com]
Sent: Wednesday, July 25, 2007 10:45 PM
To: Eitan Zahavi
Cc: general at lists.openfabrics.org; Hal Rosenstock
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hello Eitan, Hal,

Thanks. It's good that openSM has a configuration option to set up these
attributes in MC. Would it be a good idea to add the following to openSM: when
there is no MTU defined in the configuration file, SM picks the smallest link
MTU in the fabric by default? MTU is unlike rate: a slower rate might indicate
a cabling problem. So using the smallest link MTU in the fabric might not be a
bad choice for MC by default. The reason for my request is that when creating
an IP multicast group, MTU is not an attribute of the group. When mapping IP
multicast to IB multicast, the IB multicast join might fail because of
different IB link MTU sizes in the group, but the IP multicast group will be
created successfully without knowing about the failure. If the admin sets the
MTU in the configuration file, the admin would know about this failure.
Otherwise, admins/users could spend too much time debugging their broken
multicast applications.
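
The default proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the flat array of per-port MTU codes, and the fallback parameter are all hypothetical and do not reflect how OpenSM stores port information.

```c
#include <stddef.h>

/* Hypothetical sketch: choose the default MCG MTU as the smallest
 * encoded link MTU seen on any port in the fabric. Encoded IB MTU
 * values are ordered (1 = 256 ... 5 = 4096), so a plain minimum works.
 * 'fallback' is returned when no ports are known yet. */
int smallest_link_mtu(const unsigned char *port_mtu_codes, size_t n,
                      int fallback)
{
    if (n == 0)
        return fallback;

    int min = port_mtu_codes[0];
    for (size_t i = 1; i < n; i++)
        if (port_mtu_codes[i] < min)
            min = port_mtu_codes[i];
    return min;
}
```

With ports reporting codes 4, 3 and 5 (2K, 1K, 4K), the group would be created with code 3, so the 1K node could still join.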

Thanks
Shirley Ma

From: "Eitan Zahavi" <eitan at mellanox.co.il>
Sent: 07/25/07 12:25 PM
To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
Cc: <general at lists.openfabrics.org>
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hi Shirley,

I think I understand where your question comes from...
Many have issues with heterogeneous fabrics where not all nodes have the same
MTU or speed,
especially when IPoIB relies on all nodes joining the broadcast group.

The term "join" for multicast groups is a little overloaded.
If a node joins an existing MC group it has to have a rate (speed * width) >=
MCG.rate and support MTU >= MCG.MTU, otherwise it is denied.
If the join is actually a "create", the node has to provide the rate and MTU
which define the MCG values.
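
The two meanings of "join" described above can be sketched like this. It is a simplified model, not the SA implementation: the struct and function names are hypothetical, rate is kept in Gb/s instead of the encoded IB rate value, the MCMemberRecord selector fields are ignored, and "sufficient" is taken to mean greater than or equal.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down view of a multicast group. */
struct mcg {
    int  mtu_code;    /* encoded IB MTU, 1 = 256 ... 5 = 4096 */
    int  rate_gbps;   /* speed * width, in Gb/s */
    bool exists;
};

/* Hypothetical capabilities of the port that sends the join. */
struct port_caps {
    int mtu_code;
    int rate_gbps;
};

/* Returns true if the join is accepted.
 * - Group does not exist yet: the join is a "create" and the joiner's
 *   values define the group.
 * - Group exists: the joiner must be at least as capable as the group,
 *   otherwise it is denied. */
bool mcg_join(struct mcg *g, const struct port_caps *p)
{
    if (!g->exists) {
        g->mtu_code  = p->mtu_code;
        g->rate_gbps = p->rate_gbps;
        g->exists    = true;
        return true;
    }
    return p->mtu_code >= g->mtu_code && p->rate_gbps >= g->rate_gbps;
}
```

This also shows the ordering problem raised in the thread: if a 2K node creates the group first, a later 1K node is denied.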

To allow the administrator to control the IPoIB MCGs' MTU and rate, OpenSM
provides the means to control these values per partition. See the
doc/partition-config.doc
Still, the administrator should know the lowest MTU and rate of the nodes
expected to join the IPoIB subnet.
The tradeoff is in the hands of the administrator, who can set a value that
will prevent slow nodes from joining the group,
or assign a low value that will fit all nodes but slow down communication...

EZ

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

From: general-bounces at lists.openfabrics.org [
mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
Sent: Wednesday, July 25, 2007 10:01 PM
To: Shirley Ma
Cc: general at lists.openfabrics.org
Subject: [ofa-general] Re: openSM: Different IB MTUs

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

        Hal,

        Thanks for your prompt reply. I am asking how openSM handles
        different link MTUs in SA MCMemberRecord MTU. For example, if we have
        some links with MTU 2K and some with MTU 1K, then when enabling IPoIB,
        how does SM decide the IPoIB broadcast group MCMemberRecord MTU size?
        When creating an IB multicast group from a 2K MTU node first, which PMTU
        value is attached to this IB multicast group's MCMemberRecord MTU?


MCMemberRecord MTU gets the group MTU (when created). This is either this first
joiner with sufficient components or preconfigured (and MTU can be set in the
config). If a joiner has insufficient MTU for the group, it is denied.

-- Hal

        Thanks
        Shirley Ma

        From: "Hal Rosenstock" <hal.rosenstock at gmail.com>
        Sent: 07/25/07 10:57 AM
        To: Shirley Ma/Beaverton/IBM at IBMUS
        Cc: general at lists.openfabrics.org
        Subject: Re: openSM: Different IB MTUs

        Shirley,

        On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote:
                        Hello Hal,

                        How does openSM handle CAs with different MTUs in the
                        same subnet? For example, IPoIB broadcast group MTU, IB
                        multicast group PMTU? Does openSM pick up the smallest
                        MTU in the subnet?


        Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
        MCMemberRecord MTU, or all of these ?

        -- Hal
                        Thanks
                        Shirley Ma


_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

--
MST

From akepner at sgi.com  Thu Jul 26 08:33:54 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 26 Jul 2007 08:33:54 -0700
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell
	writes
In-Reply-To: <20070726033946.GA31524@mellanox.co.il>
References: <20070726014931.GL10235@sgi.com>
	<20070726033946.GA31524@mellanox.co.il>
Message-ID: <20070726153354.GN10235@sgi.com>

On Thu, Jul 26, 2007 at 06:39:46AM +0300, Michael S. Tsirkin wrote:

> ....
> These should be getting 'union mthca_doorbell *db' I think.
> 

Hi Michael;

Want to make sure I understand your point. Are you saying, e.g., 
that the function:

static inline void mthca_ring_db(union mthca_doorbell db, 
				 void __iomem *dest,
                                 spinlock_t *doorbell_lock)

should instead have the prototype:

static inline void mthca_ring_db(union mthca_doorbell* db,
                                 void __iomem *dest,
                                 spinlock_t *doorbell_lock)

?

If so, I'm not sure I agree. The union mthca_doorbell is 
64 bits so can be passed in a register, but passing a pointer 
requires a few extra operations to calculate the address, 
and dereference the pointer. But maybe I misunderstand you...


Now that I look at this again, the __attribute__ ((aligned...))
thing on union mthca_doorbell is pretty silly - of course the  
alignment is going to be sizeof(__be64)....

+union mthca_doorbell {
+       __be64 val64;
+       __be32 val32[2];
+} __attribute__ ((aligned (sizeof(__be64))));
+
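
A quick user-space way to check the alignment point above. This is a sketch, with uint64_t/uint32_t standing in for __be64/__be32 and made-up union names. On common 64-bit ABIs both unions should report the same alignment; on some 32-bit ABIs (I believe i386 System V is one) 64-bit integers may only be 4-byte aligned inside aggregates, in which case the attribute would not be a no-op.

```c
#include <stdint.h>
#include <stdio.h>

/* Same layout as the proposed union, without the attribute. */
union doorbell_plain {
    uint64_t val64;
    uint32_t val32[2];
};

/* Same layout, with the explicit alignment from the patch. */
union doorbell_aligned {
    uint64_t val64;
    uint32_t val32[2];
} __attribute__((aligned(sizeof(uint64_t))));

/* Prints size and alignment of both variants for the current ABI. */
void print_doorbell_alignment(void)
{
    printf("plain:   size %zu, alignment %zu\n",
           sizeof(union doorbell_plain),
           (size_t)_Alignof(union doorbell_plain));
    printf("aligned: size %zu, alignment %zu\n",
           sizeof(union doorbell_aligned),
           (size_t)_Alignof(union doorbell_aligned));
}
```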

-- 
Arthur


From jlentini at netapp.com  Thu Jul 26 08:45:04 2007
From: jlentini at netapp.com (James Lentini)
Date: Thu, 26 Jul 2007 11:45:04 -0400 (EDT)
Subject: [ofa-general] Re: [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
In-Reply-To: <20070726143139.GL22557@mellanox.co.il>
References: <1185297645.14681.22.camel@trinity.ogc.int>
	<1185457905.5165.695.camel@firewall.xsintricity.com>
	<Pine.LNX.4.64.0707260954590.2834@jlentini-linux.nane.netapp.com>
	<20070726143139.GL22557@mellanox.co.il>
Message-ID: <Pine.LNX.4.64.0707261143540.2834@jlentini-linux.nane.netapp.com>


On Thu, 26 Jul 2007, Michael S. Tsirkin wrote:

> 
> > Now that the 2.6.23 merge window is close and people have time to 
> > review new code, we are hoping for more comments.
> 
> You might want to send copy of patches to openfabrics general and 
> lkml if you do.

Good idea. We were already planning to copy openfabrics for the next 
round of reviews.


From chas at cmf.nrl.navy.mil  Thu Jul 26 09:04:56 2007
From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR)
Date: Thu, 26 Jul 2007 12:04:56 -0400
Subject: [ofa-general] Re: Re: openSM: Different IB MTUs 
In-Reply-To: <OF5327C1ED.1533FA78-ON87257324.005240EA-88257324.00264C02@us.ibm.com>
Message-ID: <200707261604.l6QG4uJ5011958@cmf.nrl.navy.mil>

In message <OF5327C1ED.1533FA78-ON87257324.005240EA-88257324.00264C02 at us.ibm.com>,
Shirley Ma writes:
>Set default as 4 (2K) is more proper than 1(512?). All HCAs support 2K at
>least now.

don't some devices perform better with 1k MTUs?  in particular, any
device that suffers from the 'tavor quirk'.


From mst at dev.mellanox.co.il  Thu Jul 26 09:48:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 19:48:21 +0300
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell
	writes
In-Reply-To: <20070726153354.GN10235@sgi.com>
References: <20070726014931.GL10235@sgi.com>
	<20070726033946.GA31524@mellanox.co.il>
	<20070726153354.GN10235@sgi.com>
Message-ID: <20070726164821.GA3930@mellanox.co.il>

> Quoting akepner at sgi.com <akepner at sgi.com>:
> Subject: Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes
> 
> On Thu, Jul 26, 2007 at 06:39:46AM +0300, Michael S. Tsirkin wrote:
> 
> > ....
> > These should be getting 'union mthca_doorbell *db' I think.
> > 
> 
> Hi Michael;
> 
> Want to make sure I understand your point. Are you saying, e.g., 
> that the function:
> 
> static inline void mthca_ring_db(union mthca_doorbell db, 
> 				 void __iomem *dest,
>                                  spinlock_t *doorbell_lock)
> 
> should instead have the prototype:
> 
> static inline void mthca_ring_db(union mthca_doorbell* db,
>                                  void __iomem *dest,
>                                  spinlock_t *doorbell_lock)
> 
> ?

Yes.

> If so, I'm not sure I agree. The union mthca_doorbell is 
> 64 bits so can be passed in a register, but passing a pointer 
> requires a few extra operations to calculate the address, 
> and dereference the pointer. But maybe I misunderstand you...

This is really a coding style thing.

It's usually not a good idea to pass unions/structures by value.
If union size is later changed to be large, gcc might pass it in
a global data section, which fails to be reentrant.

Try compiling both variants and looking at the code - I expect
there won't be a difference.
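
One way to try that comparison outside the driver is a small user-space sketch like the one here. It makes several assumptions: plain memory stands in for the __iomem register, the doorbell lock is left out, and the function names are hypothetical rather than the mthca ones. Compiling it with gcc -O2 -S and comparing the two callers should show whether the generated code differs.

```c
#include <stdint.h>

/* User-space stand-in for union mthca_doorbell. */
union doorbell {
    uint64_t val64;
    uint32_t val32[2];
};

/* Variant 1: the union is passed by value. */
static inline void ring_db_by_value(union doorbell db,
                                    volatile uint64_t *dest)
{
    *dest = db.val64;
}

/* Variant 2: the union is passed by pointer. */
static inline void ring_db_by_pointer(const union doorbell *db,
                                      volatile uint64_t *dest)
{
    *dest = db->val64;
}

/* Two callers to compare in the assembly output. */
void caller_by_value(uint32_t hi, uint32_t lo, volatile uint64_t *dest)
{
    union doorbell db;
    db.val32[0] = hi;
    db.val32[1] = lo;
    ring_db_by_value(db, dest);
}

void caller_by_pointer(uint32_t hi, uint32_t lo, volatile uint64_t *dest)
{
    union doorbell db;
    db.val32[0] = hi;
    db.val32[1] = lo;
    ring_db_by_pointer(&db, dest);
}
```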

> Now that I look at this again, the __attribute__ ((aligned...))
> thing on union mthca_doorbell is pretty silly - of course the  
> alignment is going to be sizeof(__be64)....
> 
> +union mthca_doorbell {
> +       __be64 val64;
> +       __be32 val32[2];
> +} __attribute__ ((aligned (sizeof(__be64))));
> +

Right.

-- 
MST


From mshefty at ichips.intel.com  Thu Jul 26 10:21:16 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 10:21:16 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A846FC.5040704@voltaire.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com>
Message-ID: <46A8D80C.1090305@ichips.intel.com>

> I thinking that we are making progress, starting to converge.
> 
> My suggestion is that if you put the PR caching code within the ib_sa 
> module, add a parameter for the ib_sa_path_rec_get() where the caller 
> specifies if it is willing to get cached PR or not. Also I suggest that 
>  rdma_resolve_route() should be also enhanced to have a similar param 
> such that even native IB based ULPs can ask for not cached info if they 
> want to.

I still believe that these should be separate policies.  Consider that 
the cache could have updated immediately before a PR lookup from IPoIB - 
perhaps in response to an SA event.

Administrators can enable or disable the cache.  I don't believe that 
individual applications should be able to override the administrator, 
nor do I think we gain anything by having per application settings. 
This is similar to exposing to applications whether they want to use 
cached ARP information every time they connect.

> For example, I think it would be correct for IB block and file I/O ULPs 
> (iSER, SRP, Lustre, rNFS, etc) to request non cached PR, as their 
> connecting model is not all-to-all but rather n-to-m (n clients to m 
> servers with m << n), the connections are long-lived (hours, days, 
> weeks, more) and a connection failure as of PR caching does not seem 
> acceptable.

I believe a better solution is for everyone to use cached records, if 
they exist, with a feedback mechanism from the CM that removes paths on 
a connection failure or path migration event.
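
A rough sketch of that idea: lookups are served from the cache when an entry exists, and a failure reported by the CM drops the entry so that the next lookup goes back to the SA. Everything here is hypothetical (the names, the fixed-size table, the trimmed path record keyed by the low 64 bits of the destination GID) and is not the ib_sa or local SA cache code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PR_CACHE_SLOTS 8

/* Hypothetical, heavily trimmed path record. */
struct path_rec {
    uint64_t dgid_lo;   /* low 64 bits of the destination GID (key) */
    uint16_t dlid;
};

struct pr_cache {
    struct path_rec slot[PR_CACHE_SLOTS];
    bool            valid[PR_CACHE_SLOTS];
};

/* Returns the cached record, or NULL on a miss (caller queries the SA). */
const struct path_rec *pr_cache_lookup(const struct pr_cache *c,
                                       uint64_t dgid_lo)
{
    for (size_t i = 0; i < PR_CACHE_SLOTS; i++)
        if (c->valid[i] && c->slot[i].dgid_lo == dgid_lo)
            return &c->slot[i];
    return NULL;
}

/* Stores a record returned by the SA. Reuses an entry with the same key,
 * else the first free slot, else overwrites slot 0 (no real eviction). */
void pr_cache_insert(struct pr_cache *c, struct path_rec rec)
{
    size_t target = 0;
    bool have_target = false;

    for (size_t i = 0; i < PR_CACHE_SLOTS; i++) {
        if (c->valid[i] && c->slot[i].dgid_lo == rec.dgid_lo) {
            target = i;
            break;
        }
        if (!c->valid[i] && !have_target) {
            target = i;
            have_target = true;
        }
    }
    c->slot[target]  = rec;
    c->valid[target] = true;
}

/* Feedback hook: called on a connection failure or path migration
 * event for this destination. */
void pr_cache_invalidate(struct pr_cache *c, uint64_t dgid_lo)
{
    for (size_t i = 0; i < PR_CACHE_SLOTS; i++)
        if (c->valid[i] && c->slot[i].dgid_lo == dgid_lo)
            c->valid[i] = false;
}
```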

With all to all connections over the rdma cm, the first thing that needs 
to be done is resolve the remote addresses to GIDs.  This causes an ARP 
storm, followed by an SA storm caused by IPoIB, followed by a second SA 
storm caused by the rdma cm.  For scalability, we need to remove both of 
these SA storms, not just the second.  We don't see the first SA storm 
today because IPoIB caches PRs.  Let's not add it.  Restricting caching 
to the rdma cm, but removing it from IPoIB leaves us with the same 
issues that we have today.

- Sean


From mst at dev.mellanox.co.il  Thu Jul 26 10:26:19 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 20:26:19 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A8D80C.1090305@ichips.intel.com>
References: <46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
Message-ID: <20070726172619.GA5208@mellanox.co.il>

> I believe a better solution is for everyone to use cached records, if 
> they exist, with a feedback mechanism from the CM that removes paths on 
> a connection failure or path migration event.

Ack timeout on an RC QP is also a good indication we should redo the lookup.
-- 
MST


From mshefty at ichips.intel.com  Thu Jul 26 10:37:49 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 10:37:49 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070726172619.GA5208@mellanox.co.il>
References: <46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>	<46A46A1D.6040000@voltaire.com>
	<46A4EF00.9070305@ichips.intel.com>	<46A5C8E6.5020906@voltaire.com>
	<46A628D8.4050109@ichips.intel.com>	<46A6F50C.5000906@voltaire.com>
	<46A78146.1090304@ichips.intel.com>	<46A846FC.5040704@voltaire.com>
	<46A8D80C.1090305@ichips.intel.com>
	<20070726172619.GA5208@mellanox.co.il>
Message-ID: <46A8DBED.40808@ichips.intel.com>

Michael S. Tsirkin wrote:
>> I believe a better solution is for everyone to use cached records, if 
>> they exist, with a feedback mechanism from the CM that removes paths on 
>> a connection failure or path migration event.
> 
> Ack timeout on an RC QP is also a good indication we should redo the lookup.

Do you know if we get a specific event for this?  (I don't remember.) 
Both the ib_cm and rdma_cm have interfaces that allow a user to report 
events on a connection.  They are used for path migration today, but we 
could easily extend them.

To minimize issues, I think we'll want some sort of feedback mechanism 
in place before enabling caching by default.

- Sean


From mst at dev.mellanox.co.il  Thu Jul 26 10:47:00 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 20:47:00 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A8DBED.40808@ichips.intel.com>
References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
	<20070726172619.GA5208@mellanox.co.il>
	<46A8DBED.40808@ichips.intel.com>
Message-ID: <20070726174700.GB5208@mellanox.co.il>


> Quoting Sean Hefty <mshefty at ichips.intel.com>:
> Subject: Re: [ofa-general] Re: IPoIB path caching
> 
> Michael S. Tsirkin wrote:
> >>I believe a better solution is for everyone to use cached records, if 
> >>they exist, with a feedback mechanism from the CM that removes paths on 
> >>a connection failure or path migration event.
> >
> >Ack timeout on an RC QP is also a good indication we should redo the 
> >lookup.
> 
> Do you know if we get a specific event for this?  (I don't remember.) 

CQE with error IIRC.

> Both the ib_cm and rdma_cm have interfaces that allow a user to report 
> events on a connection.  They are used for path migration today, but we 
> could easily extend them.

Makes sense.

> To minimize issues, I think we'll want some sort of feedback mechanism 
> in place before enabling caching by default.

Right.

-- 
MST


From mshefty at ichips.intel.com  Thu Jul 26 10:53:01 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 10:53:01 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A87938.6040305@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A7DEF8.7040608@ichips.intel.com>
	<46A7E59A.5070801@dev.mellanox.co.il>
	<46A80E37.5080304@ichips.intel.com>
	<46A87938.6040305@dev.mellanox.co.il>
Message-ID: <46A8DF7D.8050706@ichips.intel.com>

> But again, the administrator doesn't *have* to use all these.
> He can simply define sl2vl-tables, and then match service-id
> (in qos-match-rules) to a certain sl (in qos-levels).
> That's it.
> No MTU, rate, packet lifetime or any other low level data.
> Does the following file look better?

My take is that it's still too low level (GUIDs, SL to VL mappings, 
service ID ranges) for a user interface.  The format may be fine as the 
output of some graphical tool or an application that parses a simpler 
interface file.  But for a human, I think we should strive for something 
simpler (QoS for Dummies), even if we lose some flexibility with the 
easier interface.  Unfortunately, I don't have any specific ideas at the 
moment beyond 'easy'.

- Sean


From hal.rosenstock at gmail.com  Thu Jul 26 10:53:07 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 13:53:07 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM: More changes from osm.log to
	opensm.log
Message-ID: <f0e08f230707261053ge692598s87d8a03e3cec133c@mail.gmail.com>

OpenSM: More changes from osm.log to opensm.log

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 8038dd3..f3429ff 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -253,7 +253,7 @@ show_usage(void)
  printf( "-f\n"
          "--log_file\n"
          "          This option defines the log to be the given file.\n"
-          "          By default, the log goes to /var/log/osm.log.\n"
+          "          By default, the log goes to /var/log/opensm.log.\n"
          "          For the log to go to standard output use -f stdout.\n\n");
  printf( "-L <size in MB>\n"
          "--log_limit <size in MB>\n"
diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 082a00f..766779d 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -136,7 +136,7 @@ static void help_status(FILE *out, int detail)

 static void help_logflush(FILE *out, int detail)
 {
-       fprintf(out, "logflush -- flush the osm.log file\n");
+       fprintf(out, "logflush -- flush the opensm.log file\n");
 }

 static void help_querylid(FILE *out, int detail)
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 1641999..d1b8204 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -225,7 +225,7 @@ __osm_ni_rcv_set_links(
            osm_log( p_rcv->p_log, OSM_LOG_SYS,
                     "Errors on subnet. Duplicate GUID found "
                     "by link from a port to itself. "
-                     "See osm log for more details\n");
+                     "See opensm.log for more details\n");

            if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE )
              exit( 1 );


From jgunthorpe at obsidianresearch.com  Thu Jul 26 11:11:32 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 26 Jul 2007 12:11:32 -0600
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A8D80C.1090305@ichips.intel.com>
References: <46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
Message-ID: <20070726181132.GO19768@obsidianresearch.com>

On Thu, Jul 26, 2007 at 10:21:16AM -0700, Sean Hefty wrote:
> >My suggestion is that if you put the PR caching code within the ib_sa 
> >module, add a parameter for the ib_sa_path_rec_get() where the caller 
> >specifies if it is willing to get cached PR or not. Also I suggest that 
> > rdma_resolve_route() should be also enhanced to have a similar param 
> >such that even native IB based ULPs can ask for not cached info if they 
> >want to.
> 
> I still believe that these should be separate policies.  Consider that 
> the cache could have updated immediately before a PR lookup from IPoIB - 
> perhaps in response to an SA event.

FWIW, I agree with Sean. The kernel cache must be authoritative and
must not be overridden by ULP. View this as the first step to
creating a distributed SA, not as the first step to generalized
PR caching.

Linking things like ARP failures and QP failures to cache
'invalidates' is, IMHO, ultimately pointless. My view is that the SA
will have to grow a means to refresh data in the distributed SA when
it reconfigures the network. We have parts of this today via the
various SA traps, but no per-GID invalidation.

A client is probably going to detect a problem in the network before
the SM can fix it, so doing a PR will just get the same old bad
data. Further in many cases the SM can likely re-route the broken path
so that the old PR is still valid.  The number of times you actually
need to change a PR once issued should be very small. If your network
cares about fast-failover then it should have a high LMC and rely on
IB's explicit multipath feature, and the kernel cache design should
support this.

This same argument is why IPoIB ARP decisions really have no bearing
on IB PRs. IPoIB ARP logic and refreshes are designed to support the
distributed ND lookup model - IB PR's have completely different
lifetime rules that are totally unrelated to ARP's lifetime rules.  The
existing trap monitoring in Sean's module covers about 90% of the
cases in IB when you need to invalidate a PR, the last 10% will need
something new :(

Sean, it seems to me that a lot of what is being talked about here
really boils down to policy decisions about the caching. Maybe you'd
see less resistance if the kernel module didn't have any policy and
that was left to userspace. Even your choice today of putting the big
GetTable query in the kernel strikes me as something I'd prefer to see
in userspace.

Jason


From mshefty at ichips.intel.com  Thu Jul 26 11:58:04 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 11:58:04 -0700
Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A8AD2E.9000908@opengridcomputing.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A54659.8010608@ichips.intel.com>
	<46A69225.9090502@ichips.intel.com>
	<46A8AD2E.9000908@opengridcomputing.com>
Message-ID: <46A8EEBC.4090101@ichips.intel.com>

> In the socket API, socket options describe what protocol they are 
> intended for.  You can have options that are intended for IP or TCP and 
> other protocol layers.
> 
> We could do some rdma_setopt() interface, and define both transport 
> independent options and transport-specific options.  Then if there are 
> features of qos that are only in IB, you can make them 
> transport-specific options.  So an option struct may have a 
> transport_type field...
> 
> Although I _think_ it will be a good thing to try and map 
> transport-specific qos attributes to a univeral transport independent 
> attribute.  But I'm not an expert on qos stuff...

Based on the information I found, socket options are used to specify QoS 
/ TOS / DSCP / whatever they want to call it for IPv4, but not for IPv6. 
  For IPv6, the TC and FL fields are included with the socket address.

So... I think we're okay with IPv6, but will need an rdma_setopt() call 
to set the QoS info for IPv4 addresses.  I think we can keep the QoS 
attributes transport independent.

Note that for IB, we could avoid the rdma_setopt() call by mapping the 
resulting IB service ID to a QoS level, but I'd rather find a transport 
independent solution if possible.
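
For comparison, this is roughly how the two cases look with plain sockets. The sketch uses only standard socket calls; the wrapper names are made up, and nothing here is an existing or proposed rdma_cm interface. How an rdma_setopt() call would carry the IPv4 value is still open in this thread.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* IPv4: the TOS/DSCP byte is set with a socket option.
 * Returns 0 on success, -1 on error (see errno). */
int set_ipv4_tos(int fd, int tos)
{
    return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

/* IPv6: traffic class (8 bits) and flow label (20 bits) travel with the
 * address in sin6_flowinfo, stored in network byte order. */
void set_ipv6_flowinfo(struct sockaddr_in6 *sa, uint8_t tclass,
                       uint32_t flow_label)
{
    sa->sin6_flowinfo = htonl(((uint32_t)tclass << 20) |
                              (flow_label & 0xFFFFFu));
}
```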

> Or a more generic rdma_setopt() that can be extensible for future 
> options/attributes and not break the API...

I agree - my preference is not to break the user space API.

- Sean


From transter at gmail.com  Thu Jul 26 12:37:10 2007
From: transter at gmail.com (lbt)
Date: Thu, 26 Jul 2007 12:37:10 -0700
Subject: [ofa-general] Lost in-service traps during Open SM migration
In-Reply-To: <20070725220204.GI31582@sashak.voltaire.com>
References: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
	<20070725220204.GI31582@sashak.voltaire.com>
Message-ID: <ac71172a0707261237wb833b1bq66c64ca39fb3c321@mail.gmail.com>

Thanks for the suggestion Sasha!

Our host stack does receive a reregistration notice and does resubscribe
all handlers at that point in time. At the time of the SM migration, our stack
prints out some informational messages to confirm this:
Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8

I also confirmed in the SM logs that after the migration, the higher
priority SM is getting a subscription request for the in-service trap:
Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: Subscribe Request with QPN: 0x000001
Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0xFFFF
                                lid_range_end...........0x0
                                is_generic..............0x1
                                subscribe...............0x0
                                trap_type...............0x3
                                trap_num................64
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                node_type...............0x000004
Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]

It may be a problem if the resubscription of the in-service handler occurs
after the in-service notice was forwarded, but I think the problem is that
there is never a notice that is forwarded for the higher priority SM port
that is restored. Perhaps neither SM (the lower priority or the higher
priority one) generates an in-service trap because of the timing gap
between when the restored port is detected and "marked" (i.e. added to
new_ports_list) and when in-service traps are generated for new ports.
During SM migration, the lower priority SM detects the new port, but the
higher priority SM does the trap generation (but it doesn't realize that
its own port is a new port and thus doesn't generate a trap for it).

Our host stack executes some functions when a port is restored (in our
in-service subscription handler).
Am I not supposed to receive an in-service trap for a restored port that
happens to be the Master SM, and should I instead execute these actions on
a client reregistration event?

Thanks again for your help!
Lan


On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> Hi Lan,
>
> On 09:57 Wed 25 Jul     , lbt wrote:
> >  Hello,
> >
> >  I have been seeing a problem where a subscriber for in-service traps is
> not
> >  getting informed when the port of master openSM is restored (i.e.
> causing an
> >  SM migration).
> >
> >  I have an IB subnet with 2 nodes running OpenSM , different priorities
> of
> >  course (OpenSM Rev:openib-2.0.5). I also have another node on the
> subnet
> >  that has subscribed for the forwarding of any
> IB_SA_GENERIC_TRAP_NUM_IN_SVC
> >  trap events. I've been doing cable pull tests on the IB ports, to check
> if
> >  the in-service handler I have subscribed gets invoked when I restore
> the
> >  cable. I've noticed that everything works as expected ( i.e. my
> in-service
> >  handler is invoked) whenever I restore the cable on the lower priority
> SM IB
> >  port without ever touching the master SM port. But if I cause an SM
> >  migration, by restoring the port of the higher priority SM, the
> in-service
> >  trap does not get generated as expected on a cable restore.
> >
> >  Steps to Reproduce:
> >  1) Start with port to higher priority SM disconnected.
> >  2) restore port cable on the higher priority SM
> >  --> This causes an SM Migration as expected, SM's migration happens
> okay
> >  --> I expected the restoration of the higher priority SM port to also
> >  trigger an in-service trap and notify subscribers, but it doesn't
> >  occur
> >
> >  I have collected debug messages log for both open SM's, and it appears
> that
> >  the reason is because:
> >  1) in-service traps are generated based on what ports are added on the
> >  Master SM's new_ports_list, but these traps are generated only after
> LID
> >  assignment
> >  2) when the higher priority SM port is restored, the restored port gets
> >  added to the lower priority SM's new_ports_list (since it's still the
> Master
> >  SM at that point in time)
> >  3) the handover of Master  SM  from lower priority to higher priority
> SM
> >  occurs (before LID assignment and thus a chance for traps get generated
> for
> >  those ports on new_ports_list)
> >  4) the higher priority SM is now Master SM, but it has an empty
> >  new_ports_list, so no trap generated either
> >
> >  Does this look like a legitimate Open SM bug? Any feedback would be
> much
> >  appreciated, and if I can help further in any way please let me know .
>
> As far as I know when OpenSM (even old like 2.0.5) becomes master it
> requests client to reregister SA related stuff (by setting this bit in
> PortInfo).
>
> Probably your port doesn't support this (you could verify by checking
> PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
> maybe your host stack doesn't do reregistration?
>
> Anyway you could track this in the OpenSM code in osm_lid_mgr.c
> __osm_lid_mgr_set_physp_pi(): check whether the client reregistration bit
> is set (with ib_port_info_set_client_rereg()) or not. Then we will know
> more about this problem.
>
> Sasha
>
> >
> >
> >  Subset of logs from lower priority SM during the cable restore of
> higher
> >  priority SM port:
> >  ### Jul 18 14:31:56 614522 [41401960] ->
> __osm_trap_rcv_process_request:
> >  Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A
> >  TID:0x00000016000012e1
> >  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process:
> Received
> >  signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> >  ### 14:31:56 ******************** INITIATING HEAVY SWEEP
> >  **********************
> >  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process:
> Received
> >  signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> >  OSM_SM_STATE_SWEEP_HEAVY_SELF
> >  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding
> port
> >  GUID:0x00504501483e0000 to new_ports_list
> >  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received
> signal
> >  OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> >  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received
> signal
> >  OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> >  14:31:56 ********************* HEAVY SWEEP COMPLETE
> ***********************
> >  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received
> >  signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER###
> >  14:31:56 ******************** ENTERING SM STANDBY STATE
> *******************
> >
> >  Subset of logs from higher priority SM during the cable restore of
> higher
> >  priority SM port:
> >
> >  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> >  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received
> >  signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
> >  IB_SMINFO_STATE_DISCOVERING
> >  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> >  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
> >  ******************** ENTERING SM MASTER STATE ********************
> >  Jul 18 14:32:03 009014 [41401960] ->
> __osm_state_mgr_set_sm_lid_done_msg:
> >  **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> >  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
> >  ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> >  Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports:
> [
> >  ----> no in-service traps are generated and notices forwarded because
> there
> >  are no ports on this list
> >  Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports:
> ]
> >
> >
> >  Thanks!
> >  Lan
>
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From mshefty at ichips.intel.com  Thu Jul 26 13:16:54 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 13:16:54 -0700
Subject: [ofa-general] Userspace support for SA event registration (was:
	IPoIB path caching)
In-Reply-To: <20070726181132.GO19768@obsidianresearch.com>
References: <46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
	<20070726181132.GO19768@obsidianresearch.com>
Message-ID: <46A90136.2090305@ichips.intel.com>

> This same argument is why IPoIB ARP decisions really have no bearing
> on IB PRs. IPoIB ARP logic and refreshes is designed to support the
> distributed ND lookup model - IB PR's have completely different
> lifetime rules that are totally unrelated to ARP's lifetime rules.  The
> existing trap monitoring in Sean's module covers about 90% of the
> cases in IB when you need to invalidate a PR, the last 10% will need
> something new :(
> 
> Sean, it seems to me that a lot of what is being talked about here
> really boils down to policy decisions about the caching. Maybe you'd
> see less resistance if the kernel module didn't have any policy and
> that was left to userspace. Even your choice today of putting the big
> GetTable query in the kernel strikes me as something I'd prefer to see
> in userspace.

In order to migrate the local SA to user space, we need a way to export
SA event registration.  And I don't think we've ever reached agreement
on the best approach to doing this.

I've posted patches for one approach:

http://lists.openfabrics.org/pipermail/general/2007-February/032487.html

This exposes a user space SA interface for event registration and raw IB
multicast support.  The approach is generic enough that it could be
extended to other SA queries, but the user MAD interface covers this
area as well.

I'd like to get agreement on an approach for this, even outside of local 
SA support.

- Sean


From panda at cse.ohio-state.edu  Thu Jul 26 15:02:25 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu, 26 Jul 2007 18:02:25 -0400 (EDT)
Subject: [ofa-general] Announcing the release of MVAPICH2 1.0-beta
Message-ID: <200707262202.l6QM2PLK007824@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of
MVAPICH2-1.0-beta with the following NEW features:

- Message coalescing support to reduce per-queue-pair send queues and
  thus the memory requirement on large scale clusters. This design also
  significantly increases the small message rate. Available for
  OpenFabrics Gen2-IB.

- Hot-Spot Avoidance Mechanism (HSAM) for alleviating
  network congestion in large scale clusters. Available for
  OpenFabrics Gen2-IB.

- RDMA CM based on-demand connection management for large scale
  clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- uDAPL on-demand connection management for large scale clusters.
  Available for uDAPL interface (including Solaris IB implementation).
 
- RDMA Read support for increased overlap of computation and
  communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- Application-initiated system-level (synchronous) checkpointing in
  addition to the user-transparent checkpointing. User applications can
  now request a whole-program checkpoint synchronously with BLCR by
  calling special functions within the application. Available for
  OpenFabrics Gen2-IB.

- Network-Level fault tolerance with Automatic Path Migration (APM)
  for tolerating intermittent network failures over InfiniBand.
  Available for OpenFabrics Gen2-IB.

- Integrated multi-rail communication support for OpenFabrics
  Gen2-iWARP.

- Blocking mode of communication progress. Available for OpenFabrics
  Gen2-IB.

- Based on MPICH2 1.0.5p4.

To download the MVAPICH2 1.0-beta source code and the associated user
guide, or to access the anonymous SVN, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All feedback, including bug reports and hints for performance tuning,
is welcome. Please post it to the mvapich-discuss mailing list.

Thanks, 

MVAPICH Team


From sashak at voltaire.com  Thu Jul 26 15:41:33 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 01:41:33 +0300
Subject: [ofa-general] Re: pkey.sim.tcl
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
Message-ID: <20070726224133.GC2472@sashak.voltaire.com>

Hi Eitan,

On 09:26 Thu 26 Jul     , Eitan Zahavi wrote:
> 
> I am happy you actually use the simulator.
> Please provide more info regarding the failure. You should tar compress
> the /tmp/ibmgtsim.XXXX of your run.

I can send this to you if you want, but the failure is trivial.

> 6. The default PKey is removed from ALL the port  pkey tables
> 7. All PKey tables are validated against initial setup to see that the
> indexes of the assigned "real" pkeys were not altered by the SM.
> 8. A single switch is selected and its Change Bit is raised.
> 9. Wait for SUBNET UP
> 10. Validate all ports got their default pkey back.
> 
> I suspect from our thread about not setting LFT that stage 10 failed for
> you.

Yes, and it is due to (6), where the default PKey is removed "externally".
I'm not sure that OpenSM should handle the case where a pkey table is
modified externally by something other than the SM.
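
To illustrate why (10) fails with the patch applied, here is a rough
model. It is a sketch only: struct port_model and process_port_info() are
hypothetical names, and a single cached pkey entry stands in for the whole
table. The need_update expression is the one from the patch quoted below,
and the state values are meant to follow the IB PortState encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Link states, numbered as in the IB PortState field. */
enum link_state { LINK_DOWN = 1, LINK_INIT = 2, LINK_ARMED = 3, LINK_ACTIVE = 4 };

/* Hypothetical reduced view of a physical port as seen by the SM. */
struct port_model {
    unsigned need_update;   /* mirrors p_physp->need_update in the patch */
    uint16_t cached_pkey0;  /* SM's cached copy of pkey table entry 0 */
    uint16_t actual_pkey0;  /* what the port really holds */
};

/* One PortInfo receive: the cached table is refreshed only when the port
 * reached INIT or a caller forced need_update above 1.
 * Returns true if the cache matches the port afterwards. */
bool process_port_info(struct port_model *p, enum link_state state)
{
    assert(p != NULL);
    p->need_update = (state == LINK_INIT || p->need_update > 1) ? 1 : 0;
    if (p->need_update)
        p->cached_pkey0 = p->actual_pkey0;
    return p->cached_pkey0 == p->actual_pkey0;
}
```

If something other than the SM clears the default pkey (0xFFFF) while the
port stays ACTIVE, process_port_info() leaves the cache stale, so the SM
sees nothing to restore. Only a transition through INIT refreshes it.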

Sasha

> 
> Eitan
> 
> 
> 
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
>  
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> > Sent: Wednesday, July 25, 2007 11:24 PM
> > To: Eitan Zahavi; Yevgeny Kliteynik
> > Cc: Hal Rosenstock; general at lists.openfabrics.org
> > Subject: pkey.sim.tcl (was: [PATCH] opensm: detect port 
> > external reset andflush cached tables)
> > 
> > Hi Eitan, Yevgeny,
> > 
> > 
> > On 00:54 Wed 25 Jul     , Sasha Khapyorsky wrote:
> > > 
> > > This detects port external reset by validating PortState == 
> > INIT, and 
> > > when detected flushes cached port related tables - re-reads 
> > pkey table 
> > > and drops (overwrites) SL2VL and VLArb tables.
> > > 
> > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > 
> > [snip...]
> > > diff --git a/opensm/opensm/osm_port_info_rcv.c 
> > > b/opensm/opensm/osm_port_info_rcv.c
> > > index 6fe2d1d..0528e38 100644
> > > --- a/opensm/opensm/osm_port_info_rcv.c
> > > +++ b/opensm/opensm/osm_port_info_rcv.c
> > > @@ -801,6 +801,12 @@ osm_pi_rcv_process(
> > >        p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
> > >      }
> > >  
> > > +    /* if port just inited or reached INIT state (external reset)
> > > +       request update for port related tables */
> > > +    p_physp->need_update =
> > > +      (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
> > > +       p_physp->need_update > 1 ) ? 1 : 0;
> > > +
> > >      switch( osm_node_get_type( p_node ) )
> > >      {
> > >      case IB_NODE_TYPE_CA:
> > > @@ -824,7 +830,8 @@ osm_pi_rcv_process(
> > >      /*
> > >        Get the tables on the physp.
> > >      */
> > > -    __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, 
> > p_physp );
> > > +    if (p_physp->need_update)
> > > +      __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, 
> > p_node, p_physp 
> > > + );
> > 
> > When testing this patch, I tried it with ibmgtsim and test failed:
> > 
> >   RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo 
> > -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl
> > 
> > The failure results from the port pkey table modifications 
> > which are performed in pkey.sim.tcl. Why should we do this? Is 
> > this a legal scenario, where pkey tables are modified externally 
> > without the Partition Manager?
> > 
> > Sasha
> > 


From sashak at voltaire.com  Thu Jul 26 15:47:10 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 01:47:10 +0300
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
	<OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
	<6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID: <20070726224710.GD2472@sashak.voltaire.com>

On 09:00 Thu 26 Jul     , Eitan Zahavi wrote:
> I propose that when there is no MTU in the partition policy file OpenSM
> use a 
> configurable default from: /etc/cache/opensm/opensm.opt.
> Something like:
> # The default MTU to be used for IPoIB and other MCGs when the
> partition-policy 
> # does not provide exact value. The default is the lowest possible MTU
> mcg_default_mtu 1

Looks like a good solution to me.

Would somebody care to send a patch?
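
Roughly, the selection could look like the sketch below. Nothing here is
existing OpenSM code: select_mcg_mtu(), struct mcg_opts and MTU_NOT_SET are
hypothetical names, mcg_default_mtu is the option name proposed above, and
falling back to the lowest MTU on a bad option value is my assumption. The
MTU codes are the IB encoding (1 = 256 bytes ... 5 = 4096 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IB MTU codes: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
enum { IB_MTU_CODE_MIN = 1, IB_MTU_CODE_MAX = 5 };

/* Hypothetical marker for "no MTU given in the partition policy". */
#define MTU_NOT_SET 0

/* Hypothetical holder for the proposed opensm.opt value. */
struct mcg_opts {
    uint8_t mcg_default_mtu;
};

/* Returns the MTU code to use for a partition's multicast group. */
uint8_t select_mcg_mtu(const struct mcg_opts *opts, uint8_t partition_mtu)
{
    assert(opts != NULL);
    if (partition_mtu != MTU_NOT_SET)
        return partition_mtu;              /* explicit policy value wins */
    if (opts->mcg_default_mtu >= IB_MTU_CODE_MIN &&
        opts->mcg_default_mtu <= IB_MTU_CODE_MAX)
        return opts->mcg_default_mtu;      /* configured default */
    return IB_MTU_CODE_MIN;                /* assumed fallback: lowest MTU */
}

/* Converts an MTU code to bytes; returns 0 for an invalid code. */
unsigned mtu_code_to_bytes(uint8_t code)
{
    if (code < IB_MTU_CODE_MIN || code > IB_MTU_CODE_MAX)
        return 0;
    return 128u << code;
}
```

With mcg_default_mtu 1 and no MTU in the policy file this yields code 1
(256 bytes); a partition that sets code 4 keeps its 2048 bytes.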

Sasha

>  
> Eitan Zahavi 
> Senior Engineering Director, Software Architect 
> Mellanox Technologies LTD 
> Tel:+972-4-9097208
> Fax:+972-4-9593245 
> P.O. Box 586 Yokneam 20692 ISRAEL 
>  
> 
> 
> ________________________________
> 
> 	From: Shirley Ma [mailto:xma at us.ibm.com] 
> 	Sent: Wednesday, July 25, 2007 10:45 PM
> 	To: Eitan Zahavi
> 	Cc: general at lists.openfabrics.org; Hal Rosenstock
> 	Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
> 	
> 	
> 
> 	Hello Eitan, Hal,
> 	
> 	Thanks. It's good that openSM has a configuration option to set up
> these attributes in MC. Would it be a good idea to add the following to
> openSM: when there is no MTU defined in the configuration file, the SM
> picks the smallest link MTU in the fabric by default? MTU is unlike rate;
> a slower rate might indicate a cabling problem. So using the smallest
> link MTU in the fabric might not be a bad choice for MC by default. The
> reason I ask is that when creating an IP multicast group, MTU is not an
> attribute of the group. When mapping IP multicast to IB multicast, the IB
> multicast might fail because of different IB link MTU sizes in the group,
> but the IP multicast group will succeed without knowing about the
> failure. If the admin sets the MTU in the configuration file, the admin
> would know about this failure. Otherwise, admins/users could spend too
> much time debugging their broken multicast applications.
> 	
> 	Thanks
> 	Shirley Ma
> 	
> 	"Eitan Zahavi" <eitan at mellanox.co.il>
> 	07/25/07 12:25 PM
> 	To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
> 	cc: <general at lists.openfabrics.org>
> 	Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
> 
> 	Hi Shirley,
> 	
> 	I think I understand where your question comes from...
> 	Many have issues with heterogeneous fabrics where not all nodes
> have the same MTU or speed.
> 	Especially when IPoIB relies on all nodes joining the broadcast
> group.
> 	
> 	The term "join" for multicast groups is a little overloaded.
> 	If a node joins an existing MC group it has to have a rate
> (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is
> denied.
> 	If the join is actually a "create" the node has to provide the
> rate and MTU which define the MCG values.
> 	
> 	To allow for administrator to control the IPoIB MCGs MTU and
> rate OpenSM provides the means to control these
> 	values per partition. See the doc/partition-config.doc 
> 	Still the administrator should know what would be the lowest MTU
> and rate the nodes expected to join the IPoIB subnet have.
> 	The tradeoff is in the hands of the administrator who can set a
> value that will prevent slow nodes from joining the group, 
> 	or assign a low value that will fit all nodes but slow down
> communication ...
> 	
> 	EZ 
> 
> 	Eitan Zahavi 
> 	Senior Engineering Director, Software Architect 
> 	Mellanox Technologies LTD 
> 	Tel:+972-4-9097208
> 	Fax:+972-4-9593245 
> 	P.O. Box 586 Yokneam 20692 ISRAEL 
> 
> 	
> 	
> 	
> ________________________________
> 
> 	From: general-bounces at lists.openfabrics.org [
> mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal
> Rosenstock
> 	Sent: Wednesday, July 25, 2007 10:01 PM
> 	To: Shirley Ma
> 	Cc: general at lists.openfabrics.org
> 	Subject: [ofa-general] Re: openSM: Different IB MTUs
> 	
> 	Shirley,
> 	
> 	On 7/25/07, Shirley Ma <xma at us.ibm.com <mailto:xma at us.ibm.com> >
> wrote: 
> 
> 		Hal,
> 		
> 		Thanks for your prompt reply. I am asking for how openSM
> handle different link MTUs in SA MCMemberRecord MTU. For example, if we
> have some links MTU as 2K, some links MTU as 1K. Then when enabling
> IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size?
> When creating an IB multicast group from a 2K MTU node first, which PMTU
> value is attaching to this IB multicast group MCMemberRecord MTU? 
> 
> 
> 	
> 	MCMemberRecord MTU gets the group MTU (when created). This is
> either this first joiner with sufficient components or preconfigured
> (and MTU can be set in the config). If a joiner has insufficient MTU for
> the group, it is denied. 
> 	
> 	-- Hal
> 	
> 	
> 
> 		Thanks
> 		Shirley Ma
> 		
> 		"Hal Rosenstock" <hal.rosenstock at gmail.com>
> 		07/25/07 10:57 AM
> 		To: Shirley Ma/Beaverton/IBM at IBMUS
> 		cc: general at lists.openfabrics.org
> 		Subject: Re: openSM: Different IB MTUs
> 		
> 		Shirley,
> 		
> 		On 7/25/07, Shirley Ma < xma at us.ibm.com
> <mailto:xma at us.ibm.com> > wrote: 
> 
> 				Hello Hal,
> 				
> 				How does openSM handle CAs with
> different MTUs in the same subnet? For example, IPoIB broadcast group
> MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in
> the subnet? 
> 
> 		
> 		
> 		Are you asking about link MTU, SA
> PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ?
> 		
> 		-- Hal 
> 
> 				Thanks
> 				Shirley Ma
> 
> 
> 




From sashak at voltaire.com  Thu Jul 26 15:50:05 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 01:50:05 +0300
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <f0e08f230707260558y4b62d4d1heec74058013eb12@mail.gmail.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
	<OF1EB21731.B109AF64-ON87257323.006B5BCC-88257323.004090BE@us.ibm.com>
	<6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
	<f0e08f230707260558y4b62d4d1heec74058013eb12@mail.gmail.com>
Message-ID: <20070726225005.GE2472@sashak.voltaire.com>

On 08:58 Thu 26 Jul     , Hal Rosenstock wrote:
> On 7/26/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> >  *I propose that when there is no MTU in the partition policy file
> > OpenSM use a *
> > *configurable default from: **/etc/cache/opensm/opensm.opt.*
> >
> 
> That would make this the default rather than 2K. IMO it should be when some
> "special" unused mtu is set in the partition config.

"No value" should be suitable too. No?

Sasha

> 
> -- Hal
> 
>  *Something like:*
> > *# The default MTU to be used for IPoIB and other MCGs when the
> > partition-policy *
> > *# does not provide exact value. The default is the lowest possible MTU*
> > *mcg_default_mtu 1*
> > **
> > *Eitan Zahavi***
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >  ------------------------------
> > *From:* Shirley Ma [mailto:xma at us.ibm.com]
> > *Sent:* Wednesday, July 25, 2007 10:45 PM
> > *To:* Eitan Zahavi
> > *Cc:* general at lists.openfabrics.org; Hal Rosenstock
> > *Subject:* RE: [ofa-general] Re: openSM: Different IB MTUs
> >
> >
> >
> > Hello Eitan, Hal,
> >
> > Thanks. It's good that openSM has a configuration option to set up these
> > attributes in MC. Would it be a good idea to add the following to openSM:
> > when there is no MTU defined in the configuration file, the SM picks the
> > smallest link MTU in the fabric by default? MTU is unlike rate; a slower
> > rate might indicate a cabling problem. So using the smallest link MTU in
> > the fabric might not be a bad choice for MC by default. The reason I ask
> > is that when creating an IP multicast group, MTU is not an attribute of
> > the group. When mapping IP multicast to IB multicast, the IB multicast
> > might fail because of different IB link MTU sizes in the group, but the
> > IP multicast group will succeed without knowing about the failure. If the
> > admin sets the MTU in the configuration file, the admin would know about
> > this failure. Otherwise, admins/users could spend too much time
> > debugging their broken multicast applications.
> >
> > Thanks
> > Shirley Ma
> >
> > "Eitan Zahavi" <eitan at mellanox.co.il>
> > 07/25/07 12:25 PM
> > To: "Hal Rosenstock" <hal.rosenstock at gmail.com>, Shirley Ma/Beaverton/IBM at IBMUS
> > cc: <general at lists.openfabrics.org>
> > Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
> >
> > *Hi Shirley,*
> >
> > *I think I understand where your question comes from...*
> > *Many have issues with heterogeneous fabrics where not all nodes have the
> > same MTU or speed.*
> > *Especially when IPoIB relies on all nodes joining the broadcast group.*
> >
> > *The term "join" for multicast groups is a little overloaded.*
> > *If a node joins an existing MC group it has to have a rate (speed *
> > width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied.*
> > *If the join is actually a "create" the node has to provide the rate and
> > MTU which define the MCG values.*
> >
> > *To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM
> > provides the means to control these*
> > *values per partition. See the doc/partition-config.doc*
> > *Still the administrator should know what would be the lowest MTU and rate
> > the nodes expected to join the IPoIB subnet have.*
> > *The tradeoff is in the hands of the administrator who can set a value
> > that will prevent slow nodes from joining the group, *
> > *or assign a low value that will fit all nodes but slow down communication
> > ...*
> >
> > *EZ*
> >
> > *Eitan Zahavi*
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >
> > ------------------------------
> > *From:* general-bounces at lists.openfabrics.org [
> > mailto:general-bounces at lists.openfabrics.org<general-bounces at lists.openfabrics.org>]
> > *On Behalf Of *Hal Rosenstock*
> > Sent:* Wednesday, July 25, 2007 10:01 PM*
> > To:* Shirley Ma*
> > Cc:* general at lists.openfabrics.org*
> > Subject:* [ofa-general] Re: openSM: Different IB MTUs
> >
> > Shirley,
> >
> > On 7/25/07, *Shirley Ma* <*xma at us.ibm.com* <xma at us.ibm.com>> wrote:
> >
> >    Hal,
> >
> >    Thanks for your prompt reply. I am asking for how openSM handle
> >    different link MTUs in SA MCMemberRecord MTU. For example, if we have some
> >    links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM
> >    decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB
> >    multicast group from a 2K MTU node first, which PMTU value is attaching to
> >    this IB multicast group MCMemberRecord MTU?
> >
> >
> >
> > MCMemberRecord MTU gets the group MTU (when created). This is either this
> > first joiner with sufficient components or preconfigured (and MTU can be set
> > in the config). If a joiner has insufficient MTU for the group, it is
> > denied.
> >
> > -- Hal
> >
> >
> >    Thanks
> >    Shirley Ma
> >
> >    "Hal Rosenstock" <hal.rosenstock at gmail.com>
> >    07/25/07 10:57 AM
> >    To: Shirley Ma/Beaverton/IBM at IBMUS
> >    cc: general at lists.openfabrics.org
> >    Subject: Re: openSM: Different IB MTUs
> >
> >    Shirley,
> >
> >    On 7/25/07, *Shirley Ma* <* **xma at us.ibm.com* <xma at us.ibm.com>>
> >    wrote:
> >       Hello Hal,
> >
> >          How does openSM handle CAs with different MTUs in the
> >          same subnet? For example, IPoIB broadcast group MTU, IB multicast group
> >          PMTU? Does openSM pick up the smallest MTU in the subnet?
> >
> >
> >    Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
> >    MCMemberRecord MTU, or all of these ?
> >
> >    -- Hal
> >       Thanks
> >          Shirley Ma
> >
> >
> >
> >




From sashak at voltaire.com  Thu Jul 26 16:15:40 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:15:40 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h:
	Fix comment
In-Reply-To: <f0e08f230707260716j4a824cddqe06acfd2b19983e@mail.gmail.com>
References: <f0e08f230707260716j4a824cddqe06acfd2b19983e@mail.gmail.com>
Message-ID: <20070726231540.GF2472@sashak.voltaire.com>

On 14:16 Thu 26 Jul     , Hal Rosenstock wrote:
> include/iba/ib_types.h: Fix comment
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Jul 26 16:16:04 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:16:04 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM/osm_port.c: Fix opvls and
	neighbormtu when remote port invalid
In-Reply-To: <f0e08f230707260725m5779a12dw26dd5af3bc29ff56@mail.gmail.com>
References: <f0e08f230707260725m5779a12dw26dd5af3bc29ff56@mail.gmail.com>
Message-ID: <20070726231604.GG2472@sashak.voltaire.com>

On 10:25 Thu 26 Jul     , Hal Rosenstock wrote:
> OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Jul 26 16:22:43 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:22:43 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: More changes from osm.log
	to opensm.log
In-Reply-To: <f0e08f230707261053ge692598s87d8a03e3cec133c@mail.gmail.com>
References: <f0e08f230707261053ge692598s87d8a03e3cec133c@mail.gmail.com>
Message-ID: <20070726232243.GH2472@sashak.voltaire.com>

On 13:53 Thu 26 Jul     , Hal Rosenstock wrote:
> OpenSM: More changes from osm.log to opensm.log
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Jul 26 16:30:05 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:30:05 +0300
Subject: [ofa-general] [PATCH] opensm: remove reassign_lfts configuration
	parameter
Message-ID: <20070726233005.GJ2472@sashak.voltaire.com>


This removes the effectively useless subn.opt.reassign_lfts parameter.
Its value is used only for the initial setup of the ignore_existing_lfts
flag, but later this flag becomes unconditionally TRUE if at least one
switch is found in the fabric. If not (the fabric is just back-to-back
connected CAs), there is no routing and this flag is not used. In either
case the initial value does not matter.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_subnet.h |   10 +---------
 opensm/opensm/osm_subnet.c         |   14 +-------------
 2 files changed, 2 insertions(+), 22 deletions(-)
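
A small sanity check of the reasoning above. This is a sketch only:
effective_ignore_existing_lfts() is a hypothetical model of how the flag
evolves before LFT assignment, not a function in OpenSM.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the ignore_existing_lfts flag as seen by LFT assignment.
 * initial:      value that used to be copied from reassign_lfts
 * num_switches: switches found in the fabric during the sweep
 * The result only matters when num_switches > 0; with no switches there
 * are no LFTs to assign and the flag is never consulted. */
bool effective_ignore_existing_lfts(bool initial, unsigned num_switches)
{
    bool flag = initial;

    if (num_switches > 0)
        flag = true;    /* any change to the switch list sets the flag */
    return flag;
}

/* Checks that, whenever routing happens, both initial values agree. */
void check_initial_value_never_matters(void)
{
    for (unsigned n = 1; n <= 4; n++)
        assert(effective_ignore_existing_lfts(true, n) ==
               effective_ignore_existing_lfts(false, n));
}
```

Both initial values give the same result as soon as one switch exists,
which is why the patch can hardcode TRUE in osm_subn_init().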

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 92f2bc0..84ed6d4 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -239,7 +239,6 @@ typedef struct _osm_subn_opt
   uint8_t		   max_op_vls;
   uint8_t		   force_link_speed;
   boolean_t		   reassign_lids;
-  boolean_t		   reassign_lfts;
   boolean_t		   ignore_other_sm;
   boolean_t		   single_thread;
   boolean_t		   no_multicast_option;
@@ -345,12 +344,6 @@ typedef struct _osm_subn_opt
 *		Otherwise (the default),
 *		OpenSM always tries to preserve as LIDs as much as possible.
 *
-*	reassign_lfts
-*		If TRUE ignore existing LFT entries on first sweep (default).
-*		Otherwise only non minimal hop cases are modified.
-*		NOTE: A standby SM clears its first sweep flag - since the
-*		master SM already sweeps...
-*
 *	ignore_other_sm_option
 *		This flag is TRUE if other SMs on the subnet should be ignored.
 *
@@ -656,9 +649,8 @@ typedef struct _osm_subn
 *     This flag is a dynamic flag to instruct the LFT assignment to
 *     ignore existing legal LFT settings.
 *     The value will be set according to :
-*     - During SM init set to the reassign_lfts flag value
-*     - Coming out of STANDBY it will be cleared (other SM worked)
 *     - Any change to the list of switches will set it to high
+*     - Coming out of STANDBY it will be cleared (other SM worked)
 *     - Set to FALSE upon end of all lft assignments.
 *
 *  subnet_initalization_error
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 7e17945..7e7a4d5 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -221,8 +221,7 @@ osm_subn_init(
   /* note that insert and remove are part of the port_profile thing */
   cl_map_init(&(p_subn->opt.port_prof_ignore_guids), 10);
 
-  /* ignore_existing_lfts follows reassign_lfts on first sweep */
-  p_subn->ignore_existing_lfts = p_subn->opt.reassign_lfts;
+  p_subn->ignore_existing_lfts = TRUE;
 
   /* we assume master by default - so we only need to set it true if STANDBY */
   p_subn->coming_out_of_standby = FALSE;
@@ -451,7 +450,6 @@ osm_subn_set_default_opt(
   p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS;
   p_opt->force_link_speed = 15;
   p_opt->reassign_lids = FALSE;
-  p_opt->reassign_lfts = TRUE;
   p_opt->ignore_other_sm = FALSE;
   p_opt->single_thread = FALSE;
   p_opt->no_multicast_option = FALSE;
@@ -1221,10 +1219,6 @@ osm_subn_parse_conf_file(
         p_key, p_val, &p_opts->reassign_lids);
 
       __osm_subn_opts_unpack_boolean(
-        "reassign_lfts",
-        p_key, p_val, &p_opts->reassign_lfts);
-
-      __osm_subn_opts_unpack_boolean(
         "ignore_other_sm",
         p_key, p_val, &p_opts->ignore_other_sm);
 
@@ -1544,11 +1538,6 @@ osm_subn_write_conf_file(
     "sweep_interval %u\n\n"
     "# If TRUE cause all lids to be reassigned\n"
     "reassign_lids %s\n\n"
-    "# If TRUE ignore existing LFT entries on first sweep (default).\n"
-    "# Otherwise only non minimal hop cases are modified.\n"
-    "# NOTE: A standby SM clears its first sweep flag - since the\n"
-    "# master SM already sweeps...\n"
-    "reassign_lfts %s\n\n"
     "# If TRUE forces every sweep to be a heavy sweep\n"
     "force_heavy_sweep %s\n\n"
     "# If TRUE every trap will cause a heavy sweep.\n"
@@ -1556,7 +1545,6 @@ osm_subn_write_conf_file(
     "sweep_on_trap %s\n\n",
     p_opts->sweep_interval,
     p_opts->reassign_lids ? "TRUE" : "FALSE",
-    p_opts->reassign_lfts ? "TRUE" : "FALSE",
     p_opts->force_heavy_sweep ? "TRUE" : "FALSE",
     p_opts->sweep_on_trap ? "TRUE" : "FALSE"
     );
-- 
1.5.3.rc2.38.g11308


From sashak at voltaire.com  Thu Jul 26 16:31:35 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:31:35 +0300
Subject: [ofa-general] [PATCH] opensm: don't fetch LFTs initially
Message-ID: <20070726233135.GK2472@sashak.voltaire.com>


Do not fetch the initial LFTs of discovered switches. OpenSM doesn't
use them anyway, and fetching them creates additional subnet traffic.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_sw_info_rcv.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index 726cc06..8eb8cd5 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -134,6 +134,7 @@ __osm_si_rcv_get_port_info(
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
+#if 0
 /**********************************************************************
  The plock must be held before calling this function.
 **********************************************************************/
@@ -198,7 +199,6 @@ __osm_si_rcv_get_fwd_tbl(
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
-#if 0
 /**********************************************************************
  The plock must be held before calling this function.
 **********************************************************************/
@@ -399,10 +399,9 @@ __osm_si_rcv_process_new(
     Get the PortInfo attribute for every port.
   */
   __osm_si_rcv_get_port_info( p_rcv, p_sw, p_madw );
-  __osm_si_rcv_get_fwd_tbl( p_rcv, p_sw );
 
   /*
-    Don't bother retrieving the current multicast tables
+    Don't bother retrieving the current unicast and multicast tables
     from the switches.  The current version of SM does
     not support silent take-over of an existing multicast
     configuration.
@@ -413,6 +412,7 @@ __osm_si_rcv_process_new(
     The code to retrieve the tables was fully debugged.
   */
 #if 0
+  __osm_si_rcv_get_fwd_tbl( p_rcv, p_sw );
   if( !p_rcv->p_subn->opt.disable_multicast )
     __osm_si_rcv_get_mcast_fwd_tbl( p_rcv, p_sw );
 #endif
-- 
1.5.3.rc2.38.g11308


From sashak at voltaire.com  Thu Jul 26 16:32:47 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:32:47 +0300
Subject: [ofa-general] [PATCH] opensm: remove static __some_hop_count_set var
Message-ID: <20070726233247.GL2472@sashak.voltaire.com>


This removes the static variable __some_hop_count_set from osm_ucast_mgr
and instead uses a flag stored as a structure member.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
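The pattern in this change can be sketched outside OpenSM as follows.
This is a minimal illustration, not the OpenSM algorithm: struct
relax_mgr, relax_pass(), relax_until_stable() and NO_PATH are
hypothetical, and the "fabric" is just a chain of nodes with hop counts
to node 0. The point is that the "something changed" flag lives in the
manager object rather than in a file-scope static.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_PATH 255	/* hypothetical sentinel for "no route yet" */

/* Hypothetical manager: the change flag is a structure member. */
struct relax_mgr {
	bool some_hop_count_set;
};

/* One pass over a chain of n nodes; hops[i] is the distance to node 0. */
static void relax_pass(struct relax_mgr *mgr, unsigned char *hops, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* the two neighbours in the chain (clamped at the ends) */
		size_t nb[2] = { i > 0 ? i - 1 : i, i + 1 < n ? i + 1 : i };

		for (int k = 0; k < 2; k++) {
			unsigned char h = hops[nb[k]];

			if (h != NO_PATH && h + 1 < hops[i]) {
				hops[i] = (unsigned char)(h + 1);
				mgr->some_hop_count_set = true;
			}
		}
	}
}

/*
 * Repeat passes until one pass changes nothing or iteration_max is
 * reached. Returns the number of passes that were run.
 */
static unsigned relax_until_stable(struct relax_mgr *mgr, unsigned char *hops,
				   size_t n, unsigned iteration_max)
{
	unsigned i;

	mgr->some_hop_count_set = true;
	for (i = 0; i < iteration_max && mgr->some_hop_count_set; i++) {
		mgr->some_hop_count_set = false;
		relax_pass(mgr, hops, n);
	}
	return i;
}
```

With the flag in the structure, two manager instances can run their
loops independently, which a single static variable would not allow.
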
 opensm/include/opensm/osm_ucast_mgr.h |    6 ++++++
 opensm/opensm/osm_ucast_mgr.c         |   16 ++++------------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
index 3381616..4824971 100644
--- a/opensm/include/opensm/osm_ucast_mgr.h
+++ b/opensm/include/opensm/osm_ucast_mgr.h
@@ -105,6 +105,7 @@ typedef struct _osm_ucast_mgr
 	osm_log_t	*p_log;
 	cl_plock_t	*p_lock;
 	boolean_t	 any_change;
+	boolean_t	some_hop_count_set;
 	uint8_t		*lft_buf;
 } osm_ucast_mgr_t;
 /*
@@ -126,6 +127,11 @@ typedef struct _osm_ucast_mgr
 *		set to TRUE by osm_ucast_mgr_set_fwd_table() if any mad
 *		was sent.
 *
+*	some_hop_count_set
+*		Initialized to FALSE at the beginning of each iteration of
+*		the min hop tables calculation, set to TRUE to indicate
+*		that some hop count changes were done.
+*
 *	lft_buf
 *		LFT buffer - used during LFT calculation/setup.
 *
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index a8fc649..f049e74 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -66,14 +66,6 @@
 
 /**********************************************************************
  **********************************************************************/
-/*
- * This flag is used for stopping the relaxation algorithm if no
- * change detected during the fabric scan
- */
-static boolean_t __some_hop_count_set;
-
-/**********************************************************************
- **********************************************************************/
 void
 osm_ucast_mgr_construct(
   IN osm_ucast_mgr_t* const p_mgr )
@@ -531,7 +523,7 @@ __osm_ucast_mgr_process_neighbor(
                  "cannot set hops for lid %u at switch 0x%" PRIx64 "\n",
                  lid_ho,
                  cl_ntoh64(osm_node_get_node_guid(p_this_sw->p_node)));
-      __some_hop_count_set = TRUE;
+      p_mgr->some_hop_count_set = TRUE;
     }
   }
 
@@ -1020,10 +1012,10 @@ osm_ucast_mgr_build_lid_matrices(
       if non of the switches was set will exit the
       while loop
     */
-    __some_hop_count_set = TRUE;
-    for( i = 0; (i < iteration_max) && __some_hop_count_set; i++ )
+    p_mgr->some_hop_count_set = TRUE;
+    for( i = 0; (i < iteration_max) && p_mgr->some_hop_count_set; i++ )
     {
-      __some_hop_count_set = FALSE;
+      p_mgr->some_hop_count_set = FALSE;
       cl_qmap_apply_func( p_sw_guid_tbl,
                           __osm_ucast_mgr_process_neighbors, p_mgr );
     }
-- 
1.5.3.rc2.38.g11308


From sashak at voltaire.com  Thu Jul 26 16:34:02 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 02:34:02 +0300
Subject: [ofa-general] [PATCH] opensm: dumpers improvements
Message-ID: <20070726233402.GM2472@sashak.voltaire.com>


As discussed previously on the list, this moves the ucast and mcast
dumper functions to a separate file (osm_dump.c). The dump generators
will be invoked after a heavy sweep.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
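The structure of the new file (one generic driver plus per-item dumper
callbacks that share a context) might be sketched like this. It is a
stand-alone illustration under assumptions: struct item, struct
dump_ctx, apply_func() and dump_items_to_file() are hypothetical
stand-ins for cl_map_item_t, struct dump_context, cl_qmap_apply_func()
and dump_qmap_to_file(), and a plain array replaces the qmap.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical item type standing in for a map item. */
struct item {
	unsigned id;
};

/* Context handed to every callback, similar in spirit to dump_context. */
struct dump_ctx {
	FILE *file;
	unsigned count;
};

typedef void (*item_func_t)(struct item *, void *);

/* Stand-in for an apply-function helper: call func on every item. */
static void apply_func(struct item *items, size_t n, item_func_t func,
		       void *cxt)
{
	for (size_t i = 0; i < n; i++)
		func(&items[i], cxt);
}

/* One dumper: writes a line per item through the shared context. */
static void dump_item(struct item *p_item, void *cxt)
{
	struct dump_ctx *ctx = cxt;

	fprintf(ctx->file, "item 0x%04x\n", p_item->id);
	ctx->count++;
}

/*
 * Generic driver: open the file, run the given dumper over all items,
 * close the file. Returns the number of items dumped, or -1 if the
 * file cannot be created.
 */
static int dump_items_to_file(const char *path, struct item *items,
			      size_t n, item_func_t func)
{
	struct dump_ctx ctx = { .file = fopen(path, "w"), .count = 0 };

	if (!ctx.file)
		return -1;
	apply_func(items, n, func, &ctx);
	fclose(ctx.file);
	return (int)ctx.count;
}
```

Adding another dump format then only needs a new callback; the file
handling stays in one place.
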
 opensm/include/opensm/osm_opensm.h |    4 +
 opensm/opensm/Makefile.am          |    2 +-
 opensm/opensm/osm_dump.c           |  434 ++++++++++++++++++++++++++++++++++++
 opensm/opensm/osm_mcast_mgr.c      |  138 +-----------
 opensm/opensm/osm_state_mgr.c      |    1 +
 opensm/opensm/osm_ucast_mgr.c      |  336 +---------------------------
 6 files changed, 443 insertions(+), 472 deletions(-)
 create mode 100644 opensm/opensm/osm_dump.c

diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h
index 2b09129..2d668e9 100644
--- a/opensm/include/opensm/osm_opensm.h
+++ b/opensm/include/opensm/osm_opensm.h
@@ -444,6 +444,10 @@ osm_opensm_wait_for_subnet_up(
 * SEE ALSO
 *********/
 
+/* dump helpers */
+void osm_dump_mcast_routes(osm_opensm_t *osm);
+void osm_dump_all(osm_opensm_t *osm);
+
 /****v* OpenSM/osm_exit_flag
 */
 extern volatile unsigned int osm_exit_flag;
diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index c94897c..46770b4 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -56,7 +56,7 @@ opensm_SOURCES = main.c osm_console.c osm_db_files.c \
 		 osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \
 		 osm_vl15intf.c osm_vl_arb_rcv.c \
 		 st.c osm_perfmgr.c osm_perfmgr_db.c \
-		 osm_event_plugin.c
+		 osm_event_plugin.c osm_dump.c
 if OSMV_OPENIB
 opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c
new file mode 100644
index 0000000..367d941
--- /dev/null
+++ b/opensm/opensm/osm_dump.c
@@ -0,0 +1,434 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Various OpenSM dumpers
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif				/* HAVE_CONFIG_H */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <iba/ib_types.h>
+#include <complib/cl_qmap.h>
+#include <complib/cl_debug.h>
+#include <opensm/osm_opensm.h>
+#include <opensm/osm_log.h>
+#include <opensm/osm_node.h>
+#include <opensm/osm_switch.h>
+#include <opensm/osm_helper.h>
+#include <opensm/osm_msgdef.h>
+#include <opensm/osm_opensm.h>
+
+struct dump_context {
+	osm_opensm_t *p_osm;
+	FILE *file;
+};
+
+static void dump_ucast_path_distribution(cl_map_item_t * p_map_item, void *cxt)
+{
+	osm_node_t *p_node;
+	osm_node_t *p_remote_node;
+	uint8_t i;
+	uint8_t num_ports;
+	uint32_t num_paths;
+	ib_net64_t remote_guid_ho;
+	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
+	osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm;
+
+	p_node = p_sw->p_node;
+	num_ports = p_sw->num_ports;
+
+	osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+		       "dump_ucast_path_distribution: "
+		       "Switch 0x%" PRIx64 "\n"
+		       "Port : Path Count Through Port",
+		       cl_ntoh64(osm_node_get_node_guid(p_node)));
+
+	for (i = 0; i < num_ports; i++) {
+		num_paths = osm_switch_path_count_get(p_sw, i);
+		osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, "\n %03u : %u", i,
+			       num_paths);
+		if (i == 0) {
+			osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+				       " (switch management port)");
+			continue;
+		}
+
+		p_remote_node = osm_node_get_remote_node(p_node, i, NULL);
+		if (p_remote_node == NULL)
+			continue;
+
+		remote_guid_ho =
+		    cl_ntoh64(osm_node_get_node_guid(p_remote_node));
+
+		switch (osm_node_get_remote_type(p_node, i)) {
+		case IB_NODE_TYPE_SWITCH:
+			osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+				       " (link to switch");
+			break;
+		case IB_NODE_TYPE_ROUTER:
+			osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+				       " (link to router");
+			break;
+		case IB_NODE_TYPE_CA:
+			osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+				       " (link to CA");
+			break;
+		default:
+			osm_log_printf(&p_osm->log, OSM_LOG_DEBUG,
+				       " (link to unknown node type");
+			break;
+		}
+
+		osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")",
+			       remote_guid_ho);
+	}
+
+	osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, "\n");
+}
+
+static void dump_ucast_routes(cl_map_item_t * p_map_item, void *cxt)
+{
+	const osm_node_t *p_node;
+	osm_port_t *p_port;
+	uint8_t port_num;
+	uint8_t num_hops;
+	uint8_t best_hops;
+	uint8_t best_port;
+	uint16_t max_lid_ho;
+	uint16_t lid_ho, base_lid;
+	boolean_t direct_route_exists = FALSE;
+	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
+	osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm;
+	FILE *file = ((struct dump_context *)cxt)->file;
+
+	p_node = p_sw->p_node;
+
+	max_lid_ho = p_sw->max_lid_ho;
+
+	fprintf(file, "__osm_ucast_mgr_dump_ucast_routes: "
+		"Switch 0x%016" PRIx64 "\n"
+		"LID    : Port : Hops : Optimal\n",
+		cl_ntoh64(osm_node_get_node_guid(p_node)));
+	for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) {
+		fprintf(file, "0x%04X : ", lid_ho);
+
+		p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid_ho);
+		if (!p_port) {
+			fprintf(file, "UNREACHABLE\n");
+			continue;
+		}
+
+		port_num = osm_switch_get_port_by_lid(p_sw, lid_ho);
+		if (port_num == OSM_NO_PATH) {
+			/*
+			   This may occur if there are 'holes' in the existing
+			   LID assignments.  Running SM with --reassign_lids
+			   will reassign and compress the LID range.  The
+			   subnet should work fine either way.
+			 */
+			fprintf(file, "UNREACHABLE\n");
+			continue;
+		}
+		/*
+		   Switches can lie about which port routes a given
+		   lid due to a recent reconfiguration of the subnet.
+		   Therefore, ensure that the hop count is better than
+		   OSM_NO_PATH.
+		 */
+		if (p_port->p_node->sw) {
+			/* Target LID is switch.
+			   Get its base lid and check hop count for this base LID only. */
+			base_lid = osm_node_get_base_lid(p_port->p_node, 0);
+			base_lid = cl_ntoh16(base_lid);
+			num_hops =
+			    osm_switch_get_hop_count(p_sw, base_lid, port_num);
+		} else {
+			/* Target LID is not switch (CA or router).
+			   Check if we have route to this target from current switch. */
+			num_hops =
+			    osm_switch_get_hop_count(p_sw, lid_ho, port_num);
+			if (num_hops != OSM_NO_PATH) {
+				direct_route_exists = TRUE;
+				base_lid = lid_ho;
+			} else {
+				osm_physp_t *p_physp = p_port->p_physp;
+
+				if (!p_physp || !p_physp->p_remote_physp ||
+				    !p_physp->p_remote_physp->p_node->sw)
+					num_hops = OSM_NO_PATH;
+				else {
+					base_lid =
+					    osm_node_get_base_lid(p_physp->
+								  p_remote_physp->
+								  p_node, 0);
+					base_lid = cl_ntoh16(base_lid);
+					num_hops =
+					    p_physp->p_remote_physp->p_node->sw == p_sw ? 0 :
+					    osm_switch_get_hop_count(p_sw,
+								     base_lid,
+								     port_num);
+				}
+			}
+		}
+
+		if (num_hops == OSM_NO_PATH) {
+			fprintf(file, "UNREACHABLE\n");
+			continue;
+		}
+
+		best_hops = osm_switch_get_least_hops(p_sw, base_lid);
+		if (!p_port->p_node->sw && !direct_route_exists) {
+			best_hops++;
+			num_hops++;
+		}
+
+		fprintf(file, "%03u  : %02u   : ", port_num, num_hops);
+
+		if (best_hops == num_hops)
+			fprintf(file, "yes");
+		else {
+			best_port = osm_switch_recommend_path(p_sw, p_port, lid_ho, TRUE, NULL, NULL, NULL, NULL);	/* No LMC Optimization */
+			fprintf(file, "No %u hop path possible via port %u!",
+				best_hops, best_port);
+		}
+
+		fprintf(file, "\n");
+	}
+}
+
+static void dump_mcast_routes(cl_map_item_t * p_map_item, void *cxt)
+{
+	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
+	FILE *file = ((struct dump_context *)cxt)->file;
+	osm_mcast_tbl_t *p_tbl;
+	int16_t mlid_ho = 0;
+	int16_t mlid_start_ho;
+	uint8_t position = 0;
+	int16_t block_num = 0;
+	boolean_t first_mlid;
+	boolean_t first_port;
+	const osm_node_t *p_node;
+	uint16_t i, j;
+	uint16_t mask_entry;
+	char sw_hdr[256];
+	char mlid_hdr[32];
+
+	p_node = p_sw->p_node;
+
+	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
+
+	sprintf(sw_hdr, "\nSwitch 0x%016" PRIx64 "\n"
+		"LID    : Out Port(s)\n",
+		cl_ntoh64(osm_node_get_node_guid(p_node)));
+	first_mlid = TRUE;
+	while (block_num <= p_tbl->max_block_in_use) {
+		mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
+		for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++) {
+			mlid_ho = mlid_start_ho + i;
+			position = 0;
+			first_port = TRUE;
+			sprintf(mlid_hdr, "0x%04X :",
+				mlid_ho + IB_LID_MCAST_START_HO);
+			while (position <= p_tbl->max_position) {
+				mask_entry =
+				    cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]);
+				if (mask_entry == 0) {
+					position++;
+					continue;
+				}
+				for (j = 0; j < 16; j++) {
+					if ((1 << j) & mask_entry) {
+						if (first_mlid) {
+							fprintf(file, "%s", sw_hdr);
+							first_mlid = FALSE;
+						}
+						if (first_port) {
+							fprintf(file, "%s", mlid_hdr);
+							first_port = FALSE;
+						}
+						fprintf(file, " 0x%03X ",
+							j + (position * 16));
+					}
+				}
+				position++;
+			}
+			if (first_port == FALSE)
+				fprintf(file, "\n");
+		}
+		block_num++;
+	}
+}
+
+static void dump_lid_matrix(cl_map_item_t * p_map_item, void *cxt)
+{
+	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
+	osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm;
+	FILE *file = ((struct dump_context *)cxt)->file;
+	osm_node_t *p_node = p_sw->p_node;
+	unsigned max_lid = p_sw->max_lid_ho;
+	unsigned max_port = p_sw->num_ports;
+	uint16_t lid;
+	uint8_t port;
+
+	fprintf(file, "Switch: guid 0x%016" PRIx64 "\n",
+		cl_ntoh64(osm_node_get_node_guid(p_node)));
+	for (lid = 1; lid <= max_lid; lid++) {
+		osm_port_t *p_port;
+		if (osm_switch_get_least_hops(p_sw, lid) == OSM_NO_PATH)
+			continue;
+		fprintf(file, "0x%04x:", lid);
+		for (port = 0; port < max_port; port++)
+			fprintf(file, " %02x",
+				osm_switch_get_hop_count(p_sw, lid, port));
+		p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid);
+		if (p_port)
+			fprintf(file, " # portguid 0x%" PRIx64,
+				cl_ntoh64(osm_port_get_guid(p_port)));
+		fprintf(file, "\n");
+	}
+}
+
+static void dump_ucast_lfts(cl_map_item_t * p_map_item, void *cxt)
+{
+	osm_switch_t *p_sw = (osm_switch_t *) p_map_item;
+	osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm;
+	FILE *file = ((struct dump_context *)cxt)->file;
+	osm_node_t *p_node = p_sw->p_node;
+	unsigned max_lid = p_sw->max_lid_ho;
+	unsigned max_port = p_sw->num_ports;
+	uint16_t lid;
+	uint8_t port;
+
+	fprintf(file, "Unicast lids [0x0-0x%x] of switch Lid %u guid 0x%016"
+		PRIx64 " (\'%s\'):\n",
+		max_lid, osm_node_get_base_lid(p_node, 0),
+		cl_ntoh64(osm_node_get_node_guid(p_node)), p_node->print_desc);
+	for (lid = 0; lid <= max_lid; lid++) {
+		osm_port_t *p_port;
+		port = osm_switch_get_port_by_lid(p_sw, lid);
+
+		if (port >= max_port)
+			continue;
+
+		fprintf(file, "0x%04x %03u # ", lid, port);
+
+		p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid);
+		if (p_port) {
+			p_node = p_port->p_node;
+			fprintf(file, "%s portguid 0x%016" PRIx64 ": \'%s\'",
+				ib_get_node_type_str(osm_node_get_type(p_node)),
+				cl_ntoh64(osm_port_get_guid(p_port)),
+				p_node->print_desc);
+		} else
+			fprintf(file, "unknown node and type");
+		fprintf(file, "\n");
+	}
+	fprintf(file, "%u lids dumped\n", max_lid);
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void dump_qmap(osm_opensm_t * p_osm, FILE * file,
+		      cl_qmap_t * map, void (*func) (cl_map_item_t *, void *))
+{
+	struct dump_context dump_context;
+
+	dump_context.p_osm = p_osm;
+	dump_context.file = file;
+
+	cl_qmap_apply_func(map, func, &dump_context);
+}
+
+static void dump_qmap_to_file(osm_opensm_t * p_osm, const char *file_name,
+			      cl_qmap_t * map,
+			      void (*func) (cl_map_item_t *, void *))
+{
+	char path[1024];
+	FILE *file;
+
+	snprintf(path, sizeof(path), "%s/%s",
+		 p_osm->subn.opt.dump_files_dir, file_name);
+
+	file = fopen(path, "w");
+	if (!file) {
+		osm_log(&p_osm->log, OSM_LOG_ERROR,
+			"dump_qmap_to_file: "
+			"cannot create file \'%s\': %s\n",
+			path, strerror(errno));
+		return;
+	}
+
+	dump_qmap(p_osm, file, map, func);
+
+	fclose(file);
+}
+
+/**********************************************************************
+ **********************************************************************/
+
+void osm_dump_mcast_routes(osm_opensm_t * osm)
+{
+	if (osm_log_is_active(&osm->log, OSM_LOG_ROUTING)) {
+		/* multicast routes */
+		dump_qmap_to_file(osm, "opensm.mcfdbs",
+				  &osm->subn.sw_guid_tbl, dump_mcast_routes);
+	}
+}
+
+void osm_dump_all(osm_opensm_t * osm)
+{
+	if (osm_log_is_active(&osm->log, OSM_LOG_ROUTING)) {
+		/* unicast routes */
+		dump_qmap_to_file(osm, "opensm-lid-matrix.dump",
+				  &osm->subn.sw_guid_tbl, dump_lid_matrix);
+		dump_qmap_to_file(osm, "opensm-lfts.dump",
+				  &osm->subn.sw_guid_tbl, dump_ucast_lfts);
+		if (osm_log_is_active(&osm->log, OSM_LOG_DEBUG))
+			dump_qmap(osm, NULL, &osm->subn.sw_guid_tbl,
+				  dump_ucast_path_distribution);
+		dump_qmap_to_file(osm, "opensm.fdbs",
+				  &osm->subn.sw_guid_tbl, dump_ucast_routes);
+		/* multicast routes */
+		dump_qmap_to_file(osm, "opensm.mcfdbs",
+				  &osm->subn.sw_guid_tbl, dump_mcast_routes);
+	}
+}
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index f4b64a6..5f64b19 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -48,12 +48,11 @@
 #  include <config.h>
 #endif /* HAVE_CONFIG_H */
 
-#include <unistd.h>
 #include <stdlib.h>
 #include <string.h>
-#include <errno.h>
 #include <iba/ib_types.h>
 #include <complib/cl_debug.h>
+#include <opensm/osm_opensm.h>
 #include <opensm/osm_mcast_mgr.h>
 #include <opensm/osm_multicast.h>
 #include <opensm/osm_node.h>
@@ -61,8 +60,6 @@
 #include <opensm/osm_helper.h>
 #include <opensm/osm_msgdef.h>
 
-#define LINE_LENGTH 256
-
 /**********************************************************************
  **********************************************************************/
 typedef struct _osm_mcast_work_obj
@@ -1336,135 +1333,6 @@ osm_mcast_mgr_process_tree(
 }
 
 /**********************************************************************
- **********************************************************************/
-static void
-mcast_mgr_dump_sw_routes(
-  IN const osm_mcast_mgr_t*   const p_mgr,
-  IN const osm_switch_t*      const p_sw,
-  IN FILE *file )
-{
-  osm_mcast_tbl_t*      p_tbl;
-  int16_t               mlid_ho = 0;
-  int16_t               mlid_start_ho;
-  uint8_t               position = 0;
-  int16_t               block_num = 0;
-  boolean_t             first_mlid;
-  boolean_t             first_port;
-  const osm_node_t*     p_node;
-  uint16_t              i, j;
-  uint16_t              mask_entry;
-  char                  sw_hdr[256];
-  char                  mlid_hdr[32];
-
-  OSM_LOG_ENTER( p_mgr->p_log, mcast_mgr_dump_sw_routes );
-
-  if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
-    goto Exit;
-
-  p_node = p_sw->p_node;
-
-  p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw );
-
-  sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n"
-           "LID    : Out Port(s)\n",
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
-  first_mlid = TRUE;
-  while ( block_num <= p_tbl->max_block_in_use )
-  {
-    mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE);
-    for (i = 0 ; i < IB_MCAST_BLOCK_SIZE ; i++)
-    {
-      mlid_ho = mlid_start_ho + i;
-      position = 0;
-      first_port = TRUE;
-      sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
-      while ( position <= p_tbl->max_position )
-      {
-        mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]);
-        if (mask_entry == 0)
-        {
-          position++;
-          continue;
-        }
-        for (j = 0 ; j < 16 ; j++)
-        {
-          if ( (1 << j) & mask_entry )
-          {
-            if (first_mlid)
-            {
-              fprintf( file,"%s", sw_hdr );
-              first_mlid = FALSE;
-            }
-            if (first_port)
-            {
-              fprintf( file,"%s", mlid_hdr );
-              first_port = FALSE;
-            }
-            fprintf( file, " 0x%03X ", j+(position*16) );
-          }
-        }
-        position++;
-      }
-      if (first_port == FALSE)
-      {
-        fprintf( file, "\n" );
-      }
-    }
-    block_num++;
-  }
-
- Exit:
-  OSM_LOG_EXIT( p_mgr->p_log );
-}
-
-/**********************************************************************
- **********************************************************************/
-struct mcast_mgr_dump_context {
-	osm_mcast_mgr_t *p_mgr;
-	FILE *file;
-};
-
-static void
-mcast_mgr_dump_table(cl_map_item_t *p_map_item, void *context)
-{
-	osm_switch_t *p_sw = (osm_switch_t *)p_map_item;
-	struct mcast_mgr_dump_context *cxt = context;
-
-	mcast_mgr_dump_sw_routes(cxt->p_mgr, p_sw, cxt->file);
-}
-
-static void
-mcast_mgr_dump_mcast_routes(osm_mcast_mgr_t *p_mgr)
-{
-	char file_name[1024];
-	struct mcast_mgr_dump_context dump_context;
-	FILE  *file;
-
-	if (!osm_log_is_active(p_mgr->p_log, OSM_LOG_ROUTING))
-		return;
-
-	snprintf(file_name, sizeof(file_name), "%s/%s",
-		 p_mgr->p_subn->opt.dump_files_dir, "opensm.mcfdbs");
-
- 	file = fopen(file_name, "w");
-	if (!file) {
-		osm_log(p_mgr->p_log, OSM_LOG_ERROR,
-			"mcast_dump_mcast_routes: ERR 0A18: "
-			"cannot create mcfdb file \'%s\': %s\n",
-			file_name, strerror(errno));
-		return;
-	}
-
-	dump_context.p_mgr = p_mgr;
-	dump_context.file = file;
-
-	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
-			   mcast_mgr_dump_table, &dump_context);
-
-	fclose(file);
-}
-
-/**********************************************************************
  Process the entire group.
 
  NOTE : The lock should be held externally!
@@ -1510,7 +1378,7 @@ osm_mcast_mgr_process_mgrp(
     p_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item );
   }
 
-  mcast_mgr_dump_mcast_routes( p_mgr );
+  osm_dump_mcast_routes( p_mgr->p_subn->p_osm );
 
  Exit:
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -1580,8 +1448,6 @@ osm_mcast_mgr_process(
     p_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item );
   }
 
-  mcast_mgr_dump_mcast_routes( p_mgr );
-
   CL_PLOCK_RELEASE( p_mgr->p_lock );
 
   OSM_LOG_EXIT( p_mgr->p_log );
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index a15f3b4..a6d0e24 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -2749,6 +2749,7 @@ Idle:
                p_mgr->p_subn->need_update = 0;
 
                __osm_topology_file_create( p_mgr );
+               osm_dump_all(p_mgr->p_subn->p_osm);
                __osm_state_mgr_report( p_mgr );
                __osm_state_mgr_up_msg( p_mgr );
 
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index f049e74..cfe1a58 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -48,7 +48,7 @@
 #  include <config.h>
 #endif /* HAVE_CONFIG_H */
 
-#include <unistd.h>
+#include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <iba/ib_types.h>
@@ -62,8 +62,6 @@
 #include <opensm/osm_msgdef.h>
 #include <opensm/osm_opensm.h>
 
-#define LINE_LENGTH 256
-
 /**********************************************************************
  **********************************************************************/
 void
@@ -123,329 +121,6 @@ osm_ucast_mgr_init(
 }
 
 /**********************************************************************
- **********************************************************************/
-struct ucast_mgr_dump_context {
-	osm_ucast_mgr_t *p_mgr;
-	FILE *file;
-};
-
-static void
-ucast_mgr_dump(osm_ucast_mgr_t *p_mgr, FILE *file,
-	       void (*func)(cl_map_item_t *, void *))
-{
-	struct ucast_mgr_dump_context dump_context;
-
-	dump_context.p_mgr = p_mgr;
-	dump_context.file = file;
-
-	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, func, &dump_context);
-}
-
-void
-ucast_mgr_dump_to_file(osm_ucast_mgr_t *p_mgr, const char *file_name,
-		       void (*func)(cl_map_item_t *, void *))
-{
-	char path[1024];
-	FILE *file;
-
-	snprintf(path, sizeof(path), "%s/%s",
-		 p_mgr->p_subn->opt.dump_files_dir, file_name);
-
-	file = fopen(path, "w");
-	if (!file) {
-		osm_log( p_mgr->p_log, OSM_LOG_ERROR,
-			 "ucast_mgr_dump_to_file: ERR 3A12: "
-			 "Failed to open fdb file (%s)\n", path );
-		return;
-	}
-
-	ucast_mgr_dump(p_mgr, file, func);
-
-	fclose(file);
-}
-
-/**********************************************************************
- **********************************************************************/
-static void
-__osm_ucast_mgr_dump_path_distribution(
-  IN cl_map_item_t *p_map_item,
-  IN void *cxt)
-{
-  osm_node_t *p_node;
-  osm_node_t *p_remote_node;
-  uint8_t i;
-  uint8_t num_ports;
-  uint32_t num_paths;
-  ib_net64_t remote_guid_ho;
-  osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
-  osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
-
-  OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_path_distribution );
-
-  p_node = p_sw->p_node;
-  num_ports = p_sw->num_ports;
-
-  osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG,
-                  "__osm_ucast_mgr_dump_path_distribution: "
-                  "Switch 0x%" PRIx64 "\n"
-                  "Port : Path Count Through Port",
-                  cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
-
-  for( i = 0; i < num_ports; i++ )
-  {
-    num_paths = osm_switch_path_count_get( p_sw , i );
-    osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG,"\n %03u : %u", i, num_paths );
-    if( i == 0 )
-    {
-      osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (switch management port)" );
-      continue;
-    }
-
-    p_remote_node = osm_node_get_remote_node( p_node, i, NULL );
-    if( p_remote_node == NULL )
-      continue;
-
-    remote_guid_ho = cl_ntoh64( osm_node_get_node_guid( p_remote_node ) );
-
-    switch(  osm_node_get_remote_type( p_node, i ) )
-    {
-    case IB_NODE_TYPE_SWITCH:
-      osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to switch" );
-      break;
-    case IB_NODE_TYPE_ROUTER:
-      osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to router" );
-      break;
-    case IB_NODE_TYPE_CA:
-      osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to CA" );
-      break;
-    default:
-      osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to unknown node type" );
-      break;
-    }
-
-    osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")",
-                    remote_guid_ho );
-  }
-
-  osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, "\n" );
-
-  OSM_LOG_EXIT( p_mgr->p_log );
-}
-
-/**********************************************************************
- **********************************************************************/
-static void
-__osm_ucast_mgr_dump_ucast_routes(
-  IN cl_map_item_t *p_map_item,
-  IN void *cxt )
-{
-  const osm_node_t*        p_node;
-  osm_port_t *             p_port;
-  uint8_t                  port_num;
-  uint8_t                  num_hops;
-  uint8_t                  best_hops;
-  uint8_t                  best_port;
-  uint16_t                 max_lid_ho;
-  uint16_t                 lid_ho, base_lid;
-  boolean_t                direct_route_exists = FALSE;
-  osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
-  osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
-  FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
-
-  OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_ucast_routes );
-
-  p_node = p_sw->p_node;
-
-  max_lid_ho = p_sw->max_lid_ho;
-
-  fprintf( file, "__osm_ucast_mgr_dump_ucast_routes: "
-           "Switch 0x%016" PRIx64 "\n"
-           "LID    : Port : Hops : Optimal\n",
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
-  for( lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++ )
-  {
-    fprintf(file, "0x%04X : ", lid_ho);
-
-    p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid_ho);
-    if (!p_port)
-    {
-      fprintf( file, "UNREACHABLE\n" );
-      continue;
-    }
-
-    port_num = osm_switch_get_port_by_lid( p_sw, lid_ho );
-    if( port_num == OSM_NO_PATH )
-    {
-      /*
-        This may occur if there are 'holes' in the existing
-        LID assignments.  Running SM with --reassign_lids
-        will reassign and compress the LID range.  The
-        subnet should work fine either way.
-      */
-      fprintf( file, "UNREACHABLE\n" );
-      continue;
-    }
-    /*
-      Switches can lie about which port routes a given
-      lid due to a recent reconfiguration of the subnet.
-      Therefore, ensure that the hop count is better than
-      OSM_NO_PATH.
-    */
-    if( p_port->p_node->sw )
-    {
-      /* Target LID is switch.
-         Get its base lid and check hop count for this base LID only. */
-      base_lid = osm_node_get_base_lid(p_port->p_node, 0);
-      base_lid = cl_ntoh16(base_lid);
-      num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
-    }
-    else
-    {
-      /* Target LID is not switch (CA or router).
-         Check if we have route to this target from current switch. */
-      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
-      if (num_hops != OSM_NO_PATH)
-      {
-         direct_route_exists = TRUE;
-         base_lid = lid_ho;
-      }
-      else
-      {
-        osm_physp_t *p_physp = p_port->p_physp;
-
-        if( !p_physp || !p_physp->p_remote_physp ||
-            !p_physp->p_remote_physp->p_node->sw )
-          num_hops = OSM_NO_PATH;
-        else
-        {
-          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
-          base_lid = cl_ntoh16(base_lid);
-          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
-                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
-        }
-      }
-    }
-
-    if( num_hops == OSM_NO_PATH )
-    {
-      fprintf( file, "UNREACHABLE\n" );
-      continue;
-    }
-
-    best_hops = osm_switch_get_least_hops( p_sw, base_lid );
-    if (!p_port->p_node->sw && !direct_route_exists)
-    {
-      best_hops++;
-      num_hops++;
-    }
-
-    fprintf( file, "%03u  : %02u   : ", port_num, num_hops );
-
-    if( best_hops == num_hops )
-      fprintf( file, "yes" );
-    else
-    {
-      best_port = osm_switch_recommend_path(
-        p_sw, p_port, lid_ho, TRUE,
-        NULL, NULL, NULL, NULL ); /* No LMC Optimization */
-      fprintf( file, "No %u hop path possible via port %u!",
-               best_hops, best_port );
-    }
-
-    fprintf( file, "\n" );
-  }
-
-  OSM_LOG_EXIT( p_mgr->p_log );
-}
-
-/**********************************************************************
- **********************************************************************/
-static void
-ucast_mgr_dump_lid_matrix(cl_map_item_t *p_map_item, void *cxt)
-{
-	osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
-	osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
-	FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
-	osm_node_t *p_node = p_sw->p_node;
-	unsigned max_lid = p_sw->max_lid_ho;
-	unsigned max_port = p_sw->num_ports;
-	uint16_t lid;
-	uint8_t port;
-
-	fprintf(file, "Switch: guid 0x%016" PRIx64 "\n",
-		cl_ntoh64(osm_node_get_node_guid(p_node)));
-	for (lid = 1; lid <= max_lid; lid++) {
-		osm_port_t *p_port;
-		if (osm_switch_get_least_hops(p_sw, lid) == OSM_NO_PATH)
-			continue;
-		fprintf(file, "0x%04x:", lid);
-		for (port = 0 ; port < max_port ; port++)
-			fprintf(file, " %02x",
-				osm_switch_get_hop_count(p_sw, lid, port));
-		p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid);
-		if (p_port)
-			fprintf(file, " # portguid 0x%" PRIx64,
-				cl_ntoh64(osm_port_get_guid(p_port)));
-		fprintf(file, "\n");
-	}
-}
-
-/**********************************************************************
- **********************************************************************/
-void
-ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt)
-{
-	osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
-	osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
-	FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
-	osm_node_t *p_node = p_sw->p_node;
-	unsigned max_lid = p_sw->max_lid_ho;
-	unsigned max_port = p_sw->num_ports;
-	uint16_t lid;
-	uint8_t port;
-
-	fprintf(file, "Unicast lids [0x0-0x%x] of switch Lid %u guid 0x%016"
-		PRIx64 " (\'%s\'):\n",
-		max_lid, osm_node_get_base_lid(p_node, 0),
-		cl_ntoh64(osm_node_get_node_guid(p_node)),
-		p_node->print_desc);
-	for (lid = 0; lid <= max_lid; lid++) {
-		osm_port_t *p_port;
-		port = osm_switch_get_port_by_lid(p_sw, lid);
-
-		if (port >= max_port)
-			continue;
-
-		fprintf(file, "0x%04x %03u # ", lid, port);
-
-		p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid);
-		if (p_port) {
-			p_node = p_port->p_node;
-			fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'",
-				ib_get_node_type_str(osm_node_get_type(p_node)),
-				cl_ntoh64(osm_port_get_guid(p_port)),
-				p_node->print_desc);
-		}
-		else
-			fprintf(file, "unknown node and type");
-		fprintf(file, "\n");
-	}
-	fprintf(file, "%u lids dumped\n", max_lid);
-}
-
-/**********************************************************************
- **********************************************************************/
-static void __osm_ucast_mgr_dump_tables(osm_ucast_mgr_t *p_mgr)
-{
-  ucast_mgr_dump_to_file(p_mgr, "opensm-lid-matrix.dump",
-                         ucast_mgr_dump_lid_matrix);
-  ucast_mgr_dump_to_file(p_mgr, "opensm-lfts.dump", ucast_mgr_dump_lfts);
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
-    ucast_mgr_dump(p_mgr, NULL, __osm_ucast_mgr_dump_path_distribution);
-  ucast_mgr_dump_to_file(p_mgr, "opensm.fdbs", __osm_ucast_mgr_dump_ucast_routes);
-}
-
-/**********************************************************************
  Add each switch's own and neighbor LIDs to its LID matrix
 **********************************************************************/
 static void
@@ -1172,15 +847,6 @@ osm_ucast_mgr_process(
   else
      cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
 
-  /* dump fdb into file: */
-  if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
-  {
-     if ( !default_routing && p_routing_eng->ucast_dump_tables != 0 )
-        p_routing_eng->ucast_dump_tables(p_routing_eng->context);
-     else
-        __osm_ucast_mgr_dump_tables( p_mgr );
-  }
-
   if (p_mgr->any_change)
   {
     signal = OSM_SIGNAL_DONE_PENDING;
-- 
1.5.3.rc2.38.g11308


From sashak at voltaire.com  Thu Jul 26 18:07:07 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 04:07:07 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
References: <f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
Message-ID: <20070727010707.GR2472@sashak.voltaire.com>

On 09:25 Thu 26 Jul     , Eitan Zahavi wrote:
> > Hi Eitan, Hal,
> > 
> > On 20:44 Wed 25 Jul     , Eitan Zahavi wrote:
> > > 
> > > I am not following you.
> > > Why does a user need to run -y if a simple legal cable connector is 
> > > plugged?
> > 
> > Because the duplicated GUIDs detector can abort OpenSM when a
> > regular port is reconnected to another location during a hard
> > sweep. This issue is not related to the loopback plug at all.
> I  think we should handle the case of "migrated port" in a more global
> sense:
> If a port "moved" during the sweep we have to do a new sweep anyway.

Another option is just to use the most recently discovered port location.
In the case of a CA this could work; switch migration can be more complicated.

> Maybe we could delay the 'abort' to the second sweep.
>
> So practically I propose:
> 1. Add state flag "was duplicated" on the port saying it was reported as
> duplicate GUID.
> 2. Set the variable controlling a forced second sweep (similar to the
> one used if we got Set error)

We can even catch this before drop_manager and just rediscover.

> 3. Repeat the sweep - if we find a port where it is a duplicate and the
> "was duplicated" flag is set - abort.
>
> A refinement for the user who is doing many changes continuously might
> be to keep a counter.
> And have the abort happen after the Nth iteration.

It is a better approach than what we have today.
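
The proposal quoted above (a "was duplicated" flag plus an optional counter)
could be sketched roughly as below. This is only an illustration: the struct,
the enum and dup_guid_check() are hypothetical names and do not exist in
OpenSM, and the sketch assumes the caller already knows whether the current
sweep reported the port as a duplicate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port state; osm_port_t has no such field today. */
struct dup_guid_state {
	unsigned dup_sweeps;	/* consecutive sweeps reporting a duplicate */
};

enum dup_action { DUP_NONE, DUP_RESWEEP, DUP_ABORT };

/*
 * Called once per sweep for a port. max_sweeps is the "Nth iteration"
 * after which the duplicate is treated as real. max_sweeps == 2 matches
 * the plain "was duplicated" flag; max_sweeps == 1 aborts immediately.
 */
static enum dup_action dup_guid_check(struct dup_guid_state *st,
				      bool dup_seen, unsigned max_sweeps)
{
	assert(max_sweeps >= 1);

	if (!dup_seen) {
		st->dup_sweeps = 0;	/* port moved and settled: forget it */
		return DUP_NONE;
	}
	if (++st->dup_sweeps >= max_sweeps)
		return DUP_ABORT;	/* still duplicated after N sweeps */
	return DUP_RESWEEP;		/* force another sweep first */
}
```

With max_sweeps set to 2, a port that merely moved during the sweep triggers
one extra sweep and is then cleared, while a real duplicate is still caught on
the second pass.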

> > 
> > > The issue is only if a "loop back" plug connecting a port 
> > to itself is 
> > > plugged.
> > 
> > No, not only. Now there are two completely separate known 
> > issues with duplicated GUIDs detector:
> > 
> > 1. Port moving
> > 2. Loopback plug
> > 
> > And I think that _both_ should be solved. And if just using 
> > '-y' could be suitable for (2) because it is esoteric 
> > (although perfectly legal) use, it is not an acceptable solution for (1).
> > 
> > I think we need to improve GUIDs duplication detector 
> > instead. For example we could add NodeInfo comparison there, 
> > and only in case if it is different drop GUIDs duplication 
> > error. Also I think this should not be fatal error and should 
> > not abort OpenSM, just logging (probably via syslog too) 
> > should be sufficient - non-working port is good reason to 
> > look at logs. Other ideas?
> The problem is that the SM will sort of figure out the network but will
> create a completely bogus routing etc.

Right. But it is not so with back-to-back (when loopback plug could be
interpreted as back-to-back duplicated GUID). So no need to abort in
this (back-to-back/loopback) case. Agreed?

Sasha

> 
> > 
> > Sasha
> > 
> > > Do users use these plugs? For what sake?
> > > 
> > > 
> > > Eitan Zahavi
> > > Senior Engineering Director, Software Architect Mellanox 
> > Technologies 
> > > LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > 
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > > Sent: Wednesday, July 25, 2007 3:19 AM
> > > > To: Eitan Zahavi
> > > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> > > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> > > > 
> > > > On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> > > > > 
> > > > > 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> > > > > 
> > > > > 		Maybe  avoid the log if -y is provided?
> > > > > 
> > > > > 	 
> > > > > 	That avoids the spew but the duplicated GUID is
> > > > important to know so
> > > > > IMO something in the "middle" is needed where 
> > duplicated GUIDs are 
> > > > > logged but not continually the same ones.
> > > > > 	[EZ]  
> > > > > 	OK so in -y mode only we track which ones were reported
> > > > and do not
> > > > > repeat the log?
> > > > 
> > > > And how port moving problem should be solved?
> > > > 
> > > > We cannot ask an user to run OpenSM with '-y' if in 
> > her/his plans to 
> > > > reconnect some ports in a future and just decrease logging.
> > > > 
> > > > Sasha
> > > > 
> > 


From mshefty at ichips.intel.com  Thu Jul 26 18:11:51 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 18:11:51 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A94657.1020101@ichips.intel.com>

> 2. Architecture ----------------

This is a higher level approach to the problem, but I came up with the
following QoS relationship hierarchy, where '->' means 'maps to'.

Application Service -> Service ID (or range)
Service ID -> desired QoS
QoS, SGID, DGID, PKey -> SGID, DGID, TClass, FlowLabel, PKey
SGID, DGID, TC, FL, PKey -> SLID, DLID, SL (set if crossing subnets)
SLID, DLID, SL -> MTU, Rate, VL, PacketLifeTime
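
As a rough illustration of the first levels of this hierarchy, the mappings
could be held in small lookup tables. Everything below is a hypothetical
sketch: the struct and function names are made up, and for brevity the
QoS -> TClass/FlowLabel/SL step ignores the SGID, DGID and PKey inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Service ID (or range) -> desired QoS level. */
struct qos_sid_rule {
	uint64_t sid_first;
	uint64_t sid_last;	/* inclusive */
	unsigned qos_level;
};

/* QoS level -> global (TClass, FlowLabel) and local (SL) attributes. */
struct qos_path_attrs {
	uint8_t  tclass;
	uint32_t flow_label;	/* only the low 20 bits are meaningful */
	uint8_t  sl;
};

/* Returns the QoS level of the first matching rule, or dflt if none. */
static unsigned qos_level_for_sid(const struct qos_sid_rule *rules, size_t n,
				  uint64_t sid, unsigned dflt)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(rules[i].sid_first <= rules[i].sid_last);
		if (sid >= rules[i].sid_first && sid <= rules[i].sid_last)
			return rules[i].qos_level;
	}
	return dflt;
}

/* Returns the attributes for a QoS level, or NULL if it is not defined. */
static const struct qos_path_attrs *
qos_attrs_for_level(const struct qos_path_attrs *tbl, size_t n, unsigned level)
{
	return level < n ? &tbl[level] : NULL;
}
```

Tables of this shape are small enough to be distributed to hosts, which is
what would make local SA style operation possible.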

I use these relationships below:

> 4. IPoIB ---------
> 
> IPoIB already queries the SA for its broadcast group information. The 
> additional functionality required is for IPoIB to provide the
> broadcast group SL, MTU, and RATE in every following PathRecord query
> performed when a new UDAV is needed by IPoIB. We could assign a
> special Service-ID for IPoIB use but since all communication on the
> same IPoIB interface shares the same QoS-Level without the ability to
>  differentiate it by target service we can ignore it for simplicity.

Rather than IPoIB specifying SL, MTU, and rate with PR queries, it 
should specify TClass and FlowLabel.  This is necessary for IPoIB to 
span IB subnets.

> 5. CMA features ----------------
> 
> The CMA interface supports Service-ID through the notion of port
> space as a prefix to the port_num which is part of the sockaddr
> provided to rdma_resolve_addr(). What is missing is the explicit
> request for a QoS-Class that should allow the ULP (like SDP) to
> propagate a specific request for a class of service. A mechanism for
> providing the QoS-Class is available in the IPv6 address, so we could
> use that address field. Another option is to implement a special 
> connection options API for CMA.
> 
> Missing functionality by CMA is the usage of the provided QoS-Class
> and Service-ID in the sent PR/MPR. When a response is obtained it is
> an existing requirement for the CMA to use the PR/MPR from the
> response in setting up the QP address vector.

I think the RDMA CM needs two solutions, depending on which address 
family is used.  For IPv6, the existing interface is sufficient, and 
works for both IB and iWarp.  The RDMA CM only needs to include the TC 
and FL as part of its PR query.  For IPv4, to remain transport neutral, 
I think we should add an rdma_set_option() routine to specify the QoS 
field.  The RDMA CM would include the QoS field for PR query under this 
condition.

For IB, this requires changes to the ib_sa to support the new PR 
extensions.  I don't think we gain anything having the RDMA CM include 
service IDs as part of the query.

> 6. SDP -------
> 
> SDP uses CMA for building its connections. The Service-ID for SDP is
> 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote
> TCP/IP Port Number to connect to. SDP might be provided with
> SO_PRIORITY socket option. In that case the value provided should be
> sent to the CMA as the TClass option of that connection.

SDP would specify the QoS through the IPv6 address or the
rdma_set_option() routine.
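
For reference, the SDP Service-ID layout quoted above (0x000000000001PPPP)
works out as in the sketch below. The helper names are hypothetical and the
values are in host byte order; conversion to network order is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed upper bits of an SDP Service-ID: 0x000000000001PPPP. */
#define SDP_SID_PREFIX 0x0000000000010000ULL

/* Build the SDP Service-ID for a remote TCP/IP port number. */
static uint64_t sdp_service_id(uint16_t tcp_port)
{
	return SDP_SID_PREFIX | tcp_port;
}

/* Recover the TCP/IP port number from an SDP Service-ID. */
static uint16_t sdp_port_from_sid(uint64_t sid)
{
	/* Everything above the low 16 bits must be the SDP prefix. */
	assert((sid & ~0xFFFFULL) == SDP_SID_PREFIX);
	return (uint16_t)(sid & 0xFFFFULL);
}
```

For example, TCP port 80 (0x0050) maps to Service-ID 0x0000000000010050.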

> 7. SRP -------
> 
> Current SRP implementation uses its own CM callbacks (not CMA). So
> SRP should fill in the Service-ID in the PR/MPR by itself and use
> that information in setting up the QP. The T10 SRP standard defines
> the SRP Service-ID to be defined by the SRP target I/O Controller
> (but they should also comply with IBTA Service- ID rules). Anyway,
> the Service-ID is reported by the I/O Controller in the 
> ServiceEntries DMA attribute and should be used in the PR/MPR if the
> SA reports its ability to handle QoS PR/MPRs.

I agree.

> 8. iSER -------- iSER uses CMA and thus should be very close to SDP.
> The Service-ID for iSER should be TBD.

See RDMA CM and SDP.

> 3.2. PR/MPR query handling: OpenSM should be able to enforce the
> provided policy on client request. The overall flow for such requests
> is: first the request is matched against the defined match rules such
> that the target QoS-Level definition is found. Given the QoS-Level a
> path(s) search is performed with the given restrictions imposed by
> that level. The following two sections describe these steps.

If we use the QoS hierarchy outlined above, I think we can construct 
some fairly simple tables to guide our PR selection.  The SA may need to 
construct the tables starting at the bottom and working up, but I 
*think* it could be done.  And by distributing the tables, we can 
support a more distributed (a la local SA) operation.

 From an administration point, I would be happier seeing something where 
the administrator defines a QoS level in terms of latency or bandwidth 
requirements and relative priority.  Then, if desired, the administrator 
could provide more details, such as indicating which nodes would use 
which services, minimum required MTUs, etc.  It would then be up to the 
SA to map these requirements to specific TC, FL, SL, VL values.

In general, though, I'm personally far less concerned with the QoS 
specification interface to the SA, versus the operation that takes place 
on the hosts.

Comments on using this approach on the host side?

- Sean


From sashak at voltaire.com  Thu Jul 26 19:59:52 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 05:59:52 +0300
Subject: [ofa-general] Lost in-service traps during Open SM migration
In-Reply-To: <ac71172a0707261237wb833b1bq66c64ca39fb3c321@mail.gmail.com>
References: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
	<20070725220204.GI31582@sashak.voltaire.com>
	<ac71172a0707261237wb833b1bq66c64ca39fb3c321@mail.gmail.com>
Message-ID: <20070727025952.GE6691@sashak.voltaire.com>

On 12:37 Thu 26 Jul     , lbt wrote:
>  Thanks for the suggestion Sasha!
> 
>  Our host stack does receive a reregistration notice and does resubscribe
>  all handlers at
>  that point in time. At the time of the SM migration, our stack prints out
>  some informational messages to
>  confirm this:
>  Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred
>  on port 1
>  Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8
> 
>  And also confirmed in the SM logs that after the migration, the higher
>  priority SM is getting a subscription request for in-service trap:
>  Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method:
>  Subscribe Request with QPN: 0x000001
>  Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
>  Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
>  Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
>                                 gid.....................0x0000000000000000 :
>  0x0000000000000000
>                                 lid_range_begin.........0xFFFF
>                                 lid_range_end...........0x0
>                                 is_generic..............0x1
>                                 subscribe...............0x0
>                                 trap_type...............0x3
>                                 trap_num................64
>                                 qpn.....................0x000001
>                                 resp_time_val...........0x0
>                                 node_type...............0x000004
>  Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]
> 
>  It may be a problem if the resubscription of the in-service handler occurs
>  after the in-service notice was forwarded, but I think the problem is that
>  there is never a notice that is forwarded for the higher priority SM port
>  that is restored.

And after OpenSM migration, did you receive in-service notices for
other ports? Does the problem happen only at migration time?

>  Perhaps, neither SM (the lower priority and higher
>  priority one), generates an in-service trap because of the timing  gap
>  between when the restored port is detected and "marked" (i.e. added to
>  new_ports_list) and when in-service traps are generated for new ports.
>  During SM migration, the lower priority SM detects the new port, but the
>  higher priority SM does the trap generation (but it doesn't realize that
>  its own port is a new port and thus doesn't generate a trap for it).
> 
>  Our host stack executes some functions when a port is restored  (in our
>  in-service subscription handler).
>  Am I not supposed to receive an in-service trap for a restored port that
>  happens to be the Master SM,

Yes, I guess you are.

>  and instead  execute these actions with a
>  client reregistration event?

A client reregistration request is not suitable here: the SM can ask for
client reregistration at any time (in practice OpenSM currently does it only
when it enters the MASTER state, but even that is optional).
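
The timing gap described in this thread can be shown with a toy model. This is
not OpenSM code: struct sm_model and the sm_*() helpers are hypothetical and
only mimic the behaviour reported in the logs (new_ports_list is filled during
discovery, traps are reported by the master after LID assignment).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NEW_PORTS 8

/* Toy stand-in for one SM instance and its new_ports_list. */
struct sm_model {
	bool is_master;
	uint64_t new_ports[MAX_NEW_PORTS];
	size_t n_new_ports;
};

/* Discovery: the SM doing the sweep records the newly seen port. */
static void sm_discover_port(struct sm_model *sm, uint64_t guid)
{
	assert(sm->n_new_ports < MAX_NEW_PORTS);
	sm->new_ports[sm->n_new_ports++] = guid;
}

/* Handover: mastership moves, but the list stays with the old master. */
static void sm_handover(struct sm_model *from, struct sm_model *to)
{
	from->is_master = false;
	to->is_master = true;
}

/*
 * After LID assignment only the master reports new ports.
 * Returns the number of in-service traps that would be generated.
 */
static size_t sm_report_new_ports(struct sm_model *sm)
{
	size_t sent;

	if (!sm->is_master)
		return 0;
	sent = sm->n_new_ports;
	sm->n_new_ports = 0;
	return sent;
}
```

If the lower priority SM discovers the port and then hands over before its own
LID assignment, the new master reports from an empty list and the old master
never reports at all, so the in-service trap is lost.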

Sasha

> 
>  Thanks again for your help!
>  Lan
> 
> 
> 
>  On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >
> > Hi Lan,
> >
> > On 09:57 Wed 25 Jul     , lbt wrote:
> > >  Hello,
> > >
> > >  I have been seeing a problem where a subscriber for in-service traps is
> > not
> > >  getting informed when the port of master openSM is restored (i.e.
> > causing an
> > >  SM migration).
> > >
> > >  I have an IB subnet with 2 nodes running OpenSM , different priorities
> > of
> > >  course (OpenSM Rev:openib-2.0.5). I also have another node on the
> > subnet
> > >  that has subscribed for the forwarding of any
> > IB_SA_GENERIC_TRAP_NUM_IN_SVC
> > >  trap events. I've been doing cable pull tests on the IB ports, to check
> > if
> > >  the in-service handler I have subscribed gets invoked when I restore
> > the
> > >  cable. I've noticed that everything works as expected ( i.e. my
> > in-service
> > >  handler is invoked) whenever I restore the cable on the lower priority
> > SM IB
> > >  port without ever touching the master SM port. But if I cause an SM
> > >  migration, by restoring the port of the higher priority SM, the
> > in-service
> > >  trap does not get generated as expected on a cable restore.
> > >
> > >  Steps to Reproduce:
> > >  1) Start with port to higher priority SM disconnected.
> > >  2) restore port cable on the higher priority SM
> > >  --> This causes an SM Migration as expected, SM's migration happens
> > okay
> > >  --> I expected the restoration of the higher priority SM port to also
> > >  trigger an in-service trap as well and notify subscribers, but it
> > doesn't
> > >  occur
> > >
> > >  I have collected debug messages log for both open SM's, and it appears
> > that
> > >  the reason is because:
> > >  1) in-service traps are generated based on what ports are added on the
> > >  Master SM's new_ports_list, but these traps are generated only after
> > LID
> > >  assignment
> > >  2) when the higher priority SM port is restored, the restored port gets
> > >  added to the lower priority SM's new_ports_list (since it's still the
> > Master
> > >  SM at that point in time)
> > >  3) the handover of Master  SM  from lower priority to higher priority
> > SM
> > >  occurs (before LID assignment and thus a chance for traps get generated
> > for
> > >  those ports on new_ports_list)
> > >  4) the higher priority SM is now Master SM, but it has an empty
> > >  new_ports_list, so no trap generated either
> > >
> > >  Does this look like a legitimate Open SM bug? Any feedback would be
> > much
> > >  appreciated, and if I can help further in any way please let me know .
> >
> > As far as I know when OpenSM (even old like 2.0.5) becomes master it
> > requests client to reregister SA related stuff (by setting this bit in
> > PortInfo).
> >
> > Probably your port doesn't support this (you could verify by checking
> > PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
> > maybe your host stack doesn't do reregistration?
> >
> > Anyway, you could trace this in the OpenSM code, in osm_lid_mgr.c
> > __osm_lid_mgr_set_physp_pi(), to see whether the client reregistration bit
> > is set (with ib_port_info_set_client_rereg()) or not. Then we will know more
> > about this problem.
> >
> > Sasha
> >
> > >
> > >
> > >  Subset of logs from lower priority SM during the cable restore of
> > higher
> > >  priority SM port:
> > >  ### Jul 18 14:31:56 614522 [41401960] ->
> > __osm_trap_rcv_process_request:
> > >  Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A
> > >  TID:0x00000016000012e1
> > >  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process:
> > Received
> > >  signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> > >  ### 14:31:56 ******************** INITIATING HEAVY SWEEP
> > >  **********************
> > >  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process:
> > Received
> > >  signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> > >  OSM_SM_STATE_SWEEP_HEAVY_SELF
> > >  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding
> > port
> > >  GUID:0x00504501483e0000 to new_ports_list
> > >  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received
> > signal
> > >  OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > >  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received
> > signal
> > >  OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> > OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > >  14:31:56 ********************* HEAVY SWEEP COMPLETE
> > ***********************
> > >  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received
> > >  signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER###
> > >  14:31:56 ******************** ENTERING SM STANDBY STATE
> > *******************
> > >
> > >  Subset of logs from higher priority SM during the cable restore of
> > higher
> > >  priority SM port:
> > >
> > >  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> > >  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received
> > >  signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
> > >  IB_SMINFO_STATE_DISCOVERING
> > >  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> > >  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
> > >  ******************** ENTERING SM MASTER STATE ********************
> > >  Jul 18 14:32:03 009014 [41401960] ->
> > __osm_state_mgr_set_sm_lid_done_msg:
> > >  **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> > >  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
> > >  ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> > >  Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports:
> > [
> > >  ----> no in-service traps are generated and notices forwarded because
> > there
> > >  are no ports on this list
> > >  Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports:
> > ]
> > >
> > >
> > >  Thanks!
> > >  Lan
> >
> >


From kliteyn at mellanox.co.il  Thu Jul 26 21:42:38 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 27 Jul 2007 07:42:38 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-27:normal completion
Message-ID: <MTLEXCH01dDAO2Rpy9t000009b2@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Tue_Jul_24_09:41:51_2007 [feadbcd8281f007f092bbd95a2e078cac5a8a0aa]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=560  Fail=0
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:


From tamura at osrg.net  Thu Jul 26 21:45:04 2007
From: tamura at osrg.net (Yoshiaki Tamura)
Date: Fri, 27 Jul 2007 13:45:04 +0900
Subject: [ofa-general] OFED-1.2 on x86 debian
Message-ID: <46A97850.2030607@osrg.net>

Hi.

I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
Although build_env.sh seems to work on debian,
it fails compiling both kernel modules and user land tools by rpmbuild.

Is OFED-1.2 tested on debian or totally unsupported?

Thanks,

Yoshi Tamura


From eitan at mellanox.co.il  Thu Jul 26 21:56:54 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 27 Jul 2007 07:56:54 +0300
Subject: [ofa-general] RE: pkey.sim.tcl
In-Reply-To: <20070726224133.GC2472@sashak.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
	<20070722174048.GO27878@sashak.voltaire.com>
	<f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
	<20070726224133.GC2472@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com>

> 
> On 09:26 Thu 26 Jul     , Eitan Zahavi wrote:
> > 
> > I am happy you actually use the simulator.
> > Please provide more info regarding the failure. You should tar 
> > compress the /tmp/ibmgtsim.XXXX of your run.
> 
> I can send this for you if you want, but the failure is trivial.
No need if you already know where the bug is...
> 
> Yes, and it is due (6), where default Pkey is removed 
> "externally". I'm not sure that OpenSM should handle the case 
> when pkey table is modified externally by something which is not SM.
> 

For a few years it just worked fine. So I wonder why this functionality
was removed?
It is a really BAD case when Pkeys are altered, but I think it would be wise
to "refresh" these tables on heavy sweep.

In general it seems OpenSM has lost its "heavy sweep" concept. Now it
does not refresh the fabric setup even on heavy sweep.
This assumes "perfect" HW and software, and I really think we
should have preserved that capability.
Note that a "heavy sweep" does not happen unless something changed or
a trap was received.

Eitan



From philippe.gregoire at cea.fr  Fri Jul 27 00:32:44 2007
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Fri, 27 Jul 2007 09:32:44 +0200
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A99F9C.5040303@cea.fr>

Hi Yevgeny,

Yevgeny Kliteynik wrote:
> Hi All
>
> Please find the attached RFC describing how QoS policy support could 
> be implemented in the OpenFabrics stack.
> Your comments are welcome.
>
> -- Yevgeny
>
>               RFC: OpenFabrics Enhancements for QoS Support
>              ===============================================
>
> Authors: . Eitan Zahavi <eitan at mellanox.co.il>
> Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
> Date: .... Jul 2007.
> Revision:  0.2
>
> Table of contents:
> 1. Overview
> 2. Architecture
> 3. Supported Policy
> 4. IPoIB functionality
> 5. CMA functionality
> 6. SDP functionality
> 7. SRP functionality
> 8. iSER functionality
> 9. OpenSM functionality
>
> 1. Overview
> ------------
> Quality of Service requirements stem from the realization of I/O
> consolidation over an IB network: as multiple applications and ULPs share
> the same fabric, a means to control their use of the network resources is
> becoming a must. The basic need is to differentiate the service levels
> provided to different traffic flows, so that a policy can be enforced to
> control each flow's utilization of the fabric resources.
>
> The IBTA specification defines several hardware features and management
> interfaces to support QoS:
> * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
> * Arbitration between traffic of different VLs is performed by a weighted
>   round robin arbiter with 2 priority levels. The arbiter is programmable
>   with a sequence of (VL, weight) pairs and a maximal number of high
>   priority credits to be processed before low priority is served
> * Packets carry a class of service marking in the range 0 to 15 in their
>   header SL field
> * Each switch can map an incoming packet by its SL to a particular output
>   VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> * The Subnet Administrator controls each communication flow's parameters
>   by providing them as a response to Path Record (PR) or MultiPathRecord
>   (MPR) queries
>
> The IB QoS features provide the means to implement a DiffServ-like
> architecture. The DiffServ architecture (IETF RFC 2474 and RFC 2475) is
> widely used today in highly dynamic fabrics.
>
> This proposal provides the detailed functional definition for the various
> software elements that are required to enable a DiffServ-like architecture
> over the OpenFabrics software stack.
>
>
>
> 2. Architecture
> ----------------
> This proposal splits the QoS functionality between the SM/SA, CMA and the
> various ULPs. We take the "chronology approach" to describe how the overall
> system works:
>
> 2.1. The network manager (human) provides a set of rules (policy) that
> defines how the network is being configured and how its resources are split
> between different QoS-Levels. The policy also defines how to decide which
> QoS-Level each application, ULP or service uses.
>
> 2.2. The SM analyzes the provided policy to see if it is realizable and
> performs the necessary fabric setup. The SM may continuously monitor the
> policy and adapt to changes in it. Part of this policy defines the default
> QoS-Level of each partition. The SA is being enhanced to match the
> requested Source, Destination, QoS-Class, Service-ID (and optionally SL and
> priority) against the policy, so clients (ULPs, programs) can obtain a
> policy-enforced QoS. The SM is also enhanced to support setting up
> partitions with an appropriate IPoIB broadcast group. This broadcast group
> carries its QoS attributes: SL, MTU and RATE.
>
> 2.3. IPoIB is being set up. IPoIB uses the SL, MTU and RATE available on
> the multicast group which forms the broadcast group of this partition.
>
> 2.4. MPI, which provides non-IB-based connection management, should be
> configured to run using hard-coded SLs. It uses these SLs for every QP
> being opened.
>
> 2.5. ULPs that use the CM interface (like SRP) should have their own
> pre-assigned Service-ID and use it while obtaining PR/MPR for establishing
> connections. The SA receiving the PR/MPR should match it against the policy
> and return the appropriate PR/MPR including SL, MTU and RATE.
>
> 2.6. ULPs and programs using CMA to establish an RC connection should
> provide the CMA with the target IP and Service-ID. Some of the ULPs might
> also provide a QoS-Class (e.g. for SDP sockets that are provided the TOS
> socket option). The CMA should then use the provided Service-ID and
> optional QoS-Class and pass them in the PR/MPR request. The resulting
> PR/MPR should be used for configuring the connection QP.
>
> PathRecord and MultiPathRecord enhancement for QoS:
> As mentioned above, the PathRecord and MultiPathRecord attributes should be
> enhanced to carry the Service-ID (a 64-bit value), which has been
> standardized by the IBTA. A new field QoS-Class is also provided.
> A new capability bit should describe the SM QoS support in the SA class
> port info. This approach provides an easy migration path for existing
> access layer and ULPs by not introducing a new set of PR/MPR attributes.
>
>
> 3. Supported Policy
> --------------------
>
> The QoS policy supported by this proposal is divided into 4 subsections:
>
> I) Port Group: a set of CAs, Routers or Switches that share the same
> settings. A port group might be a partition defined by the partition
> manager policy in terms of GUIDs. Future implementations might provide
> support for NodeDescription-based definition of port groups.
>
> II) Fabric Setup:
> Defines how the SL2VL and VLArb tables should be set up. This policy
> definition assumes the computation of the overall end-to-end network
> behavior should be performed outside of OpenSM.
>
> III) QoS-Levels Definition:
> This section defines the possible sets of parameters for QoS that a client
> might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
> Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
>
> IV) Matching Rules:
> A list of rules that match an incoming PR/MPR request to a QoS-Level. The
> rules are processed in order such that the first match is applied. Each
> rule is built out of a set of match expressions which should all match for
> the rule to apply. The matching expressions are defined for the following
> fields:
> ** SRC and DST to lists of port groups
> ** Service-ID to a list of Service-IDs or Service-ID ranges
> ** QoS-Class to a list of QoS-Class values or ranges
>
> QoS Policy file syntax
>
> * Leading and trailing blanks, as well as empty lines, are ignored, so the
>   indentation in the example is just for better readability
> * Comments are started with the pound sign (#) and terminated by EOL
> * Comments may appear only on a separate line
> * Keywords that denote section/subsection start have matching closing
>   keywords
> * Any keyword should be the first non-blank in the line
>
> QoS Policy file example
>
>     # Port Groups define sets of ports to be used later in the settings
>     port-groups
>         # using port GUIDs
>         port-group
>             name: Storage
>             # "use" is just a description that is used for logging.
>             #  Other than that, it is just a commentary
>             use: our SRP storage targets
>             port-guid: 0x1000000000000001
>             port-guid: 0x1000000000000002
>         end-port-group
>
>         port-group
>             name: Virtual Servers
>             use: node desc and IB port num
>             # The syntax of the port name is as follows:
>             # "hostname/CA-num/Pnum".
>             # "hostname" and "CA-num" are compared to the first 2 words of
>             # NodeDescription, and "Pnum" is a port number on that node.
>             port-name: vs1/HCA-1/P1
>             port-name: vs3/HCA-1/P1
>             port-name: vs3/HCA-2/P2
>         end-port-group
>
For clusters, I would like to have a syntax a la SLURM or RMS that
understands node ranges:
port-name: vs[1-20,30-50]/HCA-1/P1
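
A minimal sketch of how such a node-range expression could be expanded into
plain port names before the policy is applied. It assumes a single
SLURM-style bracket list per name; `expand_port_names` is a hypothetical
helper for illustration, not an existing OpenSM function.

```python
import re

def expand_port_names(pattern):
    """Expand one bracketed range list, e.g. 'vs[1-3,7]/HCA-1/P1'."""
    m = re.fullmatch(r"([^\[\]]*)\[([0-9,\-]+)\]([^\[\]]*)", pattern)
    if m is None:
        return [pattern]          # no bracket expression: a single port name
    prefix, spec, suffix = m.groups()
    names = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo             # a lone "7" is treated as the range 7-7
        for n in range(int(lo), int(hi) + 1):
            names.append(f"{prefix}{n}{suffix}")
    return names

print(expand_port_names("vs[1-3,7]/HCA-1/P1"))
```

Each expanded name could then be handled exactly like an explicit
"port-name:" line, so the rest of the parser would not need to change.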
>         # using partitions defined in the partition policy
>         port-group
>             name: Group for Partition 1
>             use: default settings
>             partition: Part1
>         end-port-group
>
>         # using node types CA|ROUTER|SWITCH
>         port-group
>             name: Routers
>             use: all routers
>             node-type: ROUTER
>         end-port-group
>
>     end-port-groups
>
>     qos-setup
>
>         # define all types of VLArb tables. The length of the tables should
>         # match the physically supported tables by their target ports
>         vlarb-tables
>             # scope defines the exact ports the VLArb tables apply to
>             vlarb-scope
>                 # defining VLArb tables on all the ports that belong to
>                 # port group 'Storage', and on all the ports connected
>                 # to ports of port group 'Storage'
>                 group: Storage
>                 # "across" means all the ports that are connected to ports
>                 # that belong to the specified port group
>                 across: Storage
>                 # VLArb table holds VL and weight pairs
>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                 vl-high-limit: 10
>             end-vlarb-scope
>             # There can be several scopes
>         end-vlarb-tables
>
>         sl2vl-tables
>             # Scope defines the exact devices and in/out ports tables
>             # apply to.
>             # Note: if the same port is matching several rules the *FIRST*
>             # one applies.
>             sl2vl-scope
>                 # SL2VL tables are organized as SL2VL(in-port,out-port)
>                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                 #
>                 # The following example specifies that all the SL2VL tables
>                 # entries should be defined for all the ports of group Part1:
>                 group: Part1
>                 from: *
>                 to: *
>                 # SL2VL table has to have 16 values at max - one for each SL.
>                 # If the user specifies fewer than 16 values, all the missing
>                 # VL values will be implicitly set to 0
>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>             end-sl2vl-scope
>
>             sl2vl-scope
>                 # "across-to" is a combination of "across" keyword
>                 # (definition can be found in VLArb tables section) and
>                 # "to" keyword.
>                 # "across: PortGroupName" refers to all the ports that are
>                 # connected to ports that belong to PortGroupName.
>                 #
>                 # Example of "across-to" usage:
>                 #   A user has a set of 'special' nodes (e.g. storage
>                 #   nodes), and all the traffic to these nodes has to get a
>                 #   specific VL. The solution is to define a port group (e.g.
>                 #   "Storage") that will include all the ports of these
>                 #   nodes, and then to configure SL2VL tables on all the
>                 #   switch ports that are connected to the Storage port group
>                 #   by specifying "across-to: Storage".
>                 #
>                 across-to: Storage2
>                 # Similar to "across-to", "across-from" is a combination of
>                 # "across" and "from" keywords
>                 across-from: Storage1
>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>             end-sl2vl-scope
>         end-sl2vl-tables
>
>     end-qos-setup
>
>
>     qos-levels
>
>         # the first one is just setting SL
>         qos-level
>             use: for the lowest priority communication
>             sl: 15
>             packet-life: 16
>         end-qos-level
>         # the second sets SL and QoS Class
>         qos-level
>             use: low latency best bandwidth
>             sl: 0
>         end-qos-level
>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
>         qos-level
>             use: just an example
>             sl: 0
>             mtu-limit: 1
>             rate-limit: 1
>             packet-life: 12
>             # Path Bits can be used e.g. to provide different routes through
>             # the subnet to a particular port
>             path-bits: 2,4,8-32
>         end-qos-level
>
>     end-qos-levels
>
>
>     # Match rules are scanned in a first-fit manner (like firewall rules
>     # table)
>     qos-match-rules
>
>         # matching by single criteria: class (list of values and ranges)
>         qos-match-rule
>             # just a description
>             use: low latency by class 7-9 or 11
>             qos-class: 7-9,11
>             # number of qos-level to apply to the matching PR/MPR
>             qos-level-sn: 1
>         end-qos-match-rule
>         # show matching by destination group AND service-ids
>         qos-match-rule
>             use: Storage targets connection
>             destination: Storage
>             service-id: 22,4719-5000
>             qos-level-sn: 2
>         end-qos-match-rule
>         # show matching by source group only
>         qos-match-rule
>             use: bla bla
>             source: Storage
>             qos-level-sn: 3
>         end-qos-match-rule
>
>     end-qos-match-rules
>
>
> 4. IPoIB
> ---------
>
> IPoIB already queries the SA for its broadcast group information. The
> additional functionality required is for IPoIB to provide the broadcast
> group SL, MTU, and RATE in every following PathRecord query performed when
> a new UDAV is needed by IPoIB.
> We could assign a special Service-ID for IPoIB use, but since all
> communication on the same IPoIB interface shares the same QoS-Level without
> the ability to differentiate it by target service, we can ignore it for
> simplicity.
>
> 5. CMA features
> ----------------
>
> The CMA interface supports Service-ID through the notion of port space as a
> prefix to the port_num which is part of the sockaddr provided to
> rdma_resolve_addr(). What is missing is the explicit request for a
> QoS-Class that should allow the ULP (like SDP) to propagate a specific
> request for a class of service. A mechanism for providing the QoS-Class is
> available in the IPv6 address, so we could use that address field. Another
> option is to implement a special connection options API for CMA.
>
> Missing functionality by CMA is the usage of the provided QoS-Class and
> Service-ID in the sent PR/MPR. When a response is obtained it is an
> existing requirement for the CMA to use the PR/MPR from the response in
> setting up the QP address vector.
>
>
> 6. SDP
> -------
>
> SDP uses CMA for building its connections.
> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> holding the remote TCP/IP Port Number to connect to.
> SDP might be provided with the SO_PRIORITY socket option. In that case the
> value provided should be sent to the CMA as the TClass option of that
> connection.
>
This requires modifications to applications and does not allow a global
definition of QoS for all SDP applications in the fabric.
This is inconsistent with libsdp, which is provided to migrate TCP/IP
applications to SDP transparently.
If the matching rules allowed some kind of bitmask pattern matching, we
could define something like:
        qos-match-rule
            use: all SDP applications
            service-id: 0x000000000001????
            qos-level-sn: 2
        end-qos-match-rule
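
A minimal sketch of how such a wildcard could be evaluated, assuming each
'?' stands for one "don't care" hex digit. The '?' syntax is only the
suggestion made above, and `parse_wildcard_id` and `service_id_matches` are
hypothetical names used for illustration.

```python
def parse_wildcard_id(pattern):
    """Turn '0x000000000001????' into a (value, mask) pair of integers."""
    digits = pattern[2:] if pattern.lower().startswith("0x") else pattern
    value = mask = 0
    for ch in digits:
        value <<= 4
        mask <<= 4
        if ch != "?":             # a fixed hex digit: must match exactly
            value |= int(ch, 16)
            mask |= 0xF
    return value, mask

def service_id_matches(service_id, pattern):
    value, mask = parse_wildcard_id(pattern)
    return (service_id & mask) == value

SDP_PATTERN = "0x000000000001????"
sdp_port_80 = 0x0000000000010000 | 80   # SDP Service-ID for TCP port 80
print(service_id_matches(sdp_port_80, SDP_PATTERN))
print(service_id_matches(0x0000000000020050, SDP_PATTERN))
```

A value/mask pair keeps the per-request check to one AND and one compare,
so a rule of this kind would cost no more than a range test.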
> 7. SRP
> -------
>
> Current SRP implementation uses its own CM callbacks (not CMA). So SRP
> should fill in the Service-ID in the PR/MPR by itself and use that
> information in setting up the QP. The T10 SRP standard defines the SRP
> Service-ID to be defined by the SRP target I/O Controller (but they should
> also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
> by the I/O Controller in the ServiceEntries DMA attribute and should be
> used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
>
> 8. iSER
> --------
> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> is TBD.
>
>
> 9. OpenSM features
> -------------------
> The QoS related functionality to be provided by OpenSM can be split into
> two main parts:
>
> 3.1. Fabric Setup
> During fabric initialization the SM should parse the policy and apply its
> settings to the discovered fabric elements. The following actions should be
> performed:
> * Parsing of policy
> * Node Group identification. A warning should be provided for each node not
>   specified but found.
> * SL2VL settings should be validated:
>   + A warning will be provided if there are no matching targets for the
>     SL2VL setting statement.
>   + An error message will be printed to the log file if an invalid setting
>     is found. A setting is invalid if it refers to:
>     - Non-existing port numbers of the target devices
>     - Unsupported VLs for the target device. In the latter case the map to
>       non-existing VLs should be replaced by VL15, i.e. packets will be
>       dropped.
> * SL2VL setting is to be performed
> * VL Arbitration table settings should be validated according to the
>   following rules:
>   + A warning will be provided if there are no matching targets for the
>     setting statement
>   + An error will be provided if the port number exceeds the target ports
>   + An error will be generated if the table length exceeds device
>     capabilities
>   + A warning will be generated if the table quotes a VL that is not
>     supported by the target device
> * VL Arbitration tables will be set on the appropriate targets
>
> 3.2. PR/MPR query handling:
> OpenSM should be able to enforce the provided policy on client requests.
> The overall flow for such requests is: first the request is matched against
> the defined match rules such that the target QoS-Level definition is found.
> Given the QoS-Level, a path(s) search is performed with the given
> restrictions imposed by that level. The following two sections describe
> these steps.
>
> How Service-ID is carried in the PathRecord and MultiPathRecord attributes
> is now standardized by the IBTA.
>
>
> 3.2.1. Matching rule search:
> A rule is "matching" a PR/MPR request using the following criteria:
> * Matching rules provide values in a list of either single values or ranges
>   of values. A PR/MPR field is "matching" the rule field if it is explicitly
>   noted in the list of values or is one of the values covered by a range
>   included in the field values list.
> * Only PR/MPR fields that have their component mask bit set should be
>   compared.
> * For a rule to be "matching" a PR/MPR request all the rule fields should
>   be "matching" their PR/MPR fields, such that a PR/MPR request that does
>   not have a component mask bit set for one of the rule-defined fields
>   cannot match that rule.
> * A PR/MPR request that has a component mask bit set for one of the fields
>   that is not defined by the rule can match the rule.
>
> The algorithm to be used for searching for a rule match might be as simple
> as a sequential search through all rules or enhanced for better
> performance. The semantics of every rule field and its matching PR/MPR
> field are described below:
> * Source: the SGID or SLID should be part of this group
> * Destination: the DGID or DLID should be part of this group
> * Service-ID: check if the requested Service-ID (available in the PR/MPR
>   old SM-Key field) is matching any of this rule's Service-IDs
> * TClass: check if the PR/MPR TClass field is matching
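
The matching criteria quoted above can be sketched as a sequential
first-fit search. This is an illustration only: the dictionary layout and
the names `rule_matches` and `find_qos_level` are assumptions, only numeric
fields (Service-ID, QoS-Class) are modelled, and port-group membership for
source and destination is left out.

```python
def in_ranges(value, ranges):
    """ranges is a list of (low, high) pairs; a single value v is (v, v)."""
    return any(lo <= value <= hi for lo, hi in ranges)

def rule_matches(rule, request):
    """rule["match"]: field name -> list of (low, high) ranges.
    request: field name -> value, holding only the fields whose component
    mask bit is set in the PR/MPR."""
    for field, ranges in rule["match"].items():
        if field not in request:      # mask bit not set: rule cannot match
            return False
        if not in_ranges(request[field], ranges):
            return False
    return True                       # extra request fields are ignored

def find_qos_level(rules, request, default_level):
    for rule in rules:                # sequential search, first match wins
        if rule_matches(rule, request):
            return rule["qos_level"]
    return default_level

# Numeric parts of the first two example rules from the policy file.
rules = [
    {"match": {"qos_class": [(7, 9), (11, 11)]}, "qos_level": 1},
    {"match": {"service_id": [(22, 22), (4719, 5000)]}, "qos_level": 2},
]
print(find_qos_level(rules, {"qos_class": 8}, 0))
print(find_qos_level(rules, {"service_id": 4800, "qos_class": 10}, 0))
```

The second request carries a QoS-Class that the first rule rejects, so the
search falls through to the Service-ID rule, as a firewall table would.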
>
> 3.2.2 PR/MPR response generation:
> The QoS-Level pointed to by the first rule that matches the PR/MPR request
> should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
> Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
> matching the query.
>
> An efficient algorithm for finding paths that meet the QoS-Level criteria
> is beyond the scope of this RFC and left for the implementer to provide.
> However, the criteria by which the paths match the QoS-Level are described
> below:
>
> * SL: The paths found should all use the given SL. For that sake the PR/MPR
>   algorithm should traverse the path from source to destination only
>   through ports that carry a valid VL (not VL15) by the SL2VL map (should
>   consider input and output ports and SL).
> * MTU-Limit: The resulting paths' MTU should not exceed the given MTU-Limit
> * Rate-Limit: The resulting paths' RATE should not exceed the given
>   RATE-Limit (rate limit is given in units of link BW = Width*Speed
>   according to IBTA Specification Vol-1 table-205 p-901 l-24).
> * Path-Bits: define the target LID lowest bits (number of bits defined by
>   the target port PortInfo.LMC field). The path should traverse the LFT
>   using the target port LID with the path-bits set.
> * QoS-Class: should be returned in the result PR/MPR. When routing is going
>   to be supported by OpenSM we might use this field in selecting the target
>   router too in a TBD way.
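
To make the Path-Bits criterion concrete, here is a small sketch of how
path-bit values could be combined with a port's base LID. It assumes the
base LID has its low LMC bits clear; `lids_for_path_bits` is a hypothetical
helper, and skipping values that do not fit in LMC bits is a choice of this
sketch, not something the RFC specifies.

```python
def lids_for_path_bits(base_lid, lmc, path_bits):
    """Return the destination LIDs selected by the given path-bit values."""
    span = 1 << lmc                   # a port with this LMC owns 2**LMC LIDs
    if base_lid % span != 0:
        raise ValueError("base LID must have its low LMC bits clear")
    # Keep only path-bit values that fit in the low LMC bits (assumption).
    return [base_lid | bits for bits in path_bits if bits < span]

# Port with base LID 0x40 and LMC = 2, i.e. LIDs 0x40..0x43.
print([hex(lid) for lid in lids_for_path_bits(0x40, 2, [0, 2, 3, 5])])
```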
>
Philippe Gregoire
CEA/DAM


From mst at dev.mellanox.co.il  Fri Jul 27 01:34:38 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 27 Jul 2007 11:34:38 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <46A97850.2030607@osrg.net>
References: <46A97850.2030607@osrg.net>
Message-ID: <20070727083438.GA9912@mellanox.co.il>

> Quoting Yoshiaki Tamura <tamura at osrg.net>:
> Subject: OFED-1.2 on x86 debian
> 
> Hi.
> 
> I'm trying to install OFED-1.2 on an x86 (32-bit) Debian machine.
> Although build_env.sh seems to work on Debian,
> rpmbuild fails to compile both the kernel modules and the userland tools.
> 
> Is OFED-1.2 tested on Debian, or is it totally unsupported?

It's not on the list of supported platforms, but I think we do builds
on Ubuntu, so Debian should work too. Vlad?

-- 
MST


From vlad at lists.openfabrics.org  Fri Jul 27 01:39:47 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 27 Jul 2007 01:39:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070727-0100 daily build status
Message-ID: <20070727083947.77152E60858@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From vlad at lists.openfabrics.org  Fri Jul 27 02:49:14 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 27 Jul 2007 02:49:14 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070727-0200 daily build status
Message-ID: <20070727094915.02179E6085F@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From Sumit.Gaur at Sun.COM  Fri Jul 27 03:17:23 2007
From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem)
Date: Fri, 27 Jul 2007 15:47:23 +0530
Subject: [ofa-general] TransactionID(IB_MAD_TRID_F) description
Message-ID: <46A9C633.7040302@Sun.COM>

Hi,
I have just observed that the TransactionID that I provide in sndbuf to
*umad_send* is not the one that I receive back from the *umad_recv* function.
Looking in more detail, I see that only the low 32 bits of the TID in the
received MAD match those of the sent MAD. Is this behavior of the TID
expected, or is there a suitable way to get all 64 bits of the TID back
instead of only the low 32 bits?
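
For reference, the comparison described above amounts to the following
sketch. It only models the observation that the low 32 bits survive the
round trip; the helper name and the sample values are made up for
illustration and say nothing about why the upper bits differ.

```python
TID_LOW_MASK = 0xFFFFFFFF

def same_transaction(sent_tid, received_tid):
    """Compare only the low 32 bits of two 64-bit transaction IDs."""
    return (sent_tid & TID_LOW_MASK) == (received_tid & TID_LOW_MASK)

sent = 0x1122334455667788       # TID placed in the send buffer (made up)
received = 0x0000000155667788   # TID seen in the response (made up)
print(same_transaction(sent, received))
```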

Thanks and Regards
sumit


From hnguyen at linux.vnet.ibm.com  Fri Jul 27 03:52:49 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 27 Jul 2007 12:52:49 +0200
Subject: [ofa-general] [PATCH 0/2] ehca: remove WARNING: externs should be
	avoided in .c files
Message-ID: <200707271252.50193.hnguyen@linux.vnet.ibm.com>

Hello Roland!
This small patch set fixes some coding-style related issues for ehca:
[1/2] remove checkpatch.pl's warnings "externs should be avoided in .c files"
[2/2] correction include order according kernel coding style
Thanks
Nam


From hnguyen at linux.vnet.ibm.com  Fri Jul 27 03:54:50 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 27 Jul 2007 12:54:50 +0200
Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings
	"externs should be avoided in .c files"
Message-ID: <200707271254.51055.hnguyen@linux.vnet.ibm.com>

From b5d0336089b5ebe5b18acb94b2c94c2026cb95ee Mon Sep 17 00:00:00 2001
From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
Date: Fri, 27 Jul 2007 10:24:49 +0200
Subject: [PATCH] remove checkpatch.pl's warnings "externs should be avoided in .c files"

Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_classes.h |    1 +
 drivers/infiniband/hw/ehca/ehca_mrmw.c    |    2 --
 drivers/infiniband/hw/ehca/ehca_pd.c      |    1 -
 drivers/infiniband/hw/ehca/hcp_if.c       |    1 -
 drivers/infiniband/hw/ehca/ipz_pt_fn.h    |    2 ++
 5 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 3725aa8..b5e9603 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -322,6 +322,7 @@ extern int ehca_static_rate;
 extern int ehca_port_act_time;
 extern int ehca_use_hp_mr;
 extern int ehca_scaling_code;
+extern int ehca_mr_largepage;
 
 struct ipzu_queue_resp {
 	u32 qe_size;      /* queue entry size */
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index c1b868b..773ac3f 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -64,8 +64,6 @@ enum ehca_mr_pgsize {
 	EHCA_MR_PGSIZE16M = 0x1000000L
 };
 
-extern int ehca_mr_largepage;
-
 static u32 ehca_encode_hwpage_size(u32 pgsize)
 {
 	u32 idx = 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c
index 3dafd7f..43bcf08 100644
--- a/drivers/infiniband/hw/ehca/ehca_pd.c
+++ b/drivers/infiniband/hw/ehca/ehca_pd.c
@@ -88,7 +88,6 @@ int ehca_dealloc_pd(struct ib_pd *pd)
 	u32 cur_pid = current->tgid;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
 	int i, leftovers = 0;
-	extern struct kmem_cache *small_qp_cache;
 	struct ipz_small_queue_page *page, *tmp;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index fdbfebe..24f4541 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -758,7 +758,6 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle,
 			     const u64 logical_address_of_page,
 			     const u64 count)
 {
-	extern int ehca_debug_level;
 	u64 ret;
 
 	if (unlikely(ehca_debug_level >= 2)) {
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index c6937a0..a801274 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -54,6 +54,8 @@
 struct ehca_pd;
 struct ipz_small_queue_page;
 
+extern struct kmem_cache *small_qp_cache;
+
 /* struct generic ehca page */
 struct ipz_page {
 	u8 entries[EHCA_PAGESIZE];
-- 
1.5.2


From hnguyen at linux.vnet.ibm.com  Fri Jul 27 03:55:19 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Fri, 27 Jul 2007 12:55:19 +0200
Subject: [ofa-general] [PATCH 2/2] ehca: correction include order according
	kernel coding style
Message-ID: <200707271255.19456.hnguyen@linux.vnet.ibm.com>

From a2794450cbee597cefd7b6e159257583c459d358 Mon Sep 17 00:00:00 2001
From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
Date: Fri, 27 Jul 2007 10:26:40 +0200
Subject: [PATCH] correction include order according kernel coding style

Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_mrmw.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 773ac3f..1180b65 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -40,9 +40,8 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <rdma/ib_umem.h>
-
 #include <asm/current.h>
+#include <rdma/ib_umem.h>
 
 #include "ehca_iverbs.h"
 #include "ehca_mrmw.h"
-- 
1.5.2


From sam at ravnborg.org  Fri Jul 27 04:01:18 2007
From: sam at ravnborg.org (Sam Ravnborg)
Date: Fri, 27 Jul 2007 13:01:18 +0200
Subject: [ofa-general] Re: [PATCH 1/2] ehca: remove checkpatch.pl's warnings
	"externs should be avoided in .c files"
In-Reply-To: <200707271254.51055.hnguyen@linux.vnet.ibm.com>
References: <200707271254.51055.hnguyen@linux.vnet.ibm.com>
Message-ID: <20070727110118.GB12647@uranus.ravnborg.org>

On Fri, Jul 27, 2007 at 12:54:50PM +0200, Hoang-Nam Nguyen wrote:
> From b5d0336089b5ebe5b18acb94b2c94c2026cb95ee Mon Sep 17 00:00:00 2001
> From: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
> Date: Fri, 27 Jul 2007 10:24:49 +0200
> Subject: [PATCH] remove checkpatch.pl's warnings "externs should be avoided in .c files"
> 
> Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>

And you checked that said .h file was indeed included by the .c file that has the original definition?
Otherwise the definition and the declaration can get out of sync without notice.
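
A minimal sketch of the point, as a single stand-alone file rather than the
real ehca sources: the first part stands in for what the header would
contribute, the second for the .c file that owns the variable. The initial
value and the helper mr_largepage_enabled() are made up for illustration.

```c
/* Sketch only: names mirror the patch, but this is not the ehca code. */

/* --- what the shared header (e.g. ehca_classes.h) would contribute --- */
extern int ehca_mr_largepage;

/* --- the defining .c file, which includes that header --- */
/*
 * Because the declaration above is visible here, the compiler checks it
 * against this definition. If the type were changed here without updating
 * the header, the build would fail with a conflicting-types error instead
 * of silently going out of sync.
 */
int ehca_mr_largepage = 1;	/* hypothetical default value */

/* A user of the variable that only ever sees the header declaration. */
int mr_largepage_enabled(void)
{
	return ehca_mr_largepage != 0;
}
```

If the defining file did not include the header, both translation units
would still compile and link, and a type mismatch would go unnoticed.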

	Sam


From eitan at mellanox.co.il  Fri Jul 27 04:27:53 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 27 Jul 2007 14:27:53 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <20070727010707.GR2472@sashak.voltaire.com>
References: <f0e08f230707241138sdd03eapece02d1626a25bb5@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
	<f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
	<20070727010707.GR2472@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com>

The problem I have with a back-to-back plug is that it is a fatal case if
it is found where no such plug was intentionally used.
So we will need some sort of user input on whether it is OK or not.

The case of moving a port in the middle of a sweep can be easily
detected if, instead of reporting an error, a second
check of the original DR where the same GUID was found is performed...
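
A rough sketch of the delayed-abort logic, using the "was duplicated" state
and the counter refinement from the proposal quoted below. This is not
OpenSM code; the struct, enum and function names are hypothetical, and
max_sweeps = 2 corresponds to the plain flag variant.

```c
#include <stdint.h>

/* What the sweep should do after a duplicate GUID report (hypothetical). */
enum dup_action { DUP_RESWEEP, DUP_ABORT };

/* Per-port bookkeeping; dup_count > 0 plays the role of "was duplicated". */
struct port_dup_state {
	uint64_t guid;
	unsigned int dup_count;	/* consecutive sweeps reporting a duplicate */
};

/* Called when a sweep reports this port's GUID as duplicated. */
enum dup_action on_duplicate_guid(struct port_dup_state *p,
				  unsigned int max_sweeps)
{
	p->dup_count++;
	/* Abort only once the duplicate has persisted for max_sweeps sweeps. */
	return p->dup_count >= max_sweeps ? DUP_ABORT : DUP_RESWEEP;
}

/* Called when a sweep completes without a duplicate for this port. */
void on_clean_sweep(struct port_dup_state *p)
{
	p->dup_count = 0;
}
```

A port that merely moved during the sweep would be clean on the forced
second sweep and never reach the abort path.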

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Friday, July 27, 2007 4:07 AM
> To: Eitan Zahavi
> Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> 
> On 09:25 Thu 26 Jul     , Eitan Zahavi wrote:
> > > Hi Eitan, Hal,
> > > 
> > > On 20:44 Wed 25 Jul     , Eitan Zahavi wrote:
> > > > 
> > > > I am not following you.
> > > > Why does a user need to run -y if a simple legal cable 
> connector is 
> > > > plugged?
> > > 
> > > Because duplicated GUIDs detector can abort OpenSM when regular 
> > > port is reconnected to another location during hard sweep. This 
> > > issue is not related to loopback plug at all.
> > I  think we should handle the case of "migrated port" in a 
> more global
> > sense:
> > If a port "moved" during the sweep we have to do a new sweep anyway.
> 
> Another option is just to use recently discovered port 
> location. In case of CA it could work, switch migration can 
> be more complicated.
> 
> > Maybe we could delay the 'abort' to the second sweep.
> >
> > So practically I propose:
> > 1. Add state flag "was duplicated" on the port saying it 
> was reported 
> > as duplicate GUID.
> > 2. Set the variable controlling a forced second sweep 
> (similar to the 
> > one used if we got Set error)
> 
> We can even catch this before drop_manager and just rediscover.
> 
> > 3. Repeat the sweep - if we find a port where it is a duplicate and 
> > the "was duplicated" flag is set - abort.
> >
> > A refinement for the user who is doing many changes 
> continuously might 
> > be to keep a counter.
> > And have the abort happen after the Nth iteration.
> 
> It is a better approach than what we have today.
> 
> > > 
> > > > The issue is only if a "loop back" plug connecting a port
> > > to itself is
> > > > plugged.
> > > 
> > > No, not only. Now there are two completely separate known issues 
> > > with duplicated GUIDs detector:
> > > 
> > > 1. Port moving
> > > 2. Loopback plug
> > > 
> > > And I think that _both_ should be solved. And if just using '-y' 
> > > could be suitable for (2) because it is esoteric 
> (although perfectly 
> > > legal) use, it is not acceptable solution for (1).
> > > 
> > > I think we need to improve GUIDs duplication detector 
> instead. For 
> > > example we could add NodeInfo comparison there, and only 
> in case if 
> > > it is different drop GUIDs duplication error. Also I think this 
> > > should not be fatal error and should not abort OpenSM, 
> just logging 
> > > (probably via syslog too) should be sufficient - 
> non-working port is 
> > > good reason to look at logs. Any other ideas?
> > The problem is that the SM will sort of figure out the network but 
> > will create a completely bogus routing etc.
> 
> Right. But it is not so with back-to-back (when loopback plug 
> could be interpreted as back-to-back duplicated GUID). So no 
> need to abort in this (back-to-back/loopback) case. Agreed?
> 
> Sasha
> 
> > 
> > > 
> > > Sasha
> > > 
> > > > Do users use these plugs? For what sake?
> > > > 
> > > > 
> > > > Eitan Zahavi
> > > > Senior Engineering Director, Software Architect Mellanox
> > > Technologies
> > > > LTD
> > > > Tel:+972-4-9097208
> > > > Fax:+972-4-9593245
> > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > 
> > > >  
> > > > 
> > > > > -----Original Message-----
> > > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > > > > Sent: Wednesday, July 25, 2007 3:19 AM
> > > > > To: Eitan Zahavi
> > > > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik
> > > > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> > > > > 
> > > > > On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
> > > > > > 
> > > > > > 	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 
> > > > > > 
> > > > > > 		Maybe  avoid the log if -y is provided?
> > > > > > 
> > > > > > 	 
> > > > > > 	That avoids the spew but the duplicated GUID is
> > > > > important to know so
> > > > > > IMO something in the "middle" is needed where
> > > duplicated GUIDs are
> > > > > > logged but not continually the same ones.
> > > > > > 	[EZ]  
> > > > > > 	OK so in -y mode only we track which ones were reported
> > > > > and do not
> > > > > > repeat the log?
> > > > > 
> > > > > And how port moving problem should be solved?
> > > > > 
> > > > > We cannot ask an user to run OpenSM with '-y' if in
> > > her/his plans to
> > > > > reconnect some ports in a future and just decrease logging.
> > > > > 
> > > > > Sasha
> > > > > 
> > > 
> 


From suri at baymicrosystems.com  Fri Jul 27 05:26:04 2007
From: suri at baymicrosystems.com (Suresh Shelvapille)
Date: Fri, 27 Jul 2007 08:26:04 -0400
Subject: [ofa-general] opensm off by default
In-Reply-To: <46A99F9C.5040303@cea.fr>
References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr>
Message-ID: <01e501c7d049$50483d10$1914a8c0@surioffice>

Since opensm is off by default in ofed1.2 (which I found out the hard way), can we please
add a note either to the documentation or the ./install.sh menu on how to enable/install
OpenSM?

Thanks,
Suri


From hal.rosenstock at gmail.com  Fri Jul 27 05:47:59 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Jul 2007 08:47:59 -0400
Subject: [ofa-general] opensm off by default
In-Reply-To: <01e501c7d049$50483d10$1914a8c0@surioffice>
References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr>
	<01e501c7d049$50483d10$1914a8c0@surioffice>
Message-ID: <f0e08f230707270547x1d13346bv3b45e3a1ea1574d8@mail.gmail.com>

On 7/27/07, Suresh Shelvapille <suri at baymicrosystems.com> wrote:
> Since opensm is off by default in ofed1.2 (which I found out the hard way), can we please
> add a note either to the documentation or the ./install.sh menu on how to enable/install
> Opensm please.

I think this may be an EWG request rather than OpenIB/Fabrics.

-- Hal

>
> Thanks,
> Suri
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From landman at scalableinformatics.com  Fri Jul 27 06:42:02 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 27 Jul 2007 09:42:02 -0400
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <20070727083438.GA9912@mellanox.co.il>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
Message-ID: <46A9F62A.7080500@scalableinformatics.com>


Michael S. Tsirkin wrote:
>> Quoting Yoshiaki Tamura <tamura at osrg.net>:
>> Subject: OFED-1.2 on x86 debian
>>
>> Hi.
>>
>> I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
>> Although build_env.sh seems to work on debian,
>> it fails compiling both kernel modules and user land tools by rpmbuild.
>>
>> Is OFED-1.2 tested on debian or totally unsupported?
> 
> It's not on a list of supported platforms, but I think we do builds
> on ubuntu so debian should work too. Vlad?

I have been trying to make it work here on Ubuntu (Debian rebuild) 7.04.

Had to hack build_env.sh a little to get it to ignore some of the 
dependency checking (done by package name, which is not portable across 
distros).


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From transter at gmail.com  Fri Jul 27 07:47:25 2007
From: transter at gmail.com (lbt)
Date: Fri, 27 Jul 2007 07:47:25 -0700
Subject: [ofa-general] Lost in-service traps during Open SM migration
In-Reply-To: <20070727025952.GE6691@sashak.voltaire.com>
References: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
	<20070725220204.GI31582@sashak.voltaire.com>
	<ac71172a0707261237wb833b1bq66c64ca39fb3c321@mail.gmail.com>
	<20070727025952.GE6691@sashak.voltaire.com>
Message-ID: <ac71172a0707270747y77ae14eflf7268b2581d113bd@mail.gmail.com>

Hi Sasha,

Yes, the problem seems to appear only when there is an SM migration. I
receive in-service notices for other ports, as long as there is no SM
migration occurring.
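
To make the suspected sequence concrete, here is a toy model of it. It is
not OpenSM code: the struct and the toy_* functions are hypothetical, and
the only thing taken from the logs is that each SM keeps its own
new_ports_list and reports it after LID assignment.

```c
#include <stddef.h>

#define MAX_NEW_PORTS 8

/* Toy stand-in for one SM instance (hypothetical, not an OpenSM type). */
struct toy_sm {
	int is_master;
	size_t n_new_ports;	/* entries on this SM's new_ports_list */
	unsigned long long new_ports[MAX_NEW_PORTS];
};

/* Discovery only records the port; traps are generated later. */
void toy_discover(struct toy_sm *sm, unsigned long long guid)
{
	if (sm->n_new_ports < MAX_NEW_PORTS)
		sm->new_ports[sm->n_new_ports++] = guid;
}

/* Mastership moves to the other SM; the list does not move with it. */
void toy_handover(struct toy_sm *from, struct toy_sm *to)
{
	from->is_master = 0;
	to->is_master = 1;
}

/*
 * After LID assignment the master reports its own list.
 * Returns the number of in-service traps that would be generated.
 */
size_t toy_report_new_ports(struct toy_sm *sm)
{
	size_t sent = sm->is_master ? sm->n_new_ports : 0;

	sm->n_new_ports = 0;
	return sent;
}
```

In this model, a discover on the old master followed by a handover leaves
the new master with an empty list, so the report step generates nothing.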

Thanks,
Lan

On 7/26/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 12:37 Thu 26 Jul     , lbt wrote:
> >  Thanks for the suggestion Sasha!
> >
> >  Our host stack does receive a reregistration notice and does resubscribe
> >  all handlers at that point in time. At the time of the SM migration, our
> >  stack prints out some informational messages to confirm this:
> >  Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
> >  Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8
> >
> >  And also confirmed in the SM logs that after the migration, the higher
> >  priority SM is getting a subscription request for in-service trap:
> >  Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method:
> >  Subscribe Request with QPN: 0x000001
> >  Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
> >  Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
> >  Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
> >                                 gid.....................0x0000000000000000 : 0x0000000000000000
> >                                 lid_range_begin.........0xFFFF
> >                                 lid_range_end...........0x0
> >                                 is_generic..............0x1
> >                                 subscribe...............0x0
> >                                 trap_type...............0x3
> >                                 trap_num................64
> >                                 qpn.....................0x000001
> >                                 resp_time_val...........0x0
> >                                 node_type...............0x000004
> >  Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]
> >
> >  It may be a problem if the resubscription of the in-service handler occurs
> >  after the in-service notice was forwarded, but I think the problem is that
> >  a notice is never forwarded for the higher priority SM port
> >  that is restored.
>
> And after OpenSM migration, did you receive in-service notices for
> other ports? Does the problem happen only at migration time?
>
> >  Perhaps neither SM (the lower priority or the higher priority one)
> >  generates an in-service trap because of the timing gap between when the
> >  restored port is detected and "marked" (i.e. added to new_ports_list)
> >  and when in-service traps are generated for new ports. During SM
> >  migration, the lower priority SM detects the new port, but the higher
> >  priority SM does the trap generation (but it doesn't realize that its
> >  own port is a new port and thus doesn't generate a trap for it).
> >
> >  Our host stack executes some functions when a port is restored (in our
> >  in-service subscription handler).
> >  Am I not supposed to receive an in-service trap for a restored port that
> >  happens to be the Master SM,
>
> Yes, I guess you are.
>
> >  and instead  execute these actions with a
> >  client reregistration event?
>
> Client reregistration request is not suitable here - SM can ask for
> client reregistration at any time (in practice OpenSM now does it only
> when enters MASTER state, but it is also optional).
>
> Sasha
>
> >
> >  Thanks again for your help!
> >  Lan
> >
> >
> >
> >  On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > >
> > > Hi Lan,
> > >
> > > On 09:57 Wed 25 Jul     , lbt wrote:
> > > >  Hello,
> > > >
> > > >  I have been seeing a problem where a subscriber for in-service
> traps is
> > > not
> > > >  getting informed when the port of master openSM is restored (i.e.
> > > causing an
> > > >  SM migration).
> > > >
> > > >  I have an IB subnet with 2 nodes running OpenSM , different
> priorities
> > > of
> > > >  course (OpenSM Rev:openib-2.0.5). I also have another node on the
> > > subnet
> > > >  that has subscribed for the forwarding of any
> > > IB_SA_GENERIC_TRAP_NUM_IN_SVC
> > > >  trap events. I've been doing cable pull tests on the IB ports, to
> check
> > > if
> > > >  the in-service handler I have subscribed gets invoked when I
> restore
> > > the
> > > >  cable. I've noticed that everything works as expected ( i.e. my
> > > in-service
> > > >  handler is invoked) whenever I restore the cable on the lower
> priority
> > > SM IB
> > > >  port without ever touching the master SM port. But if I cause an SM
> > > >  migration, by restoring the port of the higher priority SM, the
> > > in-service
> > > >  trap does not get generated as expected on a cable restore.
> > > >
> > > >  Steps to Reproduce:
> > > >  1) Start with port to higher priority SM disconnected.
> > > >  2) restore port cable on the higher priority SM
> > > >  --> This causes an SM migration as expected; the migration happens okay
> > > >  --> I expected the restoration of the higher priority SM to also
> > > >  trigger an in-service trap and notify subscribers, but it doesn't
> > > >  occur
> > > >
> > > >  I have collected debug messages log for both open SM's, and it
> appears
> > > that
> > > >  the reason is because:
> > > >  1) in-service traps are generated based on what ports are added on the
> > > >  Master SM's new_ports_list, but these traps are generated only after
> > > >  LID assignment
> > > >  2) when the higher priority SM port is restored, the restored port gets
> > > >  added to the lower priority SM's new_ports_list (since it's still the
> > > >  Master SM at that point in time)
> > > >  3) the handover of Master SM from lower priority to higher priority SM
> > > >  occurs (before LID assignment, and thus before there is a chance for
> > > >  traps to get generated for those ports on new_ports_list)
> > > >  4) the higher priority SM is now Master SM, but it has an empty
> > > >  new_ports_list, so no trap is generated either
> > > >
> > > >  Does this look like a legitimate Open SM bug? Any feedback would be
> > > much
> > > >  appreciated, and if I can help further in any way please let me
> know .
> > >
> > > As far as I know when OpenSM (even old like 2.0.5) becomes master it
> > > requests client to reregister SA related stuff (by setting this bit in
> > > PortInfo).
> > >
> > > Probably your port does not support this (you could verify by checking
> > > PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
> > > maybe your host stack doesn't do reregistration?
> > >
> > > Anyway you could track in the OpenSM code, in osm_lid_mgr.c
> > > __osm_lid_mgr_set_physp_pi(), whether the client reregistration bit is
> > > set (with ib_port_info_set_client_rereg()) or not. Then we will know more
> > > about this problem.
> > >
> > > Sasha
> > >
> > > >
> > > >
> > > >  Subset of logs from lower priority SM during the cable restore of
> > > higher
> > > >  priority SM port:
> > > >  ### Jul 18 14:31:56 614522 [41401960] ->
> > > __osm_trap_rcv_process_request:
> > > >  Received Generic Notice type:0x03 num:128 Producer:2 from
> LID:0x000A
> > > >  TID:0x00000016000012e1
> > > >  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process:
> > > Received
> > > >  signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> > > >  ### 14:31:56 ******************** INITIATING HEAVY SWEEP
> > > >  **********************
> > > >  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process:
> > > Received
> > > >  signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> > > >  OSM_SM_STATE_SWEEP_HEAVY_SELF
> > > >  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new:
> Adding
> > > port
> > > >  GUID:0x00504501483e0000 to new_ports_list
> > > >  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process:
> Received
> > > signal
> > > >  OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > > >  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process:
> Received
> > > signal
> > > >  OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> > > OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > > >  14:31:56 ********************* HEAVY SWEEP COMPLETE
> > > ***********************
> > > >  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process:
> Received
> > > >  signal OSM_SM_SIGNAL_HANDOVER_SENT in state
> IB_SMINFO_STATE_MASTER###
> > > >  14:31:56 ******************** ENTERING SM STANDBY STATE
> > > *******************
> > > >
> > > >  Subset of logs from higher priority SM during the cable restore of
> > > higher
> > > >  priority SM port:
> > > >
> > > >  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> > > >  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process:
> Received
> > > >  signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
> > > >  IB_SMINFO_STATE_DISCOVERING
> > > >  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> > > >  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
> > > >  ******************** ENTERING SM MASTER STATE ********************
> > > >  Jul 18 14:32:03 009014 [41401960] ->
> > > __osm_state_mgr_set_sm_lid_done_msg:
> > > >  **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> > > >  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
> > > >  ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> > > >  Jul 18 14:32:03 024052 [41E02960] ->
> __osm_state_mgr_report_new_ports:
> > > [
> > > >  ----> no in-service traps are generated and notices forwarded
> because
> > > there
> > > >  are no ports on this list
> > > >  Jul 18 14:32:03 024057 [41E02960] ->
> __osm_state_mgr_report_new_ports:
> > > ]
> > > >
> > > >
> > > >  Thanks!
> > > >  Lan
> > >

From mshefty at ichips.intel.com  Fri Jul 27 09:15:20 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jul 2007 09:15:20 -0700
Subject: [ofa-general] TransactionID(IB_MAD_TRID_F) description
In-Reply-To: <46A9C633.7040302@Sun.COM>
References: <46A9C633.7040302@Sun.COM>
Message-ID: <46AA1A18.8020501@ichips.intel.com>

> I have just observed that the TransactionID I provide in sndbuf to 
> *umad_send* is not the one I receive back from *umad_recv*. Looking 
> more closely, only the low 32 bits of the TID in the received MAD match 
> the sent MAD. Is this behavior expected, or is there a suitable way 
> to get all 64 bits of the TID back instead of only the low 32 bits?

This is the correct behavior.  The user only has control over the lower 
32-bits of the TID.  The upper 32-bits are reserved by the MAD layer for 
multiplexing purposes.
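
In practice this means a response should be matched on the lower 32 bits
only. A small sketch, assuming the TIDs have already been read out of the
MAD headers into host byte order; the helper names are made up here and are
not part of libibumad:

```c
#include <stdint.h>

/* Lower 32 bits: the part of the TID the caller controls. */
static inline uint32_t tid_user_bits(uint64_t tid)
{
	return (uint32_t)(tid & 0xffffffffULL);
}

/* Upper 32 bits: reserved by the MAD layer for multiplexing. */
static inline uint32_t tid_mad_bits(uint64_t tid)
{
	return (uint32_t)(tid >> 32);
}

/* Match a received MAD to a sent one using only the user-controlled bits. */
static inline int tid_matches(uint64_t sent, uint64_t received)
{
	return tid_user_bits(sent) == tid_user_bits(received);
}
```

Anything the application needs to carry across a request/response pair has
to fit in those 32 bits.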

- Sean


From mshefty at ichips.intel.com  Fri Jul 27 09:24:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jul 2007 09:24:35 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A94657.1020101@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A94657.1020101@ichips.intel.com>
Message-ID: <46AA1C43.20808@ichips.intel.com>

> I think the RDMA CM needs two solutions, depending on which address 
> family is used.  For IPv6, the existing interface is sufficient, and 
> works for both IB and iWarp.  The RDMA CM only needs to include the TC 
> and FL as part of its PR query.  For IPv4, to remain transport neutral, 
> I think we should add an rdma_set_option() routine to specify the QoS 
> field.  The RDMA CM would include the QoS field for PR query under this 
> condition.
> 
> For IB, this requires changes to the ib_sa to support the new PR 
> extensions.  I don't think we gain anything having the RDMA CM include 
> service IDs as part of the query.

I overlooked multicast in my reply.  Unfortunately, the QoS field was 
not added to MCMemberRecord.  For multicast, IPv6 addresses would still 
use the TC and FL provided by the user.  For IPv4, the RDMA CM will 
either need to match the TC and FL of the IPoIB broadcast group or leave 
these fields unspecified.

- Sean


From hal.rosenstock at gmail.com  Fri Jul 27 09:33:25 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Jul 2007 12:33:25 -0400
Subject: [ofa-general] QoS RFC
In-Reply-To: <46AA1C43.20808@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A94657.1020101@ichips.intel.com> <46AA1C43.20808@ichips.intel.com>
Message-ID: <f0e08f230707270933g12ec7e17s60db2889ab7b3369@mail.gmail.com>

On 7/27/07, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > I think the RDMA CM needs two solutions, depending on which address
> > family is used.  For IPv6, the existing interface is sufficient, and
> > works for both IB and iWarp.  The RDMA CM only needs to include the TC
> > and FL as part of its PR query.  For IPv4, to remain transport neutral,
> > I think we should add an rdma_set_option() routine to specify the QoS
> > field.  The RDMA CM would include the QoS field for PR query under this
> > condition.
> >
> > For IB, this requires changes to the ib_sa to support the new PR
> > extensions.  I don't think we gain anything having the RDMA CM include
> > service IDs as part of the query.
>
> I overlooked multicast in my reply.  Unfortunately, the QoS field was
> not added to MCMemberRecord.

Good point.

You can make PR requests with MGID as DGID though.

-- Hal

>  For multicast, IPv6 addresses would still
> use the TC and FL provided by the user.  For IPv4, the RDMA CM will
> either need to match the TC and FL of the IPoIB broadcast group or leave
> these fields unspecified.
>
> - Sean


From mshefty at ichips.intel.com  Fri Jul 27 09:44:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jul 2007 09:44:35 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A99F9C.5040303@cea.fr>
References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr>
Message-ID: <46AA20F3.8010901@ichips.intel.com>

>> SDP uses CMA for building its connections.
>> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
>> holding the remote TCP/IP Port Number to connect to.
>> SDP might be provided with SO_PRIORITY socket option. In that case the 
>> value
>> provided should be sent to the CMA as the TClass option of that 
>> connection.
>>
> This requires modifications to applications and does not allow a global 
> definition of QoS for all SDP applications in the fabric.
> This is inconsistent with libsdp, which is provided to migrate TCP/IP 
> applications to SDP transparently.
> If the matching rules allow some kind of bitmask pattern matching, we 
> can define something like:
> qos-match-rule
>            use: all SDP applications
>            service-id: 0x000000000001????
>            qos-level-sn: 2
>        end-qos-match-rule

Please see my response from yesterday.  I believe we can eliminate the 
use of the service ID for SDP, and instead rely on the IPv6 address or 
socket options.

My suggestions for the host stack restrict the use of the service ID to 
SRP.  If SRP were to provide a QoS parameter instead, we could avoid any 
use of service ID in our implementation.  However, I don't know the 
scope required to support that change.

- Sean
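
For illustration only: the wildcard rule quoted above
(service-id: 0x000000000001????) amounts to a mask-and-compare on the
64-bit service ID. A minimal Python sketch follows; the names
sdp_service_id and matches_sdp are hypothetical helpers, not part of
OpenSM, the CMA, or any OFED API.

```python
# Hypothetical helpers illustrating the 0x000000000001PPPP layout.
SDP_PREFIX = 0x0000000000010000  # service ID with PPPP (the port) set to 0
SDP_MASK = 0xFFFFFFFFFFFF0000    # ignore the low 16 bits, i.e. the '????'


def sdp_service_id(port: int) -> int:
    """Build the SDP service ID for a remote TCP/IP port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return SDP_PREFIX | port


def matches_sdp(service_id: int) -> bool:
    """Return True if the service ID matches 0x000000000001????."""
    return (service_id & SDP_MASK) == SDP_PREFIX


if __name__ == "__main__":
    sid = sdp_service_id(80)
    print(f"0x{sid:016x}", matches_sdp(sid))
```

A single mask/value pair like this would cover every SDP port with one
rule, which is the point of the bitmask suggestion.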


From mshefty at ichips.intel.com  Fri Jul 27 09:45:56 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jul 2007 09:45:56 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <f0e08f230707270933g12ec7e17s60db2889ab7b3369@mail.gmail.com>
References: <46A283B6.1070105@dev.mellanox.co.il>	
	<46A94657.1020101@ichips.intel.com>
	<46AA1C43.20808@ichips.intel.com>
	<f0e08f230707270933g12ec7e17s60db2889ab7b3369@mail.gmail.com>
Message-ID: <46AA2144.70102@ichips.intel.com>

> You can make PR requests with MGID as DGID though.

I thought about that, but the QoS field isn't defined for PR responses, 
only requests (Get, GetTable).

- Sean


From mshefty at ichips.intel.com  Fri Jul 27 09:59:54 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 27 Jul 2007 09:59:54 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <f0e08f230707270933g12ec7e17s60db2889ab7b3369@mail.gmail.com>
References: <46A283B6.1070105@dev.mellanox.co.il>	
	<46A94657.1020101@ichips.intel.com>
	<46AA1C43.20808@ichips.intel.com>
	<f0e08f230707270933g12ec7e17s60db2889ab7b3369@mail.gmail.com>
Message-ID: <46AA248A.3010808@ichips.intel.com>

>> I overlooked multicast in my reply.  Unfortunately, the QoS field was
>> not added to MCMemberRecord.
> 
> Good point.
> 
> You can make PR requests with MGID as DGID though.

My bad here.  We need to specify the SL, FL, and TC when creating the 
multicast group, so the required QoS -> TC, FL mapping needs to be done 
by the user.  So, the RDMA CM will need to use the IPoIB broadcast group 
information.

- Sean


From rick.jones2 at hp.com  Fri Jul 27 10:07:47 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 27 Jul 2007 10:07:47 -0700
Subject: [ofa-general] OFED-1.2 on x86 debian
In-Reply-To: <46A97850.2030607@osrg.net>
References: <46A97850.2030607@osrg.net>
Message-ID: <46AA2663.4060709@hp.com>

Yoshiaki Tamura wrote:
> Hi.
> 
> I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
> Although build_env.sh seems to work on debian,
> it fails compiling both kernel modules and user land tools by rpmbuild.
> 
> Is OFED-1.2 tested on debian or totally unsupported?

When I tried to do that with ia64 Debian I was directed towards some tar files
of the mods rather than the install.sh stuff.  I don't have the pointers at my
fingertips, but would assume they remain in the list archives.

rick jones


From hal.rosenstock at gmail.com  Fri Jul 27 12:03:47 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Jul 2007 15:03:47 -0400
Subject: [ofa-general] ibutils building
Message-ID: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>

Hi Eitan,

When building ibutils (master), I get the following error:

gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include
-I/usr/local/include/infiniband -I/usr/local/include -DDEBUG -D_DEBUG
-D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB
-D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -fno-strict-aliasing
-fPIC -DIBIS_VERSION=\"1.2\" -g -O2 -MT ibis.lo -MD -MP -MF
.deps/ibis.Tpo -c ibis.c  -fPIC -DPIC -o .libs/ibis.o
ibis.c:41:25: git_version.h: No such file or directory

Any idea? What's git_version.h? Thanks.

-- Hal


From hal.rosenstock at gmail.com  Fri Jul 27 12:32:50 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 27 Jul 2007 15:32:50 -0400
Subject: [ofa-general] ibutils/ibdm building
Message-ID: <f0e08f230707271232m11e651d3kd7356b972c26e228@mail.gmail.com>

Hi again Eitan,

When building ibutils/ibdm (master), I get the following error:

if g++ -DHAVE_CONFIG_H -I. -I. -I..  -I../datamodel   -g -O2 -MT
osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o osm_check.o
osm_check.cpp; \
then mv -f ".deps/osm_check.Tpo" ".deps/osm_check.Po"; else rm -f
".deps/osm_check.Tpo"; exit 1; fi
osm_check.cpp: In function `int main(int, char**)':
osm_check.cpp:428: `R_OK' undeclared (first use this function)
osm_check.cpp:428: (Each undeclared identifier is reported only once for each
  function it appears in.)
osm_check.cpp:428: `access' undeclared (first use this function)

Thanks.

-- Hal


From kliteyn at mellanox.co.il  Fri Jul 27 21:05:10 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 28 Jul 2007 07:05:10 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-28:normal completion
Message-ID: <MTLEXCH019CDBHy3Spw00000b2d@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Wed_Jul_25_02:46:48_2007 [d06c318cb50ddddf55b20a5e896d2d22d7b90948]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=520  Pass=467  Fail=53
 
 
Pass:
39 Stability IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
12 FatTree merge-roots-4-ary-2-tree.topo

Failures:
39 Pkey IS1-16.topo
13 Pkey IS3-128.topo
1 FatTree merge-roots-4-ary-2-tree.topo


From vlad at lists.openfabrics.org  Sat Jul 28 01:39:07 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 28 Jul 2007 01:39:07 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070728-0100 daily build status
Message-ID: <20070728083907.95087E60848@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From vlad at lists.openfabrics.org  Sat Jul 28 02:51:04 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sat, 28 Jul 2007 02:51:04 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070728-0200 daily build status
Message-ID: <20070728095104.BEE66E60848@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.22
Passed on ia64 with linux-2.6.22
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:


From eitan at mellanox.co.il  Sat Jul 28 13:01:51 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 28 Jul 2007 23:01:51 +0300
Subject: [ofa-general] RE: ibutils building
In-Reply-To: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>
References: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com>

git_version.h is a new automatically generated file that I added to
Makefile.am, ibis.i, Fabric.cpp and sim.i.
Is it possible you did not rerun autogen.sh?

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
> Sent: Friday, July 27, 2007 10:04 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General
> Subject: ibutils building
> 
> Hi Eitan,
> 
> When building ibutils (master), I get the following error:
> 
> gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include 
> -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG 
> -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB 
> -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 
> -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g 
> -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c  -fPIC 
> -DPIC -o .libs/ibis.o
> ibis.c:41:25: git_version.h: No such file or directory
> 
> Any idea ? What's git-version.h ? Thanks.
> 
> -- Hal
> 


From hal.rosenstock at gmail.com  Sat Jul 28 13:28:49 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 28 Jul 2007 16:28:49 -0400
Subject: [ofa-general] Re: ibutils building
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com>
References: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com>
Message-ID: <f0e08f230707281328kdee92f6q8fdd36abc85fdd4c@mail.gmail.com>

On 7/28/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> Git_version.h is a new automatically generated file I added to the
> Makefile.am and ibis.i Fabric.cpp and sim.i.
> Is it possible you did not rerun autogen.sh ?

Nope; I ran that prior to configuring.

-- Hal



From eitan at mellanox.co.il  Sat Jul 28 13:49:18 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 28 Jul 2007 23:49:18 +0300
Subject: [ofa-general] RE: ibutils/ibdm building
In-Reply-To: <f0e08f230707271232m11e651d3kd7356b972c26e228@mail.gmail.com>
References: <f0e08f230707271232m11e651d3kd7356b972c26e228@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com>

Hi Hal

This does not reproduce on my fresh clone.
So I am not sure what is going on.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
> Sent: Friday, July 27, 2007 10:33 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General
> Subject: ibutils/ibdm building
> 
> Hi again Eitan,
> 
> When building ibutils/ibdm (master), I get the following error:
> 
> if g++ -DHAVE_CONFIG_H -I. -I. -I..  -I../datamodel   -g -O2 -MT
> osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o 
> osm_check.o osm_check.cpp; \ then mv -f ".deps/osm_check.Tpo" 
> ".deps/osm_check.Po"; else rm -f ".deps/osm_check.Tpo"; exit 1; fi
> osm_check.cpp: In function `int main(int, char**)':
> osm_check.cpp:428: `R_OK' undeclared (first use this function)
> osm_check.cpp:428: (Each undeclared identifier is reported 
> only once for each
>   function it appears in.)
> osm_check.cpp:428: `access' undeclared (first use this function)
> 
> Thanks.
> 
> -- Hal
> 


From sashak at voltaire.com  Sat Jul 28 13:53:07 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 28 Jul 2007 23:53:07 +0300
Subject: [ofa-general] Re: ibutils building
In-Reply-To: <f0e08f230707281328kdee92f6q8fdd36abc85fdd4c@mail.gmail.com>
References: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com>
	<f0e08f230707281328kdee92f6q8fdd36abc85fdd4c@mail.gmail.com>
Message-ID: <20070728205307.GC12351@sashak.voltaire.com>

On 16:28 Sat 28 Jul     , Hal Rosenstock wrote:
> On 7/28/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> > Git_version.h is a new automatically generated file I added to the
> > Makefile.am and ibis.i Fabric.cpp and sim.i.
> > Is it possible you did not rerun autogen.sh ?
> 
> Nope; I ran that prior to configuring.

The same problem is here (after ./autogen.sh).

Sasha



From eitan at mellanox.co.il  Sat Jul 28 14:05:50 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 29 Jul 2007 00:05:50 +0300
Subject: [ofa-general] RE: ibutils building
In-Reply-To: <f0e08f230707281328kdee92f6q8fdd36abc85fdd4c@mail.gmail.com>
References: <f0e08f230707271203y4725f6b6vdab1ae802523c47e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com>
	<f0e08f230707281328kdee92f6q8fdd36abc85fdd4c@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B60@mtlexch01.mtl.com>

Just reproduced it on a fresh checkout.
A fix was just pushed.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
> Sent: Saturday, July 28, 2007 11:29 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General
> Subject: Re: ibutils building
> 
> On 7/28/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> > Git_version.h is a new automatically generated file I added to the 
> > Makefile.am and ibis.i Fabric.cpp and sim.i.
> > Is it possible you did not rerun autogen.sh ?
> 
> Nope; I ran that prior to configuring.
> 
> -- Hal
> 


From sashak at voltaire.com  Sat Jul 28 14:55:27 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 29 Jul 2007 00:55:27 +0300
Subject: [ofa-general] Re: pkey.sim.tcl
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com>
References: <f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
	<20070726224133.GC2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com>
Message-ID: <20070728215527.GH12351@sashak.voltaire.com>

Hi Eitan,

On 07:56 Fri 27 Jul     , Eitan Zahavi wrote:
> > 
> > On 09:26 Thu 26 Jul     , Eitan Zahavi wrote:
> > > 
> > > I am happy you actually use the simulator.
> > > Please provide more info regarding the failure. You should tar 
> > > compress the /tmp/ibmgtsim.XXXX of your run.
> > 
> > I can send this for you if you want, but the failure is trivial.
> No need if you already know where the bug is...
> > 
> > Yes, and it is due (6), where default Pkey is removed 
> > "externally". I'm not sure that OpenSM should handle the case 
> > when pkey table is modified externally by something which is not SM.
> > 
> 
> For a few years it just worked fine, so I wonder why this functionality
> was removed.
> It is a really BAD case when Pkeys are altered, but I think it would be
> wise to "refresh" these tables on heavy sweep.

We discussed how and when the port table refresh should be done just a
few days ago in this thread. My impression was that we were "in sync"
about this.

> In general it seems OpenSM has lost its "heavy sweep" concept. Now it
> does not refresh the fabric setup even on heavy sweep.

Not on each heavy sweep, but it does when needed or when the data could
have changed. I don't think the concept has changed; it was just
optimized. Let's look at the numbers:

$ time ./opensm/opensm -e -f ./osm.log -o
...
SUBNET UP
Exiting SM

real    0m7.995s
user    0m4.488s
sys     0m6.072s

$ time ./opensm/opensm -e -f ./osm.log -o --qos
...
SUBNET UP
Exiting SM

real    0m22.521s
user    0m10.921s
sys     0m17.173s


These are simulated runs (with ibsim); the fabric is ~1300 nodes.

The difference is the '--qos' flag: OpenSM skips the SL2VL and VLArb
updates in the first run and performs them in the second. Sweep times
are 8 versus 22 seconds.

> This assumes "perfect" HW and software, and I really think we
> should have preserved that capability.

What about an option? Now with subn->need_update flag (which always
enforces updates) it is trivial to implement.

> Note that a "heavy sweep" does not happen unless something changed or
> trapped.

Yes, for example when some port was connected/disconnected, some node
rebooted, etc. OpenSM starts a huge heavy sweep, it takes a while, the SA
is unresponsive most of the time, TCP connections over IPoIB time out,
and applications fail. This is production experience... :(

Sasha
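
As a quick check of the two `time` outputs quoted above, the wall-clock
cost of the SL2VL and VLArb updates can be worked out directly. This is
only arithmetic on the posted "real" figures; to_seconds is a
hypothetical helper written for this note.

```python
def to_seconds(stamp: str) -> float:
    """Convert a shell `time` stamp such as '0m7.995s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# "real" times from the two runs posted in this message
without_qos = to_seconds("0m7.995s")
with_qos = to_seconds("0m22.521s")

extra = with_qos - without_qos   # added wall-clock time with --qos
ratio = with_qos / without_qos   # slowdown factor

if __name__ == "__main__":
    print(f"extra: {extra:.3f}s, ratio: {ratio:.2f}x")
```

On these numbers the '--qos' run takes about 14.5 seconds longer, a bit
under three times the sweep time on the ~1300-node simulated fabric.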


From sashak at voltaire.com  Sat Jul 28 15:15:40 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 29 Jul 2007 01:15:40 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com>
References: <f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
	<20070727010707.GR2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com>
Message-ID: <20070728221540.GI12351@sashak.voltaire.com>

On 14:27 Fri 27 Jul     , Eitan Zahavi wrote:
> The problem I have with a back-to-back plug is that it is a fatal case if
> found in a case where there was no use of this plug.
> So we will need some sort of user input on whether it is OK or not.

Ok, and let's add cl_qmap_count() check there.

> The case of moving a port in the middle of a sweep can be easily
> detected if, instead of reporting an error, a second check is performed
> of the original DR where the same GUID was found...

Do you mean resending the NodeInfo request to the original location?
Assuming so, I guess it should be done instead of a second heavy sweep,
and it is a good idea. The only small downside I can see is potential
timeouts (and discovery slowdown). But anyway it is much better than a
fatal error. Thanks!

Sasha
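
The re-check discussed above could look roughly like the sketch below.
It is not OpenSM code: classify_repeated_guid and the query_guid
callback are hypothetical stand-ins for "resend NodeInfo along the
original directed route", and a timeout is modelled as None.

```python
from typing import Callable, Optional, Tuple

Path = Tuple[int, ...]  # directed route expressed as a tuple of port numbers

# Hypothetical callback: returns the GUID answering at a directed route,
# or None if the NodeInfo query times out.
QueryGuid = Callable[[Path], Optional[int]]


def classify_repeated_guid(guid: int, original_path: Path,
                           query_guid: QueryGuid) -> str:
    """Classify a GUID that was just discovered at a second location."""
    answer = query_guid(original_path)
    if answer == guid:
        # The same GUID still answers at the original location: a real
        # duplicate, or a loopback / back-to-back plug needing user input.
        return "duplicate-or-loopback"
    # The original location timed out or now reports another GUID, so
    # the port most likely moved in the middle of the sweep.
    return "moved"


if __name__ == "__main__":
    fabric = {(0, 1): 0xABC}  # toy fabric: directed route -> GUID
    print(classify_repeated_guid(0xABC, (0, 1), fabric.get))
    print(classify_repeated_guid(0xABC, (0, 2), fabric.get))
```

The extra query is where the potential timeout (and discovery
slowdown) mentioned above would come from.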


From envio10006 at gmail.com  Sat Jul 28 17:09:18 2007
From: envio10006 at gmail.com (Imedeen)
Date: Sat, 28 Jul 2007 20:09:18 -0400
Subject: [ofa-general] Rejuvenezca la piel de todo su cuerpo ...
Message-ID: <9484122-220077029091820@Mauricio>



El despacho a regiones es v&iacute;a Tur Bus, el flete es por pagar, el pedido se cancela con un dep&oacute;sito en la Cta.Cte de la empresa

  
Este mensaje se enva en base al art. 28b de la ley 19.955 que reforma la la ley de derechos del consumidor, y los artculos 2 y 4 de la ley 19.628 sobre proteccin de la vida privada o datos de carcter personal, todo esto en conformidad a los numerales 4 y 12 de la constitucin poltica. Su direccin ha sido extrada manualmente por personal de nuestra compaa desde su sitio Web en Internet, o ha sido introducida por usted al aceptar el envo de mensajes publicitarios al inscribirse en alguno de los sitios o foros de nuestra Red de trabajo. Para ser removido presione Borrarme de su Base de Datos

 
Todos los productos se despachan en Santiago sin costo de Lunes a Domingo , tambien se despacha a todo Chile via Tur Bus o Pullman Bus (La encomienda se cancela en destino final) 

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070728/56ccb6c6/attachment.html>

From rdreier at cisco.com  Sat Jul 28 20:09:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 28 Jul 2007 20:09:14 -0700
Subject: [ofa-general] Re: [PATCH] amso1100: QP init bug in amso driver
References: <1185305512.20489.6.camel@trinity.ogc.int>
Message-ID: <adatzrnvrc5.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Sat Jul 28 20:34:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 28 Jul 2007 20:34:14 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: fix double-kfree in mlx4_mr_alloc
	error flow
References: <200707261116.58679.jackm@dev.mellanox.co.il>
Message-ID: <adamyxfvq6h.fsf@cisco.com>

thanks, applied.


From rdreier at cisco.com  Sat Jul 28 20:39:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 28 Jul 2007 20:39:14 -0700
Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings
	"externs should be avoided in .c files"
References: <200707271254.51055.hnguyen@linux.vnet.ibm.com>
Message-ID: <adahcnnvpy5.fsf@cisco.com>

the patch looks fine except your mailer seems to have mangled
it... can you resend so I can apply it?

thanks...


From rdreier at cisco.com  Sat Jul 28 20:39:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 28 Jul 2007 20:39:14 -0700
Subject: [ofa-general] Re: [PATCH 2/2] ehca: correction include order
	according kernel coding style
References: <200707271255.19456.hnguyen@linux.vnet.ibm.com>
Message-ID: <adabqdvvpy5.fsf@cisco.com>

thanks, I applied this by hand since it was so trivial.


From kliteyn at mellanox.co.il  Sat Jul 28 21:38:00 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 29 Jul 2007 07:38:00 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-29:normal completion
Message-ID: <MTLEXCH01rAuWiqJmdT00000c68@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Wed_Jul_25_02:46:48_2007 [d06c318cb50ddddf55b20a5e896d2d22d7b90948]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=560  Pass=504  Fail=56
 
 
Pass:
42 Stability IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures:
42 Pkey IS1-16.topo
14 Pkey IS3-128.topo


From rdreier at cisco.com  Sat Jul 28 21:39:14 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 28 Jul 2007 21:39:14 -0700
Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4
References: <20070724121440.GA2775@minantech.com>
Message-ID: <adatzrnu8lp.fsf@cisco.com>

thanks, good catch.  applied.


From dotanb at dev.mellanox.co.il  Sat Jul 28 23:53:36 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 29 Jul 2007 09:53:36 +0300
Subject: [ofa-general] [PATCH] perftest: Fix deleting the utils files in
	"make clean"
Message-ID: <200707290953.36390.dotanb@dev.mellanox.co.il>

Fix deleting the utils files in "make clean".

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

Index: connectx_user/src/userspace/perftest/Makefile
===================================================================
--- connectx_user.orig/src/userspace/perftest/Makefile	2007-07-26 08:02:02.000000000 +0300
+++ connectx_user/src/userspace/perftest/Makefile	2007-07-29 09:38:40.000000000 +0300
@@ -16,6 +16,6 @@ ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES}
 	$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@
 clean:
 	$(foreach fname,${TESTS}, rm -f ib_${fname})
-	rm -f ${UTILS}	
+	$(foreach fname,${UTILS}, rm -f ib_${fname})
 .DELETE_ON_ERROR:
 .PHONY: all clean


From ogerlitz at voltaire.com  Sun Jul 29 01:25:09 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 29 Jul 2007 11:25:09 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070726172619.GA5208@mellanox.co.il>
References: <46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>	<46A46A1D.6040000@voltaire.com>
	<46A4EF00.9070305@ichips.intel.com>	<46A5C8E6.5020906@voltaire.com>
	<46A628D8.4050109@ichips.intel.com>	<46A6F50C.5000906@voltaire.com>
	<46A78146.1090304@ichips.intel.com>	<46A846FC.5040704@voltaire.com>
	<46A8D80C.1090305@ichips.intel.com>
	<20070726172619.GA5208@mellanox.co.il>
Message-ID: <46AC4EE5.3080806@voltaire.com>

Michael S. Tsirkin wrote:
>> I believe a better solution is for everyone to use cached records, if 
>> they exist, with a feedback mechanism from the CM that removes paths on 
>> a connection failure or path migration event.

> Ack timeout on an RC QP is also a good indication we should redo the lookup.

I am not following you.

The only way I know of for IB SW to become aware of an ACK timeout is to 
set the RC QP retry count to zero (and even in this scheme the QP can 
move to the error state for other reasons).

Indeed, IPoIB-CM uses RC and sets zero retries, but we have already 
agreed that, going forward, the default IB transport for IPoIB-CM would 
change to UC, didn't we?

Or.


From ogerlitz at voltaire.com  Sun Jul 29 01:27:08 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 29 Jul 2007 11:27:08 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070726174700.GB5208@mellanox.co.il>
References: <46A46A1D.6040000@voltaire.com>
	<46A4EF00.9070305@ichips.intel.com>	<46A5C8E6.5020906@voltaire.com>
	<46A628D8.4050109@ichips.intel.com>	<46A6F50C.5000906@voltaire.com>
	<46A78146.1090304@ichips.intel.com>	<46A846FC.5040704@voltaire.com>
	<46A8D80C.1090305@ichips.intel.com>	<20070726172619.GA5208@mellanox.co.il>	<46A8DBED.40808@ichips.intel.com>
	<20070726174700.GB5208@mellanox.co.il>
Message-ID: <46AC4F5C.6030605@voltaire.com>

Michael S. Tsirkin wrote:
>> Do you know if we get a specific event for this?  (I don't remember.) 

> CQE with error IIRC.

OK, got you, CQE with "retries exceeded", sorry.

Or.


From ogerlitz at voltaire.com  Sun Jul 29 01:32:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 29 Jul 2007 11:32:27 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070726181132.GO19768@obsidianresearch.com>
References: <46A46637.3080104@voltaire.com>
	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
	<20070726181132.GO19768@obsidianresearch.com>
Message-ID: <46AC509B.6020206@voltaire.com>

Jason Gunthorpe wrote:
> The existing trap monitoring in Sean's module covers about 90% of the
> cases in IB when you need to invalidate a PR, the last 10% will need
> something new :(

So be it. Do you agree that the last 10% should not prevent merging the 
local SA into the upstream code?

Or.


From ogerlitz at voltaire.com  Sun Jul 29 01:27:32 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 29 Jul 2007 11:27:32 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A8D80C.1090305@ichips.intel.com>
References: <adalkdl43w0.fsf@cisco.com>	<46A2F696.4060007@voltaire.com>	<adafy3f22z5.fsf@cisco.com>	<46A46637.3080104@voltaire.com>	<20070723083020.GD20614@mellanox.co.il>
	<46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
Message-ID: <46AC4F74.2000904@voltaire.com>

Sean Hefty wrote:
> Administrators can enable or disable the cache.  I don't believe that 
> individual applications should be able to override the administrator, 
> nor do I think we gain anything by having per application settings. This 
> is similar to exposing to applications whether they want to use cached 
> ARP information every time they connect.

Applications --can-- delete the network stack's neighbour entry before 
taking this or that action.

>> For example, I think it would be correct for IB block and file I/O 
>> ULPs (iSER, SRP, Lustre, rNFS, etc) to request non cached PR, as their 
>> connecting model is not all-to-all but rather n-to-m (n clients to m 
>> servers with m << n), the connections are long-lived (hours, days, 
>> weeks, more) and a connection failure as of PR caching does not seem 
>> acceptable.

> I believe a better solution is for everyone to use cached records, if 
> they exist, with a feedback mechanism from the CM that removes paths on 
> a connection failure or path migration event.

That's an interesting point. What's the conceptual difference between a 
CM connection failure caused by a "wrong" PR and the failure of a 
--unicast-- ARP probe initiated by the network stack? CM feedback to the 
local SA seems the correct approach to me; however, I don't see the 
equivalent for UD communication.

> With all to all connections over the rdma cm, the first thing that needs 
> to be done is resolve the remote addresses to GIDs.  This causes an ARP 
> storm, followed by an SA storm caused by IPoIB, followed by a second SA 
> storm caused by the rdma cm.  For scalability, we need to remove both of 
> these SA storms, not just the second.  We don't see the first SA storm 
> today because IPoIB caches PRs.  Let's not add it.  Restricting caching 
> to the rdma cm, but removing it from IPoIB leaves us with the same 
> issues that we have today.

Again, the typical I/O client-server scheme is n-to-m where m is small 
(1, 2, say up to a few tens). The PRs are needed by IPoIB only at the 
--passive-- side, which sends the unicast ARP reply. So when n=1024 and 
m=4 the SA would need to serve 4096 PRs, which is about one fourth of the 
queries/second rate you reported in earlier threads on the matter.
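
The estimate above can be checked with a short sketch. It assumes, as the paragraph does, that only the passive (server) side needs one PR per client, so the SA serves n*m records in total; the function name is hypothetical and not part of any IB stack.

```python
def passive_side_path_records(n_clients: int, m_servers: int) -> int:
    """PRs the SA must serve if each of m servers resolves a path to each of n clients.

    Assumption (from the text above): the active side does not need a PR
    for IPoIB, only the passive side that sends the unicast ARP reply does.
    """
    return n_clients * m_servers


total = passive_side_path_records(1024, 4)
print(total)
```

With n=1024 and m=4 this gives the 4096 PRs quoted above; how that compares with the SA's queries/second rate depends on the figures from the earlier threads, which are not repeated here.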

Or.


From vlad at lists.openfabrics.org  Sun Jul 29 01:43:24 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 29 Jul 2007 01:43:24 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070729-0100 daily build status
Message-ID: <20070729084325.2EDD2E6085B@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.12
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From eitan at mellanox.co.il  Sun Jul 29 02:00:33 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 29 Jul 2007 12:00:33 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <20070728221540.GI12351@sashak.voltaire.com>
References: <f0e08f230707241155v225aeed7s172f5cfb9024fc0e@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
	<f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
	<20070727010707.GR2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com>
	<20070728221540.GI12351@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com>

> On 14:27 Fri 27 Jul     , Eitan Zahavi wrote:
> > The problem I have with back-to-back plug is that it is a fatal case 
> > if found in a case where there was no use of this plug.
> > So we will need some sort of user input if it is OK or not.
> 
> Ok, and let's add cl_qmap_count() check there.
Not following you.
> 
> > The case of moving a port in the middle of a sweep can be easily 
> > detected if instead of reporting an error a second check of the 
> > original DR where the same GUID was found is performed...
> 
> Do you mean to resend NodeInfo request to the original location?
> Assuming so, I guess it should be instead of second heavy 
> sweep, and it is a good idea. The only small downside of this 
> I can see is potential timeouts (and discovery slowdown). But 
> anyway it is much better than a fatal error. Thanks!

So we are in line on this one.
Instead of changing the order of things, we could generate a list of DRs
that are to be re-scanned during drop-mgr and then abort if they really
are duplicates.

> 
> Sasha
> 


From eitan at mellanox.co.il  Sun Jul 29 02:11:05 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 29 Jul 2007 12:11:05 +0300
Subject: [ofa-general] RE: pkey.sim.tcl
In-Reply-To: <20070728215527.GH12351@sashak.voltaire.com>
References: <f0e08f230707230830x5d0f73ccib09421f3f660e563@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
	<20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
	<20070726224133.GC2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com>
	<20070728215527.GH12351@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com>

Regarding the test:
Once I know the exact condition that causes a full re-sweep, I will use
it in the test.
In OFED 1.2 it was enough to set the ChangeBit on one switch to force a
full reconfiguration.

Regarding the incremental flow in general:
1. Yes - it is good.
2. But we must make sure it is robust enough that we do not lose nodes
    or functionality under extreme cases of reboot or HW errors.
3. We should have a way to force a full sweep without killing the SM:
As clusters grow, there is a growing chance that "soft errors" will hit
the devices.
Most of the device memory is guarded, so corruption there would be
detected automatically.
However, I think it is wise to allow the user to force a full
reconfiguration without making the SM "go away".

Regarding OpenSM not responding to SA queries during a sweep:
This is because there is no "double buffer" for the internal DB.
So whenever the SM starts a sweep, the SA sees an "empty" DB.
The solution to that problem may be to keep a "previous" DB during
sweeps.
I suspect that approach will also enable a fine-grained incremental
capability.
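
The "previous DB" idea above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and method names are hypothetical, it is not OpenSM code, and a plain dict stands in for the internal DB.

```python
import threading


class DoubleBufferedDb:
    """Toy model: SA queries read the last published DB while a sweep fills a new one."""

    def __init__(self):
        self._lock = threading.Lock()
        self._published = {}   # "previous" DB, what SA queries see
        self._building = None  # DB being filled by the sweep in progress

    def begin_sweep(self):
        # Start from an empty working copy; the published DB is left untouched.
        self._building = {}

    def record(self, guid, info):
        if self._building is None:
            raise RuntimeError("no sweep in progress")
        self._building[guid] = info

    def end_sweep(self):
        if self._building is None:
            raise RuntimeError("no sweep in progress")
        # Publish the new DB in one step so a query never sees a half-built one.
        with self._lock:
            self._published, self._building = self._building, None

    def query(self, guid):
        with self._lock:
            return self._published.get(guid)


db = DoubleBufferedDb()
db.begin_sweep()
db.record("guid-1", "port up")
db.end_sweep()

db.begin_sweep()                   # second sweep starts...
during_sweep = db.query("guid-1")  # ...but queries still see the first result
db.record("guid-1", "port down")
db.end_sweep()
after_sweep = db.query("guid-1")
```

In this model a query issued during the second sweep still returns "port up", and only sees "port down" once the sweep has finished and its DB has been published.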

Eitan


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

 
> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com] 
> Sent: Sunday, July 29, 2007 12:55 AM
> To: Eitan Zahavi
> Cc: Yevgeny Kliteynik; Hal Rosenstock; general at lists.openfabrics.org
> Subject: Re: pkey.sim.tcl
> 
> Hi Eitan,
> 
> On 07:56 Fri 27 Jul     , Eitan Zahavi wrote:
> > > 
> > > On 09:26 Thu 26 Jul     , Eitan Zahavi wrote:
> > > > 
> > > > I am happy you actually use the simulator.
> > > > Please provide more info regarding the failure. You should tar 
> > > > compress the /tmp/ibmgtsim.XXXX of your run.
> > > 
> > > I can send this for you if you want, but the failure is trivial.
> > No need if you already know where the bug is...
> > > 
> > > Yes, and it is due to (6), where default Pkey is removed "externally". 
> > > I'm not sure that OpenSM should handle the case when the pkey table is 
> > > modified externally by something which is not SM.
> > > 
> > 
> > For a few years it just worked fine. So I wonder why this 
> > functionality was removed?
> > It is a really BAD case where Pkeys are altered, but I think it would be 
> > wise to "refresh" these tables on heavy sweep.
> 
> We discussed how and when port tables refresh should be done 
> just few days ago in this thread. My impression was that we 
> are "in sync" about this.
> 
> > In general it seems OpenSM has lost its "heavy sweep" concept. Now it 
> > does not refresh the fabric setup even on heavy sweep.
> 
> Not on each heavy sweep, but it does when it is needed or when 
> data could change. I don't think the concept was changed, 
> just optimized. Let's just look at the numbers:
> 
> $ time ./opensm/opensm -e -f ./osm.log -o ...
> SUBNET UP
> Exiting SM
> 
> real    0m7.995s
> user    0m4.488s
> sys     0m6.072s
> 
> $ time ./opensm/opensm -e -f ./osm.log -o --qos ...
> SUBNET UP
> Exiting SM
> 
> real    0m22.521s
> user    0m10.921s
> sys     0m17.173s
> 
> 
> These are simulated runs (with ibsim); the fabric is ~1300 nodes.
> 
> The difference there is the '--qos' flag, so OpenSM skips the SL2VL 
> and VLArb update in the first run and does it in the second - 
> sweep times are 8 against 22 seconds.
> 
> > This is assuming a "perfect" HW and software, and I really think 
> > we should have preserved that capability.
> 
> What about an option? Now with subn->need_update flag (which 
> always enforces updates) it is trivial to implement.
> 
> > Note that a "heavy sweep" does not happen unless something changed or 
> > trapped.
> 
> Yes, for example some port was connected/disconnected, some 
> node rebooted, etc. OpenSM starts a huge heavy sweep, it takes 
> a while, SA is not responsive most of the time, TCP connections 
> over IPoIB time out, applications fail. This is production 
> experience... :(
> 
> Sasha
> 


From vlad at lists.openfabrics.org  Sun Jul 29 02:49:55 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 29 Jul 2007 02:49:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070729-0200 daily build status
Message-ID: <20070729094955.64799E608FE@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From dotanb at dev.mellanox.co.il  Sun Jul 29 03:26:36 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 29 Jul 2007 13:26:36 +0300
Subject: [ofa-general] Re: I think that there is a resource leak in the core
	file mad_rmpp.c
In-Reply-To: <46A4F5A2.2020508@ichips.intel.com>
References: <46A45A8C.2090800@dev.mellanox.co.il>
	<46A4F5A2.2020508@ichips.intel.com>
Message-ID: <46AC6B5C.6020702@dev.mellanox.co.il>

Sean Hefty wrote:
>> I reviewed the file mad_rmpp.c and it seems that there is a leak of 
>> the Address Handle.
>> The AH that is being created in the function "alloc_response_msg" is 
>> never being destroyed.
>
> The AH is destroyed in ib_rmpp_send_handler().
I checked this issue again and added prints for:
  the AH handle that is created in alloc_response_msg()
  the AH handle that is destroyed in ib_rmpp_send_handler()

It seems that the AHs created in alloc_response_msg() (when it is 
called from ack_ds_ack()) are not destroyed: the rmpp_type of this 
packet is IB_MGMT_RMPP_TYPE_ACK, so the AH destroy is not executed.

We saw this issue in our daily regression during osmtest; here are 
the reproduction instructions:

Start OpenSM (it needs to stay online while the following commands 
run)
execute: # osmtest -f c
execute: # osmtest -f a

During the test, the AHs mentioned above will be created.


thanks
Dotan


From dotanb at dev.mellanox.co.il  Sun Jul 29 03:32:54 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 29 Jul 2007 13:32:54 +0300
Subject: [ofa-general] [PATCH-v2] perftest: Fix deleting the utils files in
	"make clean"
Message-ID: <200707291332.54356.dotanb@dev.mellanox.co.il>

Fix deleting the utils files in "make clean".

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

diff --git a/Makefile b/Makefile
index 812de14..8042531 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,6 @@ ${TESTS}: LOADLIBES += -libverbs -lrdmacm
 ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
 	$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@
 clean:
-	$(foreach fname,${TESTS}, rm -f ib_${fname})
-	rm -f ${UTILS}	
+	$(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname})
 .DELETE_ON_ERROR:
 .PHONY: all clean


From hal.rosenstock at gmail.com  Sun Jul 29 04:09:43 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 29 Jul 2007 07:09:43 -0400
Subject: [ofa-general] Re: ibutils/ibdm building
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com>
References: <f0e08f230707271232m11e651d3kd7356b972c26e228@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com>
Message-ID: <f0e08f230707290409k3af3f6f0j36b863e70d055345@mail.gmail.com>

On 7/28/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> Hi Hal
>
> This does not reproduce on my fresh clone.
> So I am not sure what is going on.

Your latest change fixed this:

commit a39dcb1db6f0559ca95df2948675fda0222f1532
Author: Eitan Zahavi <eitan at sw097.lab.mtl.com>
Date:   Sat Jul 28 23:44:28 2007 +0300

   The target i sno ibis.o but ibis.lo

diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am
index 27f0652..2018c66 100644
--- a/ibis/src/Makefile.am
+++ b/ibis/src/Makefile.am
@@ -89,7 +89,7 @@ SWIG_IFC_FILES= $(srcdir)/ibbbm.i \
       $(srcdir)/ibsm.i \
       $(srcdir)/ibvs.i

-ibis.o: $(srcdir)/git_version.h
+ibis.lo: $(srcdir)/git_version.h

 # track latest GIT version for this tree:
 $(srcdir)/git_version.h:  @MAINTAINER_MODE_TRUE@ FORCE


> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > Sent: Friday, July 27, 2007 10:33 PM
> > To: Eitan Zahavi
> > Cc: OpenFabrics General
> > Subject: ibutils/ibdm building
> >
> > Hi again Eitan,
> >
> > When building ibutils/ibdm (master), I get the following error:
> >
> > if g++ -DHAVE_CONFIG_H -I. -I. -I..  -I../datamodel   -g -O2 -MT
> > osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o
> > osm_check.o osm_check.cpp; \ then mv -f ".deps/osm_check.Tpo"
> > ".deps/osm_check.Po"; else rm -f ".deps/osm_check.Tpo"; exit 1; fi
> > osm_check.cpp: In function `int main(int, char**)':
> > osm_check.cpp:428: `R_OK' undeclared (first use this function)
> > osm_check.cpp:428: (Each undeclared identifier is reported
> > only once for each
> >   function it appears in.)
> > osm_check.cpp:428: `access' undeclared (first use this function)
> >
> > Thanks.
> >
> > -- Hal
> >
>


From mst at dev.mellanox.co.il  Sun Jul 29 04:17:34 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 29 Jul 2007 14:17:34 +0300
Subject: [ofa-general] Re: [PATCH-v2] perftest: Fix deleting the utils files
	in "make clean"
In-Reply-To: <200707291332.54356.dotanb@dev.mellanox.co.il>
References: <200707291332.54356.dotanb@dev.mellanox.co.il>
Message-ID: <20070729111734.GA16915@mellanox.co.il>

thanks, applied

Quoting Dotan Barak <dotanb at dev.mellanox.co.il>:
Subject: [PATCH-v2] perftest: Fix deleting the utils files in "make clean"

Fix deleting the utils files in "make clean".

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

diff --git a/Makefile b/Makefile
index 812de14..8042531 100644
--- a/Makefile
+++ b/Makefile
@@ -15,7 +15,6 @@ ${TESTS}: LOADLIBES += -libverbs -lrdmacm
 ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS}
 	$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@
 clean:
-	$(foreach fname,${TESTS}, rm -f ib_${fname})
-	rm -f ${UTILS}	
+	$(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname})
 .DELETE_ON_ERROR:
 .PHONY: all clean

-- 
MST


From hal.rosenstock at gmail.com  Sun Jul 29 04:27:31 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 29 Jul 2007 07:27:31 -0400
Subject: [ofa-general] [PATCH] mad.c: Fix memory leak in switch handling and
	improve error handling
Message-ID: <f0e08f230707290427x4ab37716t76c7f9692eed5b1c@mail.gmail.com>

mad.c: Fix memory leak in switch handling and improve error handling

Signed-off-by: Suresh Shelvapille <suri at baymicrosystems.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index bc547f1..6310dc3 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1847,11 +1847,6 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
       struct ib_mad_agent_private *mad_agent;
       int port_num;

-       response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
-       if (!response)
-               printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory "
-                      "for response buffer\n");
-
       mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id;
       qp_info = mad_list->mad_queue->qp_info;
       dequeue_mad(mad_list);
@@ -1879,6 +1874,13 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
       if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num))
               goto out;

+       response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
+       if (!response) {
+               printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory "
+                      "for response buffer\n");
+               goto out;
+       }
+
       if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)
               port_num = wc->port_num;
       else
@@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
                       response->header.recv_wc.recv_buf.mad = &response->mad.mad;
                       response->header.recv_wc.recv_buf.grh = &response->grh;

-                       if (!agent_send_response(&response->mad.mad,
-                                                &response->grh, wc,
-                                                port_priv->device,
-                                                smi_get_fwd_port(&recv->mad.smp),
-                                                qp_info->qp->qp_num))
-                               response = NULL;
+                       agent_send_response(&response->mad.mad,
+                                           &response->grh, wc,
+                                           port_priv->device,
+                                           smi_get_fwd_port(&recv->mad.smp),
+                                           qp_info->qp->qp_num);

                       goto out;
               }
@@ -1930,15 +1931,6 @@ local:
       if (port_priv->device->process_mad) {
               int ret;

-               if (!response) {
-                       printk(KERN_ERR PFX "No memory for response MAD\n");
-                       /*
-                        * Is it better to assume that
-                        * it wouldn't be processed ?
-                        */
-                       goto out;
-               }
-
               ret = port_priv->device->process_mad(port_priv->device, 0,
                                                    port_priv->port_num,
                                                    wc, &recv->grh,
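
The ownership pattern this patch moves toward can be sketched outside the
kernel. This is a userspace illustration only, assuming (as the diff
suggests) that the send helper copies the response rather than taking
ownership of it. msg_alloc, msg_free, send_copy and handle_request are
hypothetical names, not functions in mad.c:

```c
#include <stdlib.h>
#include <string.h>

struct msg { char payload[64]; };

static int live_allocs;              /* outstanding allocations, for leak checks */

static struct msg *msg_alloc(void)
{
    struct msg *m = calloc(1, sizeof(*m));
    if (m)
        live_allocs++;
    return m;
}

static void msg_free(struct msg *m)
{
    if (m) {
        live_allocs--;
        free(m);
    }
}

/* Stand-in for a send helper that copies the message instead of
 * taking ownership of it. */
static int send_copy(const struct msg *m, struct msg *wire)
{
    memcpy(wire, m, sizeof(*wire));
    return 0;
}

/* Returns 0 if the request was answered, -1 if it was dropped. */
int handle_request(const char *input, struct msg *wire)
{
    struct msg *response = NULL;
    int ret = -1;

    if (!input || input[0] == '\0')  /* validate before allocating */
        goto out;

    response = msg_alloc();
    if (!response)
        goto out;                    /* stop here instead of carrying NULL along */

    strncpy(response->payload, input, sizeof(response->payload) - 1);
    ret = send_copy(response, wire);
    /* 'response' is not set to NULL here: the helper copied it,
     * so this function still owns the buffer and must free it. */
out:
    msg_free(response);
    return ret;
}

int outstanding_allocations(void)
{
    return live_allocs;
}
```

With a single exit label that frees whatever was allocated, every path
(failed validation, failed allocation, successful send) leaves
outstanding_allocations() at zero.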


From naim.hammond at gmail.com  Sun Jul 29 04:29:56 2007
From: naim.hammond at gmail.com (Naim Hammond)
Date: Sun, 29 Jul 2007 14:29:56 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <20070727083438.GA9912@mellanox.co.il>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
Message-ID: <bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>

Where is the list of supported distributions?
Where can I see it?

Thanks

On 7/27/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Yoshiaki Tamura <tamura at osrg.net>:
> > Subject: OFED-1.2 on x86 debian
> >
> > Hi.
> >
> > I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
> > Although build_env.sh seems to work on debian,
> > it fails compiling both kernel modules and user land tools by rpmbuild.
> >
> > Is OFED-1.2 tested on debian or totally unsupported?
>
> It's not on a list of supported platforms, but I think we do builds
> on ubuntu so debian should work too. Vlad?
>
> --
> MST
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From vlad at mellanox.co.il  Sun Jul 29 05:48:09 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 29 Jul 2007 15:48:09 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
	<bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>

Hi,

See OFED-1.2/docs/OFED_release_notes.txt:

 
1.2 Supported Platforms and Operating Systems
---------------------------------------------
  o   CPU architectures:
        - x86_64
        - x86
        - ppc64
        - ia64

  o   Linux Operating Systems:
        - RedHat EL4 up3: 2.6.9-34.ELsmp
        - RedHat EL4 up4: 2.6.9-42.ELsmp
        - RedHat EL4 up5: 2.6.9-55.ELsmp
        - RedHat EL5: 2.6.18-8.el5
        - SLES10: 2.6.16.21-0.8-smp
        - kernel.org: 2.6.20.x
        - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)

OFED-1.2 uses an RPM environment for installation. You can't use the OFED
installation script as is on Debian.

Regards,
Vladimir

 
From: Naim Hammond [mailto:naim.hammond at gmail.com] 
Sent: Sunday, July 29, 2007 2:30 PM
To: Michael S. Tsirkin
Cc: Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org
Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian

 
Where is the list of supported distributions?
Where can I see it?

Thanks

On 7/27/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:

> Quoting Yoshiaki Tamura <tamura at osrg.net>:
> Subject: OFED-1.2 on x86 debian
>
> Hi.
>
> I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
> Although build_env.sh seems to work on debian,
> it fails compiling both kernel modules and user land tools by rpmbuild.
>
> Is OFED-1.2 tested on debian or totally unsupported?

It's not on a list of supported platforms, but I think we do builds 
on ubuntu so debian should work too. Vlad?

--
MST

From mst at dev.mellanox.co.il  Sun Jul 29 07:04:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 29 Jul 2007 17:04:31 +0300
Subject: [ofa-general] RFC: SRC API
Message-ID: <20070729140431.GG16915@mellanox.co.il>

Hello!
Here is an API proposal for support of the SRC
(scalable reliable connected) protocol extension in libibverbs.

This adds APIs to:
- manage SRC domains

- share SRC domains between processes,
  by means of creating a 1:1 association
  between an SRC domain and a file.

Notes:
- The file is specified by means of a file descriptor,
  this makes it possible for the user to manage file
  creation/deletion in the most flexible manner
  (e.g. tmpfile can be used).

- I envision implementing this sharing mechanism in kernel by means
  of a per-device tree, with inode as a key and domain object
  as a value.
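
To make the sharing idea above concrete, here is a small userspace sketch
of a registry keyed by the file's identity. It is only an illustration
under stated assumptions: it uses a flat array instead of the per-device
tree mentioned above, keys on the (st_dev, st_ino) pair from fstat(), and
share_domain, get_shared_domain and unshare_domain are hypothetical
stand-ins, not the proposed ibv_* calls. Reference counting is not
modeled.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

#define MAX_DOMAINS 16

struct domain { int id; };           /* stand-in for an SRC domain object */

struct entry {
    dev_t dev;                       /* device holding the file */
    ino_t ino;                       /* inode number: the lookup key */
    struct domain *dom;              /* NULL marks a free slot */
};

static struct entry table[MAX_DOMAINS];

/* Associate 'dom' with the file behind 'fd'. Returns 0 on success,
 * -1 if the fd is bad, the table is full, or either side is already
 * associated (the mapping is kept 1:1). */
int share_domain(struct domain *dom, int fd)
{
    struct stat st;
    size_t i, slot = MAX_DOMAINS;

    if (!dom || fstat(fd, &st) != 0)
        return -1;
    for (i = 0; i < MAX_DOMAINS; i++) {
        if (!table[i].dom) {
            if (slot == MAX_DOMAINS)
                slot = i;            /* remember the first free slot */
        } else if (table[i].dom == dom ||
                   (table[i].dev == st.st_dev &&
                    table[i].ino == st.st_ino)) {
            return -1;
        }
    }
    if (slot == MAX_DOMAINS)
        return -1;
    table[slot].dev = st.st_dev;
    table[slot].ino = st.st_ino;
    table[slot].dom = dom;
    return 0;
}

/* Look up the domain associated with the file behind 'fd', or NULL. */
struct domain *get_shared_domain(int fd)
{
    struct stat st;
    size_t i;

    if (fstat(fd, &st) != 0)
        return NULL;
    for (i = 0; i < MAX_DOMAINS; i++)
        if (table[i].dom && table[i].dev == st.st_dev &&
            table[i].ino == st.st_ino)
            return table[i].dom;
    return NULL;
}

/* Drop the association; later lookups through the file fail. */
int unshare_domain(struct domain *dom)
{
    size_t i;

    for (i = 0; dom && i < MAX_DOMAINS; i++)
        if (table[i].dom == dom) {
            table[i].dom = NULL;
            return 0;
        }
    return -1;
}
```

Because the key is the inode and not the descriptor number, any
descriptor that refers to the same file finds the same domain, which is
what lets unrelated processes rendezvous on a file of their choosing.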
 
Please comment.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index acc1b82..503f201 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -370,6 +370,11 @@ struct ibv_ah_attr {
 	uint8_t			port_num;
 };
 
+struct ibv_src_domain {
+	struct ibv_context     *context;
+	uint32_t		handle;
+};
+
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
 	IBV_SRQ_LIMIT	= 1 << 1
@@ -389,7 +394,8 @@ struct ibv_srq_init_attr {
 enum ibv_qp_type {
 	IBV_QPT_RC = 2,
 	IBV_QPT_UC,
-	IBV_QPT_UD
+	IBV_QPT_UD,
+	IBV_QPT_SRC
 };
 
 struct ibv_qp_cap {
@@ -408,6 +414,7 @@ struct ibv_qp_init_attr {
 	struct ibv_qp_cap	cap;
 	enum ibv_qp_type	qp_type;
 	int			sq_sig_all;
+	struct ibv_src_domain  *src_domain;
 };
 
 enum ibv_qp_attr_mask {
@@ -526,6 +533,7 @@ struct ibv_send_wr {
 			uint32_t	remote_qkey;
 		} ud;
 	} wr;
+	uint32_t		src_remote_srq_num;
 };
 
 struct ibv_recv_wr {
@@ -553,6 +561,10 @@ struct ibv_srq {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	uint32_t		src_srq_num;
+	struct ibv_src_domain  *src_domain;
+	struct ibv_cq	       *src_cq;
 };
 
 struct ibv_qp {
@@ -570,6 +582,8 @@ struct ibv_qp {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	struct ibv_src_domain  *src_domain;
 };
 
 struct ibv_comp_channel {
@@ -912,6 +926,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
 			       struct ibv_srq_init_attr *srq_init_attr);
 
 /**
+ * ibv_create_src_srq - Creates a SRQ associated with the specified protection
+ *   domain and src domain.
+ * @pd: The protection domain associated with the SRQ.
+ * @src_domain: The SRC domain associated with the SRQ.
+ * @src_cq: CQ to report completions for SRC packets on.
+ *
+ * @srq_init_attr: A list of initial attributes required to create the SRQ.
+ *
+ * srq_attr->max_wr and srq_attr->max_sge are read to determine the
+ * requested size of the SRQ, and set to the actual values allocated
+ * on return.  If ibv_create_src_srq() succeeds, then max_wr and max_sge
+ * will always be at least as large as the requested values.
+ */
+struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd,
+				   struct ibv_src_domain *src_domain,
+				   struct ibv_cq *src_cq,
+			           struct ibv_srq_init_attr *srq_init_attr);
+
+/**
  * ibv_modify_srq - Modifies the attributes for the specified SRQ.
  * @srq: The SRQ to modify.
  * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
@@ -1074,6 +1107,44 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
  */
 int ibv_fork_init(void);
 
+/**
+ * ibv_get_new_src_domain - Allocate an SRC domain
+ * Returns a reference to an SRC domain.
+ * Use ibv_put_src_domain to free the reference.
+ * @context: Device context
+ */
+struct ibv_src_domain *ibv_get_new_src_domain(struct ibv_context *context);
+
+/**
+ * ibv_share_src_domain - associate the src domain with a file.
+ * Establishes a connection between an SRC domain object and a file descriptor.
+ *
+ * @d: SRC domain to share
+ * @fd: descriptor for a file to associate with the domain
+ */
+int ibv_share_src_domain(struct ibv_src_domain *d, int fd);
+
+/**
+ * ibv_unshare_src_domain - disassociate the src domain from a file.
+ * Subsequent calls to ibv_get_shared_src_domain will fail.
+ * @d: SRC domain to unshare
+ */
+int ibv_unshare_src_domain(struct ibv_src_domain *d);
+
+/**
+ * ibv_get_shared_src_domain - get a reference to a shared SRC domain
+ * @context: Device context
+ * @fd: descriptor for a file associated with the domain
+ */
+struct ibv_src_domain *ibv_get_shared_src_domain(struct ibv_context *context, int fd);
+
+/**
+ * ibv_put_src_domain - destroy a reference to an SRC domain
+ * If this is the last reference, destroys the domain.
+ * @d: reference to SRC domain to put
+ */
+int ibv_put_src_domain(struct ibv_src_domain *d);
+
 END_C_DECLS
 
 #  undef __attribute_const


-- 
MST


From naim.hammond at gmail.com  Sun Jul 29 08:01:00 2007
From: naim.hammond at gmail.com (Naim Hammond)
Date: Sun, 29 Jul 2007 18:01:00 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
	<bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>
Message-ID: <bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>

So OFED is not supported on any free distribution.

You did mention it is tested on Ubuntu, but you weren't sure. Is it?

N

On 7/29/07, Vladimir Sokolovsky <vlad at mellanox.co.il> wrote:
>
>  Hi,
>
> See OFED-1.2/docs/OFED_release_notes.txt:
>
>
>
> 1.2 Supported Platforms and Operating Systems
>
> ---------------------------------------------
>
>   o   CPU architectures:
>
>         - x86_64
>
>         - x86
>
>         - ppc64
>
>         - ia64
>
>
>
>   o   Linux Operating Systems:
>
>         - RedHat EL4 up3: 2.6.9-34.ELsmp
>
>         - RedHat EL4 up4: 2.6.9-42.ELsmp
>
>         - RedHat EL4 up5: 2.6.9-55.ELsmp
>
>         - RedHat EL5: 2.6.18-8.el5
>
>         - SLES10: 2.6.16.21-0.8-smp
>
>         - kernel.org: 2.6.20.x
>
>         - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)
>
>
> OFED-1.2 uses an RPM environment for installation. You can't use the OFED
> installation script as is on Debian.
>
> Regards,
>
> Vladimir
>
> From: Naim Hammond [mailto:naim.hammond at gmail.com]
> Sent: Sunday, July 29, 2007 2:30 PM
> To: Michael S. Tsirkin
> Cc: Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org
> Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian
>
> Where is the list of supported distributions?
> Where can I see it?
>
> Thanks
>
> On 7/27/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Yoshiaki Tamura <tamura at osrg.net>:
> > Subject: OFED-1.2 on x86 debian
> >
> > Hi.
> >
> > I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
> > Although build_env.sh seems to work on debian,
> > it fails compiling both kernel modules and user land tools by rpmbuild.
> >
> > Is OFED-1.2 tested on debian or totally unsupported?
>
> It's not on a list of supported platforms, but I think we do builds
> on ubuntu so debian should work too. Vlad?
>
> --
> MST

From sashak at voltaire.com  Sun Jul 29 09:47:47 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 29 Jul 2007 19:47:47 +0300
Subject: [ofa-general] Re: pkey.sim.tcl
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com>
References: <20070724005153.GD11674@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
	<20070724170432.GZ27878@sashak.voltaire.com>
	<20070724215441.GA25264@sashak.voltaire.com>
	<20070725202418.GD31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
	<20070726224133.GC2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com>
	<20070728215527.GH12351@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com>
Message-ID: <20070729164747.GB29844@sashak.voltaire.com>

Hi Eitan,

On 12:11 Sun 29 Jul     , Eitan Zahavi wrote:
> Regarding the test:
> Once I know the exact condition causing a full re-sweep I will use
> it in the test.
> In OFED 1.2 it was enough to set one switch ChangeBit to force a full
> reconfiguration.

You can set PortState to INIT on the port whose pkey table was modified, and
this will trigger an update.

> Regarding incremental flow in general:
> 1. Yes - it is good.

Ok.

> 2. But we must make sure it is robust enough that we do not lose some
> nodes or functionality
>     under extreme cases of reboot or HW errors.

Test reports are welcome (as usual). I'm testing too.

> 3. We should have a way to force a full sweep without killing the SM:
> As the size of the clusters grows there is a growing chance that "soft
> errors" will hit the devices.
> Most of the device memory is guarded and would be auto detected if
> affected. 
> However I think it is wise to allow for the user to force full
> reconfiguration without making the SM "go away".

We can add a config option to force an update unconditionally. Would that be
sufficient?

> Regarding OpenSM does not respond to SA queries during sweep:
> It is due to the fact there is no "double buffer" for the internal DB.
> So whenever the SM starts a sweep the SA will see an "empty" DB. 

The specific problem was due to the fact that the OpenSM DB is in a "locked"
state most of the time during a sweep, and the SA is waiting to get access.

> The solution for that problem may be having a "previous" DB during
> sweeps. 
> I suspect using that approach will also enable a fine grain incremental
> capability too.

I agree, this could be a good direction too, as well as some others like
more granular locking, etc.
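
The "previous DB" idea can be sketched roughly as follows. This is a
simplified illustration and not OpenSM code: struct subnet_db and the
sweep_* and db_read names are hypothetical, and it assumes C11 atomics.
Readers always see the last completed sweep while the next one is built
in the other buffer:

```c
#include <stdatomic.h>

struct subnet_db {
    unsigned generation;             /* number of the sweep that built it */
    unsigned node_count;             /* stand-in for the real contents */
};

static struct subnet_db buffers[2];
static _Atomic(struct subnet_db *) published = &buffers[0];

/* Reader side (e.g. an SA query): never blocks on a running sweep. */
const struct subnet_db *db_read(void)
{
    return atomic_load(&published);
}

/* Sweep side: pick the buffer that is not published and reset it. */
struct subnet_db *sweep_begin(void)
{
    struct subnet_db *cur = atomic_load(&published);
    struct subnet_db *back =
        (cur == &buffers[0]) ? &buffers[1] : &buffers[0];

    back->generation = cur->generation + 1;
    back->node_count = 0;
    return back;
}

/* Record one discovered node in the buffer being built. */
void sweep_add_node(struct subnet_db *back)
{
    back->node_count++;
}

/* Make the freshly built buffer the one readers see. */
void sweep_commit(struct subnet_db *back)
{
    atomic_store(&published, back);
}
```

One caveat of this sketch: a reader that still holds the old pointer when
the following sweep begins would see that buffer being rebuilt, so a real
implementation would need reference counts or some grace period before
reusing a buffer.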

Sasha


From mst at dev.mellanox.co.il  Sun Jul 29 09:46:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 29 Jul 2007 19:46:32 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
	<bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>
	<bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>
Message-ID: <20070729164632.GA28212@mellanox.co.il>

> Quoting Naim Hammond <naim.hammond at gmail.com>:
> Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian
> 
> So OFED is not supported on any free distribution.
> 
> You did mention it is tested on Ubuntu, but you weren't sure. is it?

Note that support in this context means whether OFED was tested on the distro,
not whether it builds/works.


-- 
MST


From sashak at voltaire.com  Sun Jul 29 09:52:20 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 29 Jul 2007 19:52:20 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com>
References: <f0e08f230707241325p1b7b04cap5bf37459bc7cb33f@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
	<20070725001847.GG25264@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com>
	<20070725194856.GB31582@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>
	<20070727010707.GR2472@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com>
	<20070728221540.GI12351@sashak.voltaire.com>
	<6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com>
Message-ID: <20070729165220.GC29844@sashak.voltaire.com>

On 12:00 Sun 29 Jul     , Eitan Zahavi wrote:
> > On 14:27 Fri 27 Jul     , Eitan Zahavi wrote:
> > > The problem I have with back-to-back plug is that it is a fatal case
> > > if found in a case where there was no use of this plug.
> > > So we will need some sort of user input if it is OK or not.
> > 
> > Ok, and let's add cl_qmap_count() check there.
> Not following you.

With a back-to-back network, cl_qmap_count(&sw_guid_tbl) should be 0.

> > > The case of moving a port in the middle of a sweep can be easily
> > > detected if instead of reporting an error a second check of the
> > > original DR where the same GUID was found is performed...
> >
> > Do you mean to resend NodeInfo request to the original location?
> > Assuming so, I guess it should be instead of second heavy
> > sweep, and it is a good idea. The only small downside of this
> > I can see is potential timeouts (and discovery slowdown). But
> > anyway it is much better than fatal error. Thanks!
>
> So we are in line with this one.
> Instead of changing the order of things we could generate a list of DRs
> that are to be re-scanned during drop-mgr and then abort if they really
> are duplicates.

I will need to look at code...

Sasha


From jgunthorpe at obsidianresearch.com  Sun Jul 29 10:32:32 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Sun, 29 Jul 2007 11:32:32 -0600
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46AC509B.6020206@voltaire.com>
References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
	<20070726181132.GO19768@obsidianresearch.com>
	<46AC509B.6020206@voltaire.com>
Message-ID: <20070729173232.GA14867@obsidianresearch.com>

On Sun, Jul 29, 2007 at 11:32:27AM +0300, Or Gerlitz wrote:
> Jason Gunthorpe wrote:
> >The existing trap monitoring in Sean's module covers about 90% of the
> >cases in IB when you need to invalidate a PR, the last 10% will need
> >something new :(
> 
> Let it be. Do you think the last 10% should not prevent the local sa 
> merge to the upstream code?

Only that the design philosophy should accommodate an eventual solution
to this remaining problem. Mainly, as I've said, I'd like to see more
stuff in userspace and a simple well defined kernel component.

What about you? Your arguments about linking arp lifetime to PR cache
lifetime are trying to address this very same 10%.

Jason


From swise at opengridcomputing.com  Sun Jul 29 13:12:26 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 29 Jul 2007 15:12:26 -0500
Subject: [ofa-general] [PATCH 2.6.23 1/2] Make the iw_cxgb3 module parameters
	writable.
Message-ID: <20070729201226.31659.85900.stgit@dell3.ogc.int>


Make the iw_cxgb3 module parameters writable.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 9574088..fa95dce 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -63,37 +63,37 @@ static char *states[] = {
 };
 
 static int ep_timeout_secs = 10;
-module_param(ep_timeout_secs, int, 0444);
+module_param(ep_timeout_secs, int, 0644);
 MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
 				   "in seconds (default=10)");
 
 static int mpa_rev = 1;
-module_param(mpa_rev, int, 0444);
+module_param(mpa_rev, int, 0644);
 MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
 		 "1 is spec compliant. (default=1)");
 
 static int markers_enabled = 0;
-module_param(markers_enabled, int, 0444);
+module_param(markers_enabled, int, 0644);
 MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
 
 static int crc_enabled = 1;
-module_param(crc_enabled, int, 0444);
+module_param(crc_enabled, int, 0644);
 MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
 
 static int rcv_win = 256 * 1024;
-module_param(rcv_win, int, 0444);
+module_param(rcv_win, int, 0644);
 MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=256)");
 
 static int snd_win = 32 * 1024;
-module_param(snd_win, int, 0444);
+module_param(snd_win, int, 0644);
 MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=32KB)");
 
 static unsigned int nocong = 0;
-module_param(nocong, uint, 0444);
+module_param(nocong, uint, 0644);
 MODULE_PARM_DESC(nocong, "Turn off congestion control (default=0)");
 
 static unsigned int cong_flavor = 1;
-module_param(cong_flavor, uint, 0444);
+module_param(cong_flavor, uint, 0644);
 MODULE_PARM_DESC(cong_flavor, "TCP Congestion control flavor (default=1)");
 
 static void process_work(struct work_struct *work);


From swise at opengridcomputing.com  Sun Jul 29 13:12:29 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 29 Jul 2007 15:12:29 -0500
Subject: [ofa-general] [PATCH 2.6.23 2/2] iw_cxgb3: Always call low level
	send function via cxgb3_ofld_send().
In-Reply-To: <20070729201226.31659.85900.stgit@dell3.ogc.int>
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
Message-ID: <20070729201228.31659.26300.stgit@dell3.ogc.int>


iw_cxgb3: Always call low level send function via cxgb3_ofld_send().

Avoids deadlocks.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index fa95dce..20ba372 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -139,7 +139,7 @@ static void release_tid(struct t3cdev *t
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
 	skb->priority = CPL_PRIORITY_SETUP;
-	tdev->send(tdev, skb);
+	cxgb3_ofld_send(tdev, skb);
 	return;
 }
 
@@ -161,7 +161,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep)
 	req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE);
 
 	skb->priority = CPL_PRIORITY_DATA;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -183,7 +183,7 @@ int iwch_resume_tid(struct iwch_ep *ep)
 	req->val = 0;
 
 	skb->priority = CPL_PRIORITY_DATA;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -784,7 +784,7 @@ static int update_rx_credits(struct iwch
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
 	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
 	skb->priority = CPL_PRIORITY_ACK;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return credits;
 }
 
@@ -1152,7 +1152,7 @@ static int listen_start(struct iwch_list
 	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
 
 	skb->priority = 1;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -1186,7 +1186,7 @@ static int listen_stop(struct iwch_liste
 	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -1264,7 +1264,7 @@ static void reject_cr(struct t3cdev *tde
 		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
 		rpl->opt2 = 0;
 		rpl->rsvd = rpl->opt2;
-		tdev->send(tdev, skb);
+		cxgb3_ofld_send(tdev, skb);
 	}
 }
 
@@ -1557,7 +1557,7 @@ static int peer_abort(struct t3cdev *tde
 	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
 	rpl->cmd = CPL_ABORT_NO_RST;
-	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	cxgb3_ofld_send(ep->com.tdev, rpl_skb);
 	if (state != ABORTING) {
 		state_set(&ep->com, DEAD);
 		release_ep_resources(ep);


From swise at opengridcomputing.com  Sun Jul 29 13:34:37 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 29 Jul 2007 15:34:37 -0500
Subject: [ofa-general] [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the
 host TCP port space.
Message-ID: <46ACF9DD.1010509@opengridcomputing.com>

RDMA experts,

I'd like input on the patch below.  iWARP devices that support both 
native stack TCP and iwarp connections on the same interface need the 
fix below or some similar enhancement to the rdma cm.  This is a bug in 
the ofed-1.2 RDMA-CM code as it stands.  I propose we fix this for 
ofed-1.2.1 or ofed-1.3.

Here is the issue:

Consider an MPI cluster running mvapich2, where MPI/Sockets jobs run 
concurrently with MPI/RDMA jobs.  Without the patch below, it is possible 
for MPI/Sockets processes to mistakenly get incoming RDMA connections and 
vice versa.  The way mvapich2 works is that the ranks all bind and listen 
to a random port (retrying new random ports if the bind fails with "in 
use").  Once they get a free port and bind/listen, they advertise that 
port to the peers to do connection setup.  Currently the MPI/RDMA 
processes can end up binding/listening to the _same_ port number as the 
MPI/Sockets processes running over the native TCP stack.  This is because 
native stack TCP and the rdma cm's RDMA_PS_TCP port space are separate 
port spaces, so the same port number can be handed out twice.  If this 
happens, a connection request can be delivered to the wrong listener.

The correct solution in my mind is to use the host stack's TCP port 
space for _all_ RDMA_PS_TCP port allocations.   The patch below is a 
minimal delta to unify the port spaces by using the kernel stack to 
bind ports.  This is done by allocating a kernel socket and binding to 
the appropriate local addr/port.  It also allows the kernel stack to 
pick ephemeral ports by virtue of just passing in port 0 on the kernel 
bind operation.
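
The same reservation idea can be tried from userspace with ordinary
sockets. The sketch below is only an analogy to what the patch does in
the kernel with sock_create_kern() and bind: it binds a TCP socket on the
loopback address, lets the stack choose an ephemeral port when port 0 is
passed, and holds the port for as long as the descriptor stays open.
reserve_tcp_port is a hypothetical helper name:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind a TCP socket to 127.0.0.1:'port' (host byte order); port 0 asks
 * the stack for an ephemeral port. On success returns the socket fd and
 * stores the bound port in *bound_port. On failure returns -1 with
 * errno set by the failing call. */
int reserve_tcp_port(unsigned short port, unsigned short *bound_port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) != 0) {
        int saved = errno;           /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

While the first descriptor is open, a second call asking for the same
port is expected to fail with EADDRINUSE, which is the property that
keeps two listeners from ending up on one port number.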

I'd like to discuss this with the RDMA folks first and iron out an 
agreement on how this should be implemented, then widen the audience to 
lkml/netdev, with a goal of inclusion in 2.6.23 and ofed-1.2.1 or 1.3.

Thanks,

Steve.


-------- Original Message --------
Subject: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP 
port space.
Date: Sun, 29 Jul 2007 15:17:04 -0500
From: Steve Wise <swise at opengridcomputing.com>
To: swise at opengridcomputing.com


RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

This is needed for iwarp providers that support native and rdma
connections over the same interface.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

  drivers/infiniband/core/cma.c |   27 ++++++++++++++++++++++++++-
  1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
  	struct rdma_cm_id	id;

  	struct rdma_bind_list	*bind_list;
+	struct socket		*sock;
  	struct hlist_node	node;
  	struct list_head	list;
  	struct list_head	listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
  		kfree(bind_list);
  	}
  	mutex_unlock(&lock);
+	if (id_priv->sock)
+		sock_release(id_priv->sock);
  }

  void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
  	return 0;
  }

+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+	int ret;
+	struct socket *sock;
+
+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ret)
+		return ret;
+	ret = sock->ops->bind(sock,
+			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+			  ip_addr_size(&id_priv->id.route.addr.src_addr));
+	if (ret) {
+		sock_release(sock);
+		return ret;
+	}
+	id_priv->sock = sock;
+	return 0;	
+}
+
  static int cma_get_port(struct rdma_id_private *id_priv)
  {
  	struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
  		break;
  	case RDMA_PS_TCP:
  		ps = &tcp_ps;
+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+		if (ret)
+			goto out;
  		break;
  	case RDMA_PS_UDP:
  		ps = &udp_ps;
@@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
  	else
  		ret = cma_use_port(ps, id_priv);
  	mutex_unlock(&lock);
-
+out:
  	return ret;
  }


From lypfw at pavarini.com  Sun Jul 29 14:16:32 2007
From: lypfw at pavarini.com (Cook Robin)
Date: Sun, 29 Jul 2007 21:16:32 +0000
Subject: [ofa-general] e-mail
Message-ID: <46AD03B0.7030001@pavarini.com>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: e-mail.pdf
Type: application/pdf
Size: 25842 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070729/c196e6c8/attachment.pdf>

From tamura at osrg.net  Sun Jul 29 17:30:32 2007
From: tamura at osrg.net (Yoshiaki Tamura)
Date: Mon, 30 Jul 2007 09:30:32 +0900
Subject: [ofa-general] OFED-1.2 on x86 debian
In-Reply-To: <46AA2663.4060709@hp.com>
References: <46A97850.2030607@osrg.net> <46AA2663.4060709@hp.com>
Message-ID: <46AD3128.1020009@osrg.net>

 > Michael S. Tsirkin wrote:
 >>> Quoting Yoshiaki Tamura <tamura at osrg.net>:
 >>> Subject: OFED-1.2 on x86 debian
 >>>
 >>> Hi.
 >>>
 >>> I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
 >>> Although build_env.sh seems to work on debian,
 >>> it fails compiling both kernel modules and user land tools by rpmbuild.
 >>>
 >>> Is OFED-1.2 tested on debian or totally unsupported?
 >>
 >> It's not on a list of supported platforms, but I think we do builds
 >> on ubuntu so debian should work too. Vlad?

For some components it seems to work, but not all of them.

 > I have been trying to make it work here on Ubuntu (Debian rebuild) 7.04.
 >
 > Had to hack build_env.sh a little to get it to ignore some of the
 > dependency checking (done by package name, which is not portable across
 > distros).

I removed the gcc and zlib dependency checks to build on debian etch.
I could compile the basic user land packages, but building dapl failed:
rpmbuild couldn't find dat.conf.

 > When I tried to do that with ia64 Debian I was directed towards some tar files
 > of the mods rather than the install.sh stuff.  I don't have the pointers at my
 > fingertips, but would assume they remain in the list archives.
 >
 > rick jones

Maybe this page?
http://www.openfabrics.org/builds/

Thanks for your comments.

Yoshi


From kliteyn at mellanox.co.il  Sun Jul 29 21:05:11 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 30 Jul 2007 07:05:11 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-30:normal completion
Message-ID: <MTLEXCH01LSAQV6uPDV00000212@mtlexch01.mtl.com>

OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM rev = Fri_Jul_27_07:38:19_2007 [7284d020ea232b253331faf52c950626cf330aab]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
 
 
Total=520  Pass=467  Fail=53
 
 
Pass:
39 Stability IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
12 OsmTest IS3-loop.topo

Failures:
39 Pkey IS1-16.topo
13 Pkey IS3-128.topo
1 OsmTest IS3-loop.topo


From naim.hammond at gmail.com  Sun Jul 29 23:13:51 2007
From: naim.hammond at gmail.com (Naim Hammond)
Date: Mon, 30 Jul 2007 09:13:51 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <20070729164632.GA28212@mellanox.co.il>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
	<bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>
	<bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>
	<20070729164632.GA28212@mellanox.co.il>
Message-ID: <bf621e460707292313m4d96eb2cq3353dafe56adcfe7@mail.gmail.com>

On 7/29/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
>
> > Quoting Naim Hammond <naim.hammond at gmail.com>:
> > You did mention it is tested on Ubuntu, but you weren't sure. is it?
>
> Note that support in this context means whether OFED was tested on the
> distro,
> not whether it builds/works.


I'm sorry that I don't understand your answer.
What exactly do you mean "OFED was tested" if it does not build, nor works,
on this or that distribution?

Please explain.

N

From vlad at lists.openfabrics.org  Mon Jul 30 01:40:35 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 30 Jul 2007 01:40:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070730-0100 daily build status
Message-ID: <20070730084035.6DFB6E60863@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6

Failed:


From glebn at voltaire.com  Mon Jul 30 01:50:16 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 11:50:16 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070729140431.GG16915@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
Message-ID: <20070730085016.GG4434@minantech.com>

On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote:
> Hello!
> Here is an API proposal for support of the SRC
> (scalable reliable connected) protocol extension in libibverbs.
> 
> This adds APIs to:
> - manage SRC domains
> 
> - share SRC domains between processes,
>   by means of creating a 1:1 association
>   between an SRC domain and a file.
> 
> Notes:
> - The file is specified by means of a file descriptor,
>   this makes it possible for the user to manage file
>   creation/deletion in the most flexible manner
>   (e.g. tmpfile can be used).
> 
> - I envision implementing this sharing mechanism in kernel by means
>   of a per-device tree, with inode as a key and domain object
>   as a value.
>  
> Please comment.
Can you provide a pseudo code of an application using this API?
Especially QP sharing part.

> 
> Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>
> 
> ---
> 
> diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
> index acc1b82..503f201 100644
> --- a/include/infiniband/verbs.h
> +++ b/include/infiniband/verbs.h
> @@ -370,6 +370,11 @@ struct ibv_ah_attr {
>  	uint8_t			port_num;
>  };
>  
> +struct ibv_src_domain {
> +	struct ibv_context     *context;
> +	uint32_t		handle;
> +};
> +
>  enum ibv_srq_attr_mask {
>  	IBV_SRQ_MAX_WR	= 1 << 0,
>  	IBV_SRQ_LIMIT	= 1 << 1
> @@ -389,7 +394,8 @@ struct ibv_srq_init_attr {
>  enum ibv_qp_type {
>  	IBV_QPT_RC = 2,
>  	IBV_QPT_UC,
> -	IBV_QPT_UD
> +	IBV_QPT_UD,
> +	IBV_QPT_SRC
>  };
>  
>  struct ibv_qp_cap {
> @@ -408,6 +414,7 @@ struct ibv_qp_init_attr {
>  	struct ibv_qp_cap	cap;
>  	enum ibv_qp_type	qp_type;
>  	int			sq_sig_all;
> +	struct ibv_src_domain  *src_domain;
>  };
>  
>  enum ibv_qp_attr_mask {
> @@ -526,6 +533,7 @@ struct ibv_send_wr {
>  			uint32_t	remote_qkey;
>  		} ud;
>  	} wr;
> +	uint32_t		src_remote_srq_num;
>  };
>  
>  struct ibv_recv_wr {
> @@ -553,6 +561,10 @@ struct ibv_srq {
>  	pthread_mutex_t		mutex;
>  	pthread_cond_t		cond;
>  	uint32_t		events_completed;
> +
> +	uint32_t		src_srq_num;
> +	struct ibv_src_domain  *src_domain;
> +	struct ibv_cq	       *src_cq;
>  };
>  
>  struct ibv_qp {
> @@ -570,6 +582,8 @@ struct ibv_qp {
>  	pthread_mutex_t		mutex;
>  	pthread_cond_t		cond;
>  	uint32_t		events_completed;
> +
> +	struct ibv_src_domain  *src_domain;
>  };
>  
>  struct ibv_comp_channel {
> @@ -912,6 +926,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
>  			       struct ibv_srq_init_attr *srq_init_attr);
>  
>  /**
> + * ibv_create_src_srq - Creates a SRQ associated with the specified protection
> + *   domain and src domain.
> + * @pd: The protection domain associated with the SRQ.
> + * @src_domain: The SRC domain associated with the SRQ.
> + * @src_cq: CQ to report completions for SRC packets on.
> + *
> + * @srq_init_attr: A list of initial attributes required to create the SRQ.
> + *
> + * srq_attr->max_wr and srq_attr->max_sge are read to determine the
> + * requested size of the SRQ, and set to the actual values allocated
> + * on return.  If ibv_create_src_srq() succeeds, then max_wr and max_sge
> + * will always be at least as large as the requested values.
> + */
> +struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd,
> +				   struct ibv_src_domain *src_domain,
> +				   struct ibv_cq *src_cq,
> +			           struct ibv_srq_init_attr *srq_init_attr);
> +
> +/**
>   * ibv_modify_srq - Modifies the attributes for the specified SRQ.
>   * @srq: The SRQ to modify.
>   * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
> @@ -1074,6 +1107,44 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
>   */
>  int ibv_fork_init(void);
>  
> +/**
> + * ibv_get_new_src_domain - Allocate an SRC domain
> + * Returns a reference to an SRC domain.
> + * Use ibv_put_src_domain to free the reference.
> + * @context: Device context
> + */
> +struct ibv_src_domain *ibv_get_new_src_domain(struct ibv_context *context);
> +
> +/**
> + * ibv_share_src_domain - associate the src domain with a file.
> + * Establishes a connection between an SRC domain object and a file descriptor.
> + *
> + * @d: SRC domain to share
> + * @fd: descriptor for a file to associate with the domain
> + */
> +int ibv_share_src_domain(struct ibv_src_domain *d, int fd);
> +
> +/**
> + * ibv_unshare_src_domain - disassociate the src domain from a file.
> + * Subsequent calls to ibv_get_shared_src_domain will fail.
> + * @d: SRC domain to unshare
> + */
> +int ibv_unshare_src_domain(struct ibv_src_domain *d);
> +
> +/**
> + * ibv_get_shared_src_domain - get a reference to a shared SRC domain
> + * @context: Device context
> + * @fd: descriptor for a file associated with the domain
> + */
> +struct ibv_src_domain *ibv_get_shared_src_domain(struct ibv_context *context, int fd);
> +
> +/**
> + * ibv_put_src_domain - destroy a reference to an SRC domain
> + * If this is the last reference, destroys the domain.
> + * @d: reference to SRC domain to put
> + */
> +int ibv_put_src_domain(struct ibv_src_domain *d);
> +
>  END_C_DECLS
>  
>  #  undef __attribute_const
> 
> 
> 
> -- 
> MST
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

--
			Gleb.


From mst at dev.mellanox.co.il  Mon Jul 30 01:52:21 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 11:52:21 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730085016.GG4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
Message-ID: <20070730085221.GF9963@mellanox.co.il>

> On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote:
> > Hello!
> > Here is an API proposal for support of the SRC
> > (scalable reliable connected) protocol extension in libibverbs.
> > 
> > This adds APIs to:
> > - manage SRC domains
> > 
> > - share SRC domains between processes,
> >   by means of creating a 1:1 association
> >   between an SRC domain and a file.
> > 
> > Notes:
> > - The file is specified by means of a file descriptor,
> >   this makes it possible for the user to manage file
> >   creation/deletion in the most flexible manner
> >   (e.g. tmpfile can be used).
> > 
> > - I envision implementing this sharing mechanism in kernel by means
> >   of a per-device tree, with inode as a key and domain object
> >   as a value.
> >  
> > Please comment.
> Can you provide a pseudo code of an application using this API?
> Especially QP sharing part.

There's no QP sharing here.
You mean SRC domain sharing?

-- 
MST


From glebn at voltaire.com  Mon Jul 30 01:54:09 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 11:54:09 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730085221.GF9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730085221.GF9963@mellanox.co.il>
Message-ID: <20070730085409.GH4434@minantech.com>

On Mon, Jul 30, 2007 at 11:52:21AM +0300, Michael S. Tsirkin wrote:
> > On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote:
> > > Hello!
> > > Here is an API proposal for support of the SRC
> > > (scalable reliable connected) protocol extension in libibverbs.
> > > 
> > > This adds APIs to:
> > > - manage SRC domains
> > > 
> > > - share SRC domains between processes,
> > >   by means of creating a 1:1 association
> > >   between an SRC domain and a file.
> > > 
> > > Notes:
> > > - The file is specified by means of a file descriptor,
> > >   this makes it possible for the user to manage file
> > >   creation/deletion in the most flexible manner
> > >   (e.g. tmpfile can be used).
> > > 
> > > - I envision implementing this sharing mechanism in kernel by means
> > >   of a per-device tree, with inode as a key and domain object
> > >   as a value.
> > >  
> > > Please comment.
> > Can you provide a pseudo code of an application using this API?
> > Especially QP sharing part.
> 
> There's no QP sharing here.
> You mean SRC domain sharing?
> 
Yes. Sorry.

--
			Gleb.


From mst at dev.mellanox.co.il  Mon Jul 30 02:01:40 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 12:01:40 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730085016.GG4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
Message-ID: <20070730090140.GG9963@mellanox.co.il>

Some code examples:
	/* create a domain and share it: */

	struct ibv_src_domain * d = ibv_get_new_src_domain(ctx);
	int fd = open(path, O_CREAT | O_RDWR, mode);
	ibv_share_src_domain(d, fd);

	/* in another process, get a reference to the shared domain: */

	int fd = open(path, O_CREAT | O_RDWR, mode);
	struct ibv_src_domain * d = ibv_get_shared_src_domain(ctx, fd);

	/* once done: */
	ibv_put_src_domain(d);

Note: when all users do put, domain is destroyed.
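
The get/put lifetime and the file-keyed lookup can be modelled in plain C 
without any verbs calls.  The sketch below is a toy model only: every toy_* 
name is invented for this illustration, a linked list stands in for the 
per-device tree, and the (device, inode) pair from fstat() is assumed to be 
the key.  It is not the proposed libibverbs implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy stand-in for an SRC domain; NOT the proposed struct ibv_src_domain. */
struct toy_domain {
	dev_t dev;			/* (dev, ino) of the file it is shared through */
	ino_t ino;
	int shared;			/* non-zero once associated with a file */
	int refs;			/* outstanding references */
	struct toy_domain *next;	/* link in the list of shared domains */
};

static struct toy_domain *toy_shared;	/* stands in for the per-device tree */
static int toy_live;			/* domains not yet destroyed */

/* Like ibv_get_new_src_domain(): a new private domain with one reference. */
static struct toy_domain *toy_get_new(void)
{
	struct toy_domain *d = calloc(1, sizeof *d);

	if (d) {
		d->refs = 1;
		toy_live++;
	}
	return d;
}

/* Look up the shared domain keyed by the file behind fd, or NULL. */
static struct toy_domain *toy_find(int fd)
{
	struct toy_domain *d;
	struct stat st;

	if (fstat(fd, &st) < 0)
		return NULL;
	for (d = toy_shared; d; d = d->next)
		if (d->dev == st.st_dev && d->ino == st.st_ino)
			return d;
	return NULL;
}

/* Like ibv_share_src_domain(): 1:1 association between domain and file. */
static int toy_share(struct toy_domain *d, int fd)
{
	struct stat st;

	if (d->shared || fstat(fd, &st) < 0 || toy_find(fd))
		return -1;
	d->dev = st.st_dev;
	d->ino = st.st_ino;
	d->shared = 1;
	d->next = toy_shared;
	toy_shared = d;
	return 0;
}

/* Like ibv_get_shared_src_domain(): take another reference via the file. */
static struct toy_domain *toy_get_shared(int fd)
{
	struct toy_domain *d = toy_find(fd);

	if (d)
		d->refs++;
	return d;
}

/* Like ibv_put_src_domain(): drop a reference; the last put destroys. */
static int toy_put(struct toy_domain *d)
{
	struct toy_domain **p;

	assert(d->refs > 0);
	if (--d->refs > 0)
		return 0;
	for (p = &toy_shared; *p; p = &(*p)->next)
		if (*p == d) {
			*p = d->next;	/* unlink from the shared list */
			break;
		}
	free(d);
	toy_live--;
	return 0;
}

/* Usage: open(path, O_CREAT | O_RDWR, 0600) twice, toy_share() through one
 * descriptor, toy_get_shared() through the other, then toy_put() both. */
```

Keying on the inode rather than on the descriptor number means two 
independent open() calls on the same file reach the same domain.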

-- 
MST


From vlad at mellanox.co.il  Mon Jul 30 02:04:00 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 30 Jul 2007 12:04:00 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
In-Reply-To: <bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>
References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il>
	<bf621e460707290429u47f7df33p80e2038548be28b2@mail.gmail.com>
	<6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com>
	<bf621e460707290801w5fbbe296i50314761e700ca7@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0CB0@mtlexch01.mtl.com>

OFED-1.2 should work on Fedora C6, CentOS 4.4, 4.5, 5.0.

 
Regards,

Vladimir

 
From: Naim Hammond [mailto:naim.hammond at gmail.com] 
Sent: Sunday, July 29, 2007 6:01 PM
To: Vladimir Sokolovsky
Cc: Michael S. Tsirkin; Yoshiaki Tamura; openib-general at openib.org
Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian

 
So OFED is not supported on any free distribution.

You did mention it is tested on Ubuntu, but you weren't sure. is it?

N

On 7/29/07, Vladimir Sokolovsky <vlad at mellanox.co.il> wrote:

Hi,

See OFED-1.2/docs/OFED_release_notes.txt:

 
1.2 Supported Platforms and Operating Systems
---------------------------------------------
  o   CPU architectures:
        - x86_64
        - x86
        - ppc64
        - ia64

  o   Linux Operating Systems:
        - RedHat EL4 up3: 2.6.9-34.ELsmp
        - RedHat EL4 up4: 2.6.9-42.ELsmp
        - RedHat EL4 up5: 2.6.9-55.ELsmp
        - RedHat EL5: 2.6.18-8.el5
        - SLES10: 2.6.16.21-0.8-smp
        - kernel.org: 2.6.20.x
        - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)

OFED-1.2 uses an RPM environment for installation. You can't use the OFED
installation script as is on Debian.

 
Regards,

Vladimir

 
From: Naim Hammond [mailto:naim.hammond at gmail.com] 
Sent: Sunday, July 29, 2007 2:30 PM
To: Michael S. Tsirkin
Cc: Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org
Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian

 
Where is the list of supported distributions?
Where can I see it?

Thanks

On 7/27/07, Michael S. Tsirkin < mst at dev.mellanox.co.il
<mailto:mst at dev.mellanox.co.il> > wrote:

> Quoting Yoshiaki Tamura < tamura at osrg.net <mailto:tamura at osrg.net> >:
> Subject: OFED-1.2 on x86 debian
>
> Hi.
>
> I'm trying to install OFED-1.2 on x86 (32bit) debian machine.
> Although build_env.sh seems to work on debian, 
> it fails compiling both kernel modules and user land tools by
rpmbuild.
>
> Is OFED-1.2 tested on debian or totally unsupported?

It's not on a list of supported platforms, but I think we do builds 
on ubuntu so debian should work too. Vlad?

--
MST
_______________________________________________
general mailing list
general at lists.openfabrics.org 
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

 

From glebn at voltaire.com  Mon Jul 30 02:06:00 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 12:06:00 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730090140.GG9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730090140.GG9963@mellanox.co.il>
Message-ID: <20070730090600.GI4434@minantech.com>

On Mon, Jul 30, 2007 at 12:01:40PM +0300, Michael S. Tsirkin wrote:
> Some code examples:
> 	/* create a domain and share it: */
> 
> 	struct ibv_src_domain * d = ibv_get_new_src_domain(ctx);
> 	int fd = open(path, O_CREAT | O_RDWR, mode);
> 	ibv_share_src_domain(d, fd);
> 
> 	/* get a reference to a shared domain: */
> 
> 	int fd = open(path, O_CREAT | O_RDWR, mode);
> 	struct ibv_src_domain * d = ibv_get_shared_src_domain(ctx, fd);
> 
> 	/* once done: */
> 	ibv_put_src_domain(d);
> 
> Note: when all users do put, domain is destroyed.
> 
OK. I am more interested in how SRC is connected to a QP in different processes:
how will a process be able to do sends to different processes through one QP, etc.

--
			Gleb.


From ogerlitz at voltaire.com  Mon Jul 30 02:08:21 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 30 Jul 2007 12:08:21 +0300
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <20070729173232.GA14867@obsidianresearch.com>
References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
	<46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com>
	<46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com>
	<46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
	<20070726181132.GO19768@obsidianresearch.com>
	<46AC509B.6020206@voltaire.com>
	<20070729173232.GA14867@obsidianresearch.com>
Message-ID: <46ADAA85.8070106@voltaire.com>

Jason Gunthorpe wrote:
> On Sun, Jul 29, 2007 at 11:32:27AM +0300, Or Gerlitz wrote:
>> Jason Gunthorpe wrote:

>>> The existing trap monitoring in Sean's module covers about 90% of the
>>> cases in IB when you need to invalidate a PR, the last 10% will need
>>> something new :(

>> Let it be. Do you think the last 10% should not prevent the local sa 
>> merge to the upstream code?

> Only that the design philosophy should accommodate an eventual solution
> to this remaining problem. Mainly, as I've said, I'd like to see more
> stuff in userspace and a simple well defined kernel component.

> What about you? Your arguments about linking arp lifetime to PR cache
> lifetime are trying to address this very same 10%.

Indeed. The argument I was trying to make is that arp cache invalidation 
requires IPoIB PR cache invalidation. This handles 100% of the cases, 
including the 10% not covered by doing cache invalidation based only on 
IB events such as port up / sm lid change / sm reregister / etc.

So far, my approach has not been accepted as is by Sean and you (Roland, 
Michael - it would be nice to get your say here), so I have to see what 
other design is possible here.

Or.


From mst at dev.mellanox.co.il  Mon Jul 30 02:10:20 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 12:10:20 +0300
Subject: [ofa-general] Re: OFED-1.2 on x86 debian
Message-ID: <20070730091020.GH9963@mellanox.co.il>


Quoting Michael S. Tsirkin <mst at dev.mellanox.co.il>:
Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian

> Quoting Naim Hammond <naim.hammond at gmail.com>:
> Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian
> 
> On 7/29/07, Michael S. Tsirkin <mst at dev.mellanox.co.il> wrote:
> 
>     > Quoting Naim Hammond <naim.hammond at gmail.com>:
>     > You did mention it is tested on Ubuntu, but you weren't sure. is it?
> 
>     Note that support in this context means whether OFED was tested on the
>     distro, not whether it builds/works.
> 
> 
> I'm sorry that I don't understand your answer.
> What exactly do you mean "OFED was tested" if it does not build, nor works, on
> this or that distribution?
> 
> Please explain.
> 
> N

The reason some things in OFED might not work on a given distro is that
no one volunteered to test this distro.

However:

- Maintainers for some packages use distros outside the list of
  supported platforms. These will build and work,
  but no one has compiled a specific list - one will have to ask
  around, and OFED packaging might not work, so one might need
  to install packages individually.

- We do care about portability.
  If someone is interested enough to test things out on a given distro,
  and report issues, and work with maintainers on fixing things,
  you will find that people will be happy to help.

Hope this helps,
     
-- 
MST


From mst at dev.mellanox.co.il  Mon Jul 30 02:16:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 12:16:39 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730085016.GG4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
Message-ID: <20070730091639.GI9963@mellanox.co.il>

More code examples:

Create an SRC QP, part of SRC domain:

	attr.qp_type = IBV_QPT_SRC;
	attr.src_domain = d;
	qp = ibv_create_qp(pd, &attr);

Given remote SRQ number, send data to this SRQ over an SRC QP:

	wr.src_remote_srq_num = src_remote_srq_num;
	ibv_post_send(qp, &wr, &bad_wr);

Note: SRQ number needs to be exchanged as part of CM private data
      or some other protocol.
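
One way the SRQ number exchange could look is sketched below: the number is 
packed into a small private-data buffer in network byte order and unpacked 
on the other side.  The 4-byte layout and the src_priv_* names are 
assumptions made for this illustration, not part of the proposed API; with 
librdmacm such a buffer would typically be carried in the private_data 
field of struct rdma_conn_param.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical private-data layout: one 32-bit SRQ number, big-endian. */
#define SRC_PRIV_LEN 4

/* Pack srq_num into buf.  Returns the number of bytes written,
 * or 0 if the buffer is too small. */
static size_t src_priv_pack(void *buf, size_t len, uint32_t srq_num)
{
	uint32_t be = htonl(srq_num);

	assert(buf != NULL);
	if (len < SRC_PRIV_LEN)
		return 0;
	memcpy(buf, &be, SRC_PRIV_LEN);
	return SRC_PRIV_LEN;
}

/* Unpack the SRQ number received from the peer.
 * Returns 0 on success, -1 if the data is too short. */
static int src_priv_unpack(const void *buf, size_t len, uint32_t *srq_num)
{
	uint32_t be;

	if (len < SRC_PRIV_LEN)
		return -1;
	memcpy(&be, buf, SRC_PRIV_LEN);
	*srq_num = ntohl(be);
	return 0;
}
```

The unpacked value is what the sender would place in 
wr.src_remote_srq_num before posting the send.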

-- 
MST


From vlad at lists.openfabrics.org  Mon Jul 30 02:49:35 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 30 Jul 2007 02:49:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070730-0200 daily build status
Message-ID: <20070730094936.111EAE60921@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From glebn at voltaire.com  Mon Jul 30 03:41:17 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 13:41:17 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730091639.GI9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
Message-ID: <20070730104117.GJ4434@minantech.com>

On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote:
> More code examples:
> 
> Create an SRC QP, part of SRC domain:
> 
> 	attr.qp_type = IBV_QPT_SRC;
> 	attr.src_domain = d;
> 	qp = ibv_create_qp(pd, &attr);
> 
> Given remote SRQ number, send data to this SRQ over an SRC QP:
> 
> 	wr.src_remote_srq_num = src_remote_srq_num;
> 	ib_post_send(qp, &wr);
> 
> Note: SRQ number needs to be exchanged as part of CM private data
>       or some other protocol.
> 
You are too brief. I can come up with one-liners based on the API by
myself. I am trying to understand how sharing of SRC between processes
will work, and your example doesn't show this. Can I connect the same
SRC to different QPs? If yes, can I send a packet to any SRQ connected to
the SRC through any QP connected to the same SRC?  If yes, how is this
different from having regular QPs?

--
			Gleb.


From ishai at mellanox.co.il  Mon Jul 30 04:03:40 2007
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 30 Jul 2007 14:03:40 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730104117.GJ4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>

 Gleb, 

I'm attaching a presentation that explains how we can use SRC in MPI.
(You need PowerPoint to view it.)

Comments are welcome.

Enjoy
Ishai

-------------- next part --------------
A non-text attachment was scrubbed...
Name: SRC-2.ppt
Type: application/vnd.ms-powerpoint
Size: 66560 bytes
Desc: SRC-2.ppt
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070730/d1ea9cdb/attachment.ppt>

From mst at dev.mellanox.co.il  Mon Jul 30 04:21:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 14:21:30 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730104117.GJ4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
Message-ID: <20070730112130.GJ9963@mellanox.co.il>

> Quoting Gleb Natapov <glebn at voltaire.com>:
> Subject: Re: [ofa-general] RFC: SRC API
> 
> On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote:
> > More code examples:
> > 
> > Create an SRC QP, part of SRC domain:
> > 
> > 	attr.qp_type = IBV_QPT_SRC;
> > 	attr.src_domain = d;
> > 	qp = ibv_create_qp(pd, &attr);
> > 
> > Given remote SRQ number, send data to this SRQ over an SRC QP:
> > 
> > 	wr.src_remote_srq_num = src_remote_srq_num;
> > 	ib_post_send(qp, &wr);
> > 
> > Note: SRQ number needs to be exchanged as part of CM private data
> >       or some other protocol.
> > 
> You are too brief. I can come up with one-liners based on the API by
> myself. I am trying to understand how sharing of SRC between processes
> will work, and your example doesn't show this.

It seems what you are missing is what SRC is, not how to use the API.
I'll have a working example when I get closer to implementation.
For now you'll have to look up Dror's preso if you want to
understand what SRC is.

> Can I connect the same
> SRC to different QPs? If yes, can I send a packet to any SRQ connected to
> the SRC through any QP connected to the same SRC?

Yes to both.

> If yes, how is this
> different from having regular QPs?

With regular QP you can only send to a single SRQ.
But again, look at Dror's preso.

-- 
MST


From glebn at voltaire.com  Mon Jul 30 04:27:22 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 14:27:22 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
Message-ID: <20070730112722.GK4434@minantech.com>

On Mon, Jul 30, 2007 at 02:03:40PM +0300, Ishai Rabinovitz wrote:
>  Gleb, 
> 
> I'm attaching a presentation that explains how we can use SRC in MPI.
> (You need PowerPoint to view it.)
> 
> Comments are welcome.

So you propose to have separate QPs for sending and receiving? And
the receiving QP should be shared between ranks (this part is not
addressed by the proposed API, BTW). Correct?


--
			Gleb.


From ishai at mellanox.co.il  Mon Jul 30 04:29:46 2007
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 30 Jul 2007 14:29:46 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730112722.GK4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>

Yes, there will be a different QP for send and a different QP for
receive. 
There is no need for a special API to support this. You just open
several QPs and treat them the way you want.

Ishai 



From glebn at voltaire.com  Mon Jul 30 04:54:25 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 14:54:25 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730112130.GJ9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<20070730112130.GJ9963@mellanox.co.il>
Message-ID: <20070730115425.GL4434@minantech.com>

On Mon, Jul 30, 2007 at 02:21:30PM +0300, Michael S. Tsirkin wrote:
> > Quoting Gleb Natapov <glebn at voltaire.com>:
> > Subject: Re: [ofa-general] RFC: SRC API
> > 
> > On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote:
> > > More code examples:
> > > 
> > > Create an SRC QP, part of SRC domain:
> > > 
> > > 	attr.qp_type = IBV_QPT_SRC;
> > > 	attr.src_domain = d;
> > > 	qp = ibv_create_qp(pd, &attr);
> > > 
> > > Given remote SRQ number, send data to this SRQ over an SRC QP:
> > > 
> > > 	wr.src_remote_srq_num = src_remote_srq_num;
> > > 	ib_post_send(qp, &wr);
> > > 
> > > Note: SRQ number needs to be exchanged as part of CM private data
> > >       or some other protocol.
> > > 
> > You are too brief. I can come up with one-liners based on the API by
> > myself. I am trying to understand how sharing of SRC between processes
> > will work, and your example doesn't show this.
> 
> It seems what you are missing is what SRC is, not how to use the API.
So tell us. It seems I am not the only one, judging by the
presentation I got from Ishai. In this presentation he proposes to
create separate receive QPs and send QPs. Is this how it is meant to
work if the SRC domain is shared between processes? Because frankly, I don't
see how it can be used in any other way.

> I'll have a working example when I get closer to implementation.
> For now you'll have to look up Dror's preso if you want to
> understand what SRC is.
I have looked at Dror's presentation more than once. If we are talking about the
same presentation, there is not much detail there except an additional field
in the header carrying the destination SRQ number, so that HW can demux a packet
into the right SRQ.

> 
> > Can I connect the same
> > SRC to different QPs? If yes, can I send a packet to any SRQ connected to
> > the SRC through any QP connected to the same SRC?
> 
> Yes to both.
And can I attach an SRQ to an SRC domain without creating a QP? I suppose yes.

> 
> > If yes, how is this
> > different from having regular QPs?
> 
> With regular QP you can only send to a single SRQ.
> But again, look at Dror's preso.
> 
Yes, but I can use the same QP for sending and receiving (this is a Queue
Pair after all). Now I'll have to create one QP for send and another for receive.
The overall number of QPs may still be smaller, though...

--
			Gleb.


From glebn at voltaire.com  Mon Jul 30 05:00:18 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 15:00:18 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
Message-ID: <20070730120018.GM4434@minantech.com>

On Mon, Jul 30, 2007 at 02:29:46PM +0300, Ishai Rabinovitz wrote:
> Yes, there will be a different QP for send and a different QP for
> receive. 
> There is no need for a special API to support this. You just open
> several QPs and treat them the way you want.
The way it is presented in your slides, the receive QPs are in shared
memory, but if it is possible to attach an SRQ to an SRC without access
to the QP, the QP may reside in the memory of one of the ranks. By the way,
ibv_create_src_srq() gets a PD as a parameter and each process will have
its own PD, so one QP will be able to put messages into different
PDs. Is that right?

--
			Gleb.


From mst at dev.mellanox.co.il  Mon Jul 30 05:07:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 15:07:06 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730120018.GM4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
	<20070730120018.GM4434@minantech.com>
Message-ID: <20070730120706.GM9963@mellanox.co.il>

> By the way, ibv_create_src_srq() gets a PD as a parameter and each process will
> have its own PD, so one QP will be able to put messages into different
> PDs. Is that right?

Correct. That's part of the SRC extension.

-- 
MST


From ishai at mellanox.co.il  Mon Jul 30 05:06:01 2007
From: ishai at mellanox.co.il (Ishai Rabinovitz)
Date: Mon, 30 Jul 2007 15:06:01 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730120018.GM4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
	<20070730120018.GM4434@minantech.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0E0F@mtlexch01.mtl.com>

 
-----Original Message-----
From: Gleb Natapov [mailto:glebn at voltaire.com] 
Sent: Monday, July 30, 2007 15:00 PM
To: Ishai Rabinovitz
Cc: Michael S. Tsirkin; general at lists.openfabrics.org; Roland Dreier;
Pavel Shamis; Jeff Squyres; Galen Shipman; Gil Bloch;
panda at cse.ohio-state.edu
Subject: Re: [ofa-general] RFC: SRC API

On Mon, Jul 30, 2007 at 02:29:46PM +0300, Ishai Rabinovitz wrote:
> The way it is presented in your slides, the receive QPs are in shared
> memory, but if it is possible to attach an SRQ to an SRC without access
> to the QP, the QP may reside in the memory of one of the ranks.

Actually, no one accesses the receive QP and it occupies little space. I
drew it in SHM, but you can think of it as existing only in the kernel
and in the HCA.

> By the way,
> ibv_create_src_srq() gets a PD as a parameter and each process will have
> its own PD, so one QP will be able to put messages into different
> PDs. Is that right?

Yes.


Ishai


From mst at dev.mellanox.co.il  Mon Jul 30 05:10:57 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 15:10:57 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730115425.GL4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<20070730112130.GJ9963@mellanox.co.il>
	<20070730115425.GL4434@minantech.com>
Message-ID: <20070730121057.GN9963@mellanox.co.il>

> > It seems what you are missing is what SRC is, not how to use the API.
> 
> So tell us.

This calls for a separate document. From feedback from Sonoma I really assumed
people have it figured out.

Let's open a separate thread, and there I will try writing up
what SRC is from the protocol point of view.


-- 
MST


From glebn at voltaire.com  Mon Jul 30 05:11:02 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 15:11:02 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730120706.GM9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
	<20070730120018.GM4434@minantech.com>
	<20070730120706.GM9963@mellanox.co.il>
Message-ID: <20070730121102.GN4434@minantech.com>

On Mon, Jul 30, 2007 at 03:07:06PM +0300, Michael S. Tsirkin wrote:
> > By the way, ibv_create_src_srq() gets a PD as a parameter and each process will
> > have its own PD, so one QP will be able to put messages into different
> > PDs. Is that right?
> 
> Correct. That's part of the SRC extension.
> 
Are rkeys/lkeys unique across different PDs? If yes, is this required by
the spec or is it just a consequence of the implementation?

--
			Gleb.


From glebn at voltaire.com  Mon Jul 30 05:12:13 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Mon, 30 Jul 2007 15:12:13 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730121057.GN9963@mellanox.co.il>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<20070730112130.GJ9963@mellanox.co.il>
	<20070730115425.GL4434@minantech.com>
	<20070730121057.GN9963@mellanox.co.il>
Message-ID: <20070730121213.GO4434@minantech.com>

On Mon, Jul 30, 2007 at 03:10:57PM +0300, Michael S. Tsirkin wrote:
> > > It seems what you are missing is what SRC is, not how to use the API.
> > 
> > So tell us.
> 
> This calls for a separate document. From feedback from Sonoma I really assumed
> people have it figured out.
> 
> Let's open a separate thread, and there I will try writing up
> what SRC is from the protocol point of view.
> 
No problem. Start it :)

--
			Gleb.


From tziporet at dev.mellanox.co.il  Mon Jul 30 05:24:06 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 30 Jul 2007 15:24:06 +0300
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <46A798F0.5070902@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>	<46956FF9.50102@ichips.intel.com>	<46968448.2000401@ichips.intel.com>
	<46A798F0.5070902@ichips.intel.com>
Message-ID: <46ADD866.7080301@mellanox.co.il>

Arlin Davis wrote:
>
>> I would like to propose adding project directories under 
>> http://www.openfabrics.org/downloads/  where appropriate and give 
>> maintainers access. For example:
>>
> Jeff,  please add the following directories with maintainer access as 
> follow (or grant access at a maintainer group level):
>
> http://www.openfabrics.org/downloads/sdp (eitan)
SDP should be under the name of Jim Mott (jimmott), since he, not Eitan,
is the maintainer of SDP.

Tziporet


From monis at voltaire.com  Mon Jul 30 05:37:29 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:37:29 +0300
Subject: [ofa-general] [PATCH V3 0/7] net/bonding: ADD IPoIB support for the
	bonding driver
Message-ID: <46ADDB89.5030601@voltaire.com>

This patch series is the third version (see the link to V2 below) of the
suggested changes to the bonding driver so that it can support
non-ARPHRD_ETHER netdevices in its High-Availability (active-backup) mode.

The motivation is to enable the bonding driver in its HA mode to work with
the IP over InfiniBand (IPoIB) driver. With these patches I was able to enslave
IPoIB netdevices and run TCP, UDP, IP (UDP) multicast and ICMP traffic with
fail-over and fail-back working fine. The working environment was the net-2.6 git tree.

Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver, which
is used by native IB ULPs whose addressing scheme is based on IP (e.g. iSER,
SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA
for these ULPs. This holds because when the ULP is informed by the IB HW of the
failure of the current IB connection, it just needs to reconnect, and the
bonding device will then issue the IB ARP over the active IPoIB slave.

This series also includes patches to the IPoIB driver that fix some
neighboring-related issues.

There are still 2 open issues here:
1. When bonding enslaves an IPoIB device, the bonding neighbor holds a
reference to a cleanup function in the IPoIB driver. This makes it unsafe to
unload the IPoIB module while bonding neighbors still exist. So, to
avoid this race, one must unload bonding before unloading IPoIB.

2. Patch No. 7 is a workaround for a problem where gratuitous ARPs were quite
often not sent. I didn't find a better fix for this and I would
appreciate advice and comments regarding it. However, this doesn't seem to me
to be an issue related exclusively to IPoIB.

Links to earlier discussion: 

1. A discussion in netdev about bonding support for IPoIB.
http://lists.openwall.net/netdev/2006/11/30/46

2. A discussion in openfabrics regarding changes in the IPoIB that 
enable using it as a slave for bonding.
http://lists.openfabrics.org/pipermail/general/2007-March/034033.html


From monis at voltaire.com  Mon Jul 30 05:48:20 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:48:20 +0300
Subject: [ofa-general] [PATCH V3 1/7] IB/ipoib: Bound the net device to the
 ipoib_neigh structure
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDE14.7000708@voltaire.com>

IPoIB uses a two-layer neighboring scheme, such that for each struct neighbour
whose device is an ipoib one, there is a struct ipoib_neigh buddy which is
created on demand in the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour)
call.

When using the bonding driver, neighbours are created by the net stack on behalf
of the bonding (master) device. In the tx flow the bonding code gets an skb such
that skb->dev points to the master device; it changes this skb to point to the
slave device and calls the slave's hard_start_xmit function.

Under this scheme, the assumption made by ipoib_neigh_destructor, that for each
struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev)
can be cast to struct ipoib_dev_priv, is wrong.

To fix it, this patch adds a dev field to struct ipoib_neigh which is used
instead of the struct neighbour dev one when n->dev->flags has the
IFF_MASTER bit set.
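
The device selection described above can be modelled in user space roughly as
follows. This is only a sketch: the model_* types and cleanup_dev() are
hypothetical stand-ins for the kernel structures, and only the IFF_MASTER
value (0x400) is taken from the kernel headers.

```c
#include <stddef.h>

#define MODEL_IFF_MASTER 0x400	/* same value as the kernel's IFF_MASTER */

/* Hypothetical, simplified stand-ins for the kernel structures */
struct model_netdev {
	unsigned int flags;
	const char *name;
};

struct model_ipoib_neigh {
	struct model_netdev *dev;	/* the field this patch adds */
};

struct model_neighbour {
	struct model_netdev *dev;		/* may be the bonding master */
	struct model_ipoib_neigh *ipoib_neigh;	/* NULL if none was allocated */
};

/*
 * Return the IPoIB device whose private data the cleanup path should use,
 * or NULL when the neighbour belongs to a master device and has no
 * ipoib_neigh attached (nothing to clean up).
 */
static struct model_netdev *cleanup_dev(const struct model_neighbour *n)
{
	if (n->dev->flags & MODEL_IFF_MASTER)
		return n->ipoib_neigh ? n->ipoib_neigh->dev : NULL;
	return n->dev;
}
```

With a neighbour created directly on an IPoIB device, cleanup_dev() returns
n->dev; with a neighbour created on behalf of a bonding master, it returns
the slave recorded in the ipoib_neigh.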

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |    4 +++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   17 +++++++++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    2 +-
 3 files changed, 19 insertions(+), 4 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-25 14:56:13.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-25 14:57:48.095724495 +0300
@@ -328,6 +328,7 @@ struct ipoib_neigh {
 	struct sk_buff_head queue;
 
 	struct neighbour   *neighbour;
+	struct net_device *dev;
 
 	struct list_head    list;
 };
@@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip
 				     INFINIBAND_ALEN, sizeof(void *));
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh);
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
+				      struct net_device *dev);
 void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
 
 extern struct workqueue_struct *ipoib_workqueue;
Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-25 14:56:13.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-25 15:03:11.291291271 +0300
@@ -510,7 +510,7 @@ static void neigh_add_path(struct sk_buf
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 
-	neigh = ipoib_neigh_alloc(skb->dst->neighbour);
+	neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev);
 	if (!neigh) {
 		++priv->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
@@ -817,6 +817,16 @@ static void ipoib_neigh_cleanup(struct n
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
 
+	if (n->dev->flags & IFF_MASTER) {
+		/* n->dev is not an IPoIB device and we have to take priv from elsewhere */
+		neigh = *to_ipoib_neigh(n);
+		if (neigh){
+			priv = netdev_priv(neigh->dev);
+			ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n",
+				  n->dev->name);
+		} else
+			return;
+	}
 	ipoib_dbg(priv,
 		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
@@ -838,7 +848,9 @@ static void ipoib_neigh_cleanup(struct n
 		ipoib_put_ah(ah);
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
+				      struct net_device *dev)
+
 {
 	struct ipoib_neigh *neigh;
 
@@ -847,6 +859,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st
 		return NULL;
 
 	neigh->neighbour = neighbour;
+	neigh->dev = dev;
 	*to_ipoib_neigh(neighbour) = neigh;
 	skb_queue_head_init(&neigh->queue);
 	ipoib_cm_set(neigh, NULL);
Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-25 14:56:13.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-25 14:57:48.097724142 +0300
@@ -727,7 +727,7 @@ out:
 		if (skb->dst            &&
 		    skb->dst->neighbour &&
 		    !*to_ipoib_neigh(skb->dst->neighbour)) {
-			struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour);
+			struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev);
 
 			if (neigh) {
 				kref_get(&mcast->ah->ref);


From monis at voltaire.com  Mon Jul 30 05:49:43 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:49:43 +0300
Subject: [ofa-general] [PATCH V3 2/7] IB/ipoib: Verify address handle
	validity on send
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDE67.7030502@voltaire.com>

When the bonding device senses a carrier loss of its active slave, it replaces
that slave with a new one. Between the time the carrier of an IPoIB
device goes down and the time ipoib_neigh is destroyed, it is possible that the
bonding driver will send a packet on a new slave that uses an old ipoib_neigh.
This patch detects and prevents this.

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-25 14:57:48.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-25 15:02:55.525131034 +0300
@@ -685,9 +685,10 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 		} else if (neigh->ah) {
-			if (unlikely(memcmp(&neigh->dgid.raw,
+			if (unlikely((memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,
-					    sizeof(union ib_gid)))) {
+					    sizeof(union ib_gid))) ||
+						 (neigh->dev != dev))) {
 				spin_lock(&priv->lock);
 				/*
 				 * It's safe to call ipoib_put_ah() inside


From mst at dev.mellanox.co.il  Mon Jul 30 05:50:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 15:50:54 +0300
Subject: [ofa-general] Scalable reliable connection
Message-ID: <20070730125054.GO9963@mellanox.co.il>


Here's some background on what SRC is.  This is basically slide 6 in Dror's
talk, for those that missed the talk.

 * * *

SRC is an extension supported by recent Mellanox hardware
which is geared toward reducing the number of QPs
required for all-to-all communication on systems
with a high number of jobs per node.

===================================================================
Motivation:
===================================================================
Given N nodes with J jobs per node, number of QPs required
for all-to-all communication is:

With RC:
		O((N * J) ^ 2)

	Since each job out of O(N * J) jobs must create a single QP
	to communicate with each one of O(N * J) other jobs.

With SRC:
		O(N ^ 2 * J)

	This is achieved by using a single send queue (per job, out of O(N * J) jobs)
	to send data to all J jobs running on a specific node (out of O(N) nodes).
	Hardware uses a new "SRQ number" field in the packet header to
	demultiplex receive WRs and WCs to the private memory of each job.

This is a similar idea to IB RD.
Q: Why not use RD then?
A: Because no hardware supports it.
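
To make the two estimates concrete, here is a small sketch that evaluates
them for given N and J. The function names are made up for illustration, and
the counts are order-of-magnitude figures only (constant factors and
self-connections are ignored, as in the O() expressions above).

```c
/*
 * RC estimate: each of the N * J jobs creates one QP per peer job,
 * giving (N * J)^2 QPs in total.
 */
static unsigned long long rc_qp_estimate(unsigned long long nodes,
					 unsigned long long jobs_per_node)
{
	unsigned long long jobs = nodes * jobs_per_node;

	return jobs * jobs;
}

/*
 * SRC estimate: each of the N * J jobs needs one send queue per node,
 * giving N^2 * J in total.
 */
static unsigned long long src_qp_estimate(unsigned long long nodes,
					  unsigned long long jobs_per_node)
{
	return nodes * jobs_per_node * nodes;
}
```

For example, with N = 1024 and J = 16 the RC estimate is 268435456 and the
SRC estimate is 16777216, i.e. a reduction by a factor of J.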

Details:

===================================================================
Verbs extension:
===================================================================

- There is a new transport/QP type "SRC".
- There is a new object type "SRC domain"
- Each SRQ gets new (optional) attributes:
	SRC domain
	SRC SRQ number
	SRC CQ
  An SRQ must have either all 3 of these attributes or none of them

- QPs of type SRC have all the same attributes as regular RC QPs
  connected to SRQ, except that:
  A. Each SRC QP has a new required attribute "SRC domain"
  B. SRC QPs do *not* have "SRQ" attribute
  	(do not have a specific SRQ associated with them)
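
As an illustration of the "all 3 or none" rule for SRQ attributes, a check
could look roughly like this. All types and names here are hypothetical
models, not the proposed libibverbs structures, and representing "SRQ number
present" as a flag is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; not the libibverbs structures */
struct src_domain { int id; };
struct cq { int id; };

struct srq_src_attrs {
	struct src_domain *domain;	/* NULL if the SRQ is not in an SRC domain */
	bool has_srq_num;		/* assumption: presence tracked by a flag */
	uint32_t srq_num;
	struct cq *cq;			/* NULL if no SRC CQ is set */
};

/* An SRQ must carry all three SRC attributes or none of them */
static bool srq_src_attrs_valid(const struct srq_src_attrs *a)
{
	int set = (a->domain != NULL) + (a->has_srq_num ? 1 : 0) +
		  (a->cq != NULL);

	return set == 0 || set == 3;
}
```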

===================================================================
Protocol extension:
===================================================================
SRC QP behaviour: Requestor
- Post send WR for this QP type is extended with SRQ number field
  This number is sent as part of packet header
- SRC Packets follow rules for RC packets on the wire, exactly
  What is different is their handling at the responder side

SRC QP behaviour: Responder
Each incoming packet passes transport checks with respect
to the SRC QP, following RC rules, exactly.

After this, SRQ number in packet header is used to look up
a specific SRQ. SRC domain of the resulting SRQ must be equal
to SRC domain of the QP, otherwise a NAK is sent,
and QP moves to error state.

If the SRC domains match, receive WR and receive WC processing
are as follows:

- RC Send
  - Rather than using SRQ to which the QP is attached,
    SRQ is looked up by SRQ number in the packet.
    Receive WR is taken from this SRQ.
  - Completions are generated on the CQ specified in the SRQ

- RDMA/Atomic
  - Rather than using PD to which the QP is attached,
    SRQ is looked up by SRQ number in the packet.
    PD of this SRQ is used for protection checks.
===================================================================
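
The responder-side lookup described above can be sketched as a small
user-space model. Everything here (the model_* types, the linear table
lookup, the function names) is hypothetical; it also assumes that an unknown
SRQ number is handled like a domain mismatch, which the description above
does not state.

```c
#include <stddef.h>
#include <stdint.h>

enum src_rx_result { SRC_RX_OK, SRC_RX_NAK };

/* Hypothetical model of an SRQ with its SRC attributes */
struct model_srq {
	uint32_t srq_num;
	int domain_id;	/* SRC domain the SRQ belongs to */
	int cq_id;	/* CQ on which completions are generated */
	int pd_id;	/* PD used for RDMA/atomic protection checks */
};

/* Hypothetical model of a responder-side SRC QP */
struct model_src_qp {
	int domain_id;
	int in_error;
};

static const struct model_srq *lookup_srq(const struct model_srq *tbl,
					  size_t n, uint32_t srq_num)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].srq_num == srq_num)
			return &tbl[i];
	return NULL;
}

/*
 * Called after the packet has passed the RC transport checks.
 * On success *out points at the SRQ whose receive WRs, CQ and PD apply.
 * On a domain mismatch (or, by assumption, an unknown SRQ number) the
 * QP moves to the error state and a NAK is reported.
 */
static enum src_rx_result src_responder_demux(struct model_src_qp *qp,
					      const struct model_srq *tbl,
					      size_t n, uint32_t srq_num,
					      const struct model_srq **out)
{
	const struct model_srq *srq = lookup_srq(tbl, n, srq_num);

	*out = NULL;
	if (!srq || srq->domain_id != qp->domain_id) {
		qp->in_error = 1;
		return SRC_RX_NAK;
	}
	*out = srq;
	return SRC_RX_OK;
}
```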
 
-- 
MST


From monis at voltaire.com  Mon Jul 30 05:51:22 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:51:22 +0300
Subject: [ofa-general] [PATCH V3 3/7] net/bonding: Enable bonding to enslave
	non ARPHRD_ETHER
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDECA.8010605@voltaire.com>

This patch changes some of the bond netdevice attributes and functions
to be those of the active slave for the case where the enslaved device is not
of ARPHRD_ETHER type. Basically, it overrides the settings made by ether_setup(),
which are netdevice **type** dependent and hence might not be appropriate for
devices of other types. It also enforces mutual exclusion on bonding slaves
of dissimilar ether types, as was concluded in the v1 discussion.

The IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3-byte
IB QP (Queue Pair) number and the 16-byte IB port GID (Global ID) of the port this
IPoIB device is bound to. The QP is a resource created by the IB HW and the
GID is an identifier burned into the HCA (I have omitted some details here which
are not important for the bonding RFC).
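
For readers unfamiliar with the layout, here is a user-space sketch of
splitting a 20-byte IPoIB hardware address into its QPN and GID parts. The
struct and function names are made up for illustration; the sketch assumes
byte 0 is a reserved/flags byte (one of the omitted details), followed by the
3-byte QPN and the 16-byte GID at offset 4.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20	/* same length as the kernel's INFINIBAND_ALEN */

/* Hypothetical container for the two parts of the address */
struct ipoib_hw_addr_fields {
	uint32_t qpn;		/* 24-bit QP number */
	uint8_t gid[16];	/* port GID */
};

/* Split the address: byte 0 is skipped, bytes 1..3 are the QPN
 * (big-endian), bytes 4..19 are the GID. */
static void ipoib_hw_addr_split(const uint8_t ha[IPOIB_HW_ADDR_LEN],
				struct ipoib_hw_addr_fields *out)
{
	out->qpn = ((uint32_t)ha[1] << 16) |
		   ((uint32_t)ha[2] << 8) |
		   (uint32_t)ha[3];
	memcpy(out->gid, ha + 4, sizeof(out->gid));
}
```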

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
---
 drivers/net/bonding/bond_main.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 38 insertions(+)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:02:10.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-29 16:24:30.913343981 +0300
@@ -1277,6 +1277,26 @@ static int bond_compute_features(struct 
 	return 0;
 }
 
+
+static void bond_setup_by_slave(struct net_device *bond_dev,
+				struct net_device *slave_dev)
+{
+	bond_dev->hard_header	        = slave_dev->hard_header;
+	bond_dev->rebuild_header        = slave_dev->rebuild_header;
+	bond_dev->hard_header_cache	= slave_dev->hard_header_cache;
+	bond_dev->header_cache_update   = slave_dev->header_cache_update;
+	bond_dev->hard_header_parse	= slave_dev->hard_header_parse;
+
+	bond_dev->neigh_setup           = slave_dev->neigh_setup;
+
+	bond_dev->type		    = slave_dev->type;
+	bond_dev->hard_header_len   = slave_dev->hard_header_len;
+	bond_dev->addr_len	    = slave_dev->addr_len;
+
+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
+		slave_dev->addr_len);
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1351,6 +1371,24 @@ int bond_enslave(struct net_device *bond
 		goto err_undo_flags;
 	}
 
+	/* set bonding device ether type by slave - bonding netdevices are
+	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+	 * there is a need to override some of the type dependent attribs/funcs.
+	 *
+	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
+	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+	 */
+	if (bond->slave_cnt == 0) {
+		if (slave_dev->type != ARPHRD_ETHER)
+			bond_setup_by_slave(bond_dev, slave_dev);
+	} else if (bond_dev->type != slave_dev->type) {
+		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from "
+			"other slaves (%d), can not enslave it.\n", slave_dev->name,
+			slave_dev->type, bond_dev->type);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
 	if (slave_dev->set_mac_address == NULL) {
 		printk(KERN_ERR DRV_NAME
 			": %s: Error: The slave device you specified does "


From mst at dev.mellanox.co.il  Mon Jul 30 05:52:37 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 30 Jul 2007 15:52:37 +0300
Subject: [ofa-general] RFC: SRC API
In-Reply-To: <20070730121102.GN4434@minantech.com>
References: <20070729140431.GG16915@mellanox.co.il>
	<20070730085016.GG4434@minantech.com>
	<20070730091639.GI9963@mellanox.co.il>
	<20070730104117.GJ4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com>
	<20070730112722.GK4434@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com>
	<20070730120018.GM4434@minantech.com>
	<20070730120706.GM9963@mellanox.co.il>
	<20070730121102.GN4434@minantech.com>
Message-ID: <20070730125237.GP9963@mellanox.co.il>

> Quoting Gleb Natapov <glebn at voltaire.com>:
> Subject: Re: [ofa-general] RFC: SRC API
> 
> On Mon, Jul 30, 2007 at 03:07:06PM +0300, Michael S. Tsirkin wrote:
> > > By the way ibv_create_src_srq() gets PD as a parameter and each process will
> > > have its own PD, so one QP will be able to put messages to different PD
> > > domains is that right?
> > 
> > Correct. That's part of the SRC extension.
> > 
> Are rkey/lkey unique across different PDs? If yes, is this required by the
> spec, or is this just a consequence of the implementation?

Yes, but I think that this is not required by the spec.

-- 
MST


From monis at voltaire.com  Mon Jul 30 05:52:50 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:52:50 +0300
Subject: [ofa-general] [PATCH V3 4/7] net/bonding: Enable bonding to enslave
 netdevices not supporting set_mac_address()
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDF22.3090604@voltaire.com>

This patch allows enslaving netdevices that do not support the
set_mac_address() function. In that case the bond MAC address is that of
the active slave, and remote peers are notified of the MAC address
(neighbour) change by the gratuitous ARP that bonding sends when fail-over
occurs (the bonding code already does this).
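
The rule above can be restated as a small standalone sketch. This is not
the driver code: struct sketch_bond and sketch_check_mac_policy() are
hypothetical names, and only the decision logic described in this message
is modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct bonding. */
struct sketch_bond {
	int  slave_cnt;        /* slaves already attached */
	bool do_set_mac_addr;  /* true: bond sets the MAC address of its slaves */
};

/*
 * Decide whether a slave may be attached.
 * Returns 0 when it may, -EOPNOTSUPP when it may not.
 */
static int sketch_check_mac_policy(struct sketch_bond *bond,
				   bool slave_can_set_mac)
{
	if (slave_can_set_mac)
		return 0;

	if (bond->slave_cnt == 0) {
		/* The first slave decides: the bond follows the active slave's MAC. */
		bond->do_set_mac_addr = false;
		return 0;
	}

	/* A bond that already sets slave MACs cannot take this slave. */
	if (bond->do_set_mac_addr)
		return -EOPNOTSUPP;

	return 0;
}
```

In this sketch only the first slave can switch the bond to the "follow the
active slave" mode, which matches the check on slave_cnt in the patch.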

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
---
 drivers/net/bonding/bond_main.c |   88 +++++++++++++++++++++++++++-------------
 drivers/net/bonding/bonding.h   |    1 

 2 files changed, 61 insertions(+), 28 deletions(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-29 16:24:30.913343981 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-29 16:36:53.234602471 +0300
@@ -1127,6 +1127,14 @@ void bond_change_active_slave(struct bon
 		if (new_active) {
 			bond_set_slave_active_flags(new_active);
 		}
+
+		/* when bonding does not set the slave MAC address, the bond MAC
+		 * address is the one of the active slave.
+		 */
+		if (new_active && !bond->do_set_mac_addr)
+			memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
+				new_active->dev->addr_len);
+
 		bond_send_gratuitous_arp(bond);
 	}
 }
@@ -1390,13 +1398,22 @@ int bond_enslave(struct net_device *bond
 	}
 
 	if (slave_dev->set_mac_address == NULL) {
-		printk(KERN_ERR DRV_NAME
-			": %s: Error: The slave device you specified does "
-			"not support setting the MAC address. "
-			"Your kernel likely does not support slave "
-			"devices.\n", bond_dev->name);
-  		res = -EOPNOTSUPP;
-		goto err_undo_flags;
+		if (bond->slave_cnt == 0) {
+			printk(KERN_WARNING DRV_NAME
+				": %s: Warning: The first slave device you "
+				"specified does not support setting the MAC "
+				"address. This bond MAC address would be that "
+				"of the active slave.\n", bond_dev->name);
+			bond->do_set_mac_addr = 0;
+		} else if (bond->do_set_mac_addr) {
+			printk(KERN_ERR DRV_NAME
+				": %s: Error: The slave device you specified "
+				"does not support setting the MAC address, "
+				"but this bond uses this practice.\n",
+				bond_dev->name);
+			res = -EOPNOTSUPP;
+			goto err_undo_flags;
+		}
 	}
 
 	new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL);
@@ -1417,16 +1434,18 @@ int bond_enslave(struct net_device *bond
 	 */
 	memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
 
-	/*
-	 * Set slave to master's mac address.  The application already
-	 * set the master's mac address to that of the first slave
-	 */
-	memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
-	addr.sa_family = slave_dev->type;
-	res = dev_set_mac_address(slave_dev, &addr);
-	if (res) {
-		dprintk("Error %d calling set_mac_address\n", res);
-		goto err_free;
+	if (bond->do_set_mac_addr) {
+		/*
+		 * Set slave to master's mac address.  The application already
+		 * set the master's mac address to that of the first slave
+		 */
+		memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
+		addr.sa_family = slave_dev->type;
+		res = dev_set_mac_address(slave_dev, &addr);
+		if (res) {
+			dprintk("Error %d calling set_mac_address\n", res);
+			goto err_free;
+		}
 	}
 
 	res = netdev_set_master(slave_dev, bond_dev);
@@ -1651,9 +1670,11 @@ err_close:
 	dev_close(slave_dev);
 
 err_restore_mac:
-	memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 err_free:
 	kfree(new_slave);
@@ -1831,10 +1852,12 @@ int bond_release(struct net_device *bond
 	/* close slave before restoring its mac address */
 	dev_close(slave_dev);
 
-	/* restore original ("permanent") mac address */
-	memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		/* restore original ("permanent") mac address */
+		memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
@@ -1921,10 +1944,12 @@ static int bond_release_all(struct net_d
 		/* close slave before restoring its mac address */
 		dev_close(slave_dev);
 
-		/* restore original ("permanent") mac address*/
-		memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-		addr.sa_family = slave_dev->type;
-		dev_set_mac_address(slave_dev, &addr);
+		if (bond->do_set_mac_addr) {
+			/* restore original ("permanent") mac address*/
+			memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+			addr.sa_family = slave_dev->type;
+			dev_set_mac_address(slave_dev, &addr);
+		}
 
 		slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
 					   IFF_SLAVE_INACTIVE);
@@ -3961,6 +3986,10 @@ static int bond_set_mac_address(struct n
 
 	dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None"));
 
+	if (!bond->do_set_mac_addr) {
+		return -EOPNOTSUPP;
+	}
+
 	if (!is_valid_ether_addr(sa->sa_data)) {
 		return -EADDRNOTAVAIL;
 	}
@@ -4351,6 +4380,9 @@ static int bond_init(struct net_device *
 	bond_create_proc_entry(bond);
 #endif
 
+	/* set do_set_mac_addr to true on startup */
+	bond->do_set_mac_addr = 1;
+
 	list_add_tail(&bond->bond_list, &bond_dev_list);
 
 	return 0;
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-07-29 16:25:22.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h	2007-07-29 16:37:13.163056181 +0300
@@ -201,6 +201,7 @@ struct bonding {
 	struct   list_head vlan_list;
 	struct   vlan_group *vlgrp;
 	struct   packet_type arp_mon_pt;
+	s8       do_set_mac_addr;
 };
 
 /**


From monis at voltaire.com  Mon Jul 30 05:54:03 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:54:03 +0300
Subject: [ofa-general] [PATCH V3 5/7] net/bonding: Enable IP multicast for
 bonding IPoIB devices
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDF6B.1080907@voltaire.com>


Allow enslaving devices when the bonding device is not up. In the discussion
of the previous post this seemed to be the cleanest way to go, and it is not
expected to cause instabilities.

Normally, the bonding device is UP before any enslavement takes place.
Once a netdevice is UP, the network stack has it join some multicast groups
(e.g. the all-hosts group 224.0.0.1). Since ether_setup() has set the bonding
device type to ARPHRD_ETHER and the address length to ETH_ALEN, the net core
code computes a wrong multicast link address for these early joins: it calls
ip_eth_mc_map(), whereas multicast joins that take place after the enslavement
go through the ip_xxx_mc_map() that matches the bond type (e.g. ip_ib_mc_map()
when the bond type is ARPHRD_INFINIBAND).
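
To make the mismatch concrete, here is a userspace sketch of the standard
IPv4-to-Ethernet multicast mapping (01:00:5e followed by the low 23 bits of
the group address). sketch_eth_mc_map() and the SKETCH_* constants are
hypothetical names used for illustration, not the kernel helpers.

```c
#include <stdint.h>

#define SKETCH_ETH_ALEN 6   /* Ethernet link-layer address length */
#define SKETCH_IB_ALEN  20  /* IPoIB link-layer address length */

/*
 * Map an IPv4 multicast group (host byte order) to an Ethernet
 * multicast MAC: 01:00:5e plus the low 23 bits of the group.
 */
static void sketch_eth_mc_map(uint32_t group, uint8_t buf[SKETCH_ETH_ALEN])
{
	buf[0] = 0x01;
	buf[1] = 0x00;
	buf[2] = 0x5e;
	buf[3] = (group >> 16) & 0x7f;
	buf[4] = (group >> 8) & 0xff;
	buf[5] = group & 0xff;
}
```

The result is always 6 bytes long, while an IPoIB device expects a 20-byte
link-layer address, so an address computed this way cannot be right for a
bond whose slaves are IPoIB devices.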

Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
---
 drivers/net/bonding/bond_main.c  |    4 ++--
 drivers/net/bonding/bond_sysfs.c |    6 ++----
 2 files changed, 4 insertions(+), 6 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:04:50.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-25 15:06:17.175820632 +0300
@@ -1325,8 +1325,8 @@ int bond_enslave(struct net_device *bond
 
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+			" %s: master_dev is not up in bond_enslave\n", bond_dev->name);
 	}
 
 	/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-07-25 14:18:12.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-07-25 15:06:17.176820452 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
 
 	/* Quick sanity check -- is the bond interface up? */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-		       ": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: doing slave updates when interface is down.\n",
 		       bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}
 
 	/* Note:  We can't hold bond->lock here, as bond_create grabs it. */


From monis at voltaire.com  Mon Jul 30 05:54:59 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:54:59 +0300
Subject: [ofa-general] [PATCH V3 6/7] net/bonding: Handle wrong assumptions
 that slave is always an Ethernet device
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDFA3.1000100@voltaire.com>


bonding sometimes uses Ethernet constants (such as MTU and address length) that
are wrong when it enslaves non-Ethernet devices (such as InfiniBand).
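
For the MTU case the idea is to remember the slave's own MTU at enslave time
and restore it on release, instead of assuming 1500. A minimal sketch, with
hypothetical names (struct sketch_slave, sketch_enslave(), sketch_release())
that only model the bookkeeping:

```c
/* Hypothetical, reduced stand-in for struct slave. */
struct sketch_slave {
	unsigned int mtu;           /* current MTU of the slave device */
	unsigned int original_mtu;  /* MTU the device had before enslavement */
};

/* Remember the device's own MTU, then match the bond's MTU. */
static void sketch_enslave(struct sketch_slave *s, unsigned int bond_mtu)
{
	s->original_mtu = s->mtu;
	s->mtu = bond_mtu;
}

/* Restore the saved MTU rather than a hard-coded Ethernet default. */
static void sketch_release(struct sketch_slave *s)
{
	s->mtu = s->original_mtu;
}
```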

Signed-off-by: Moni Shoua <monis at voltaire.com>
---
 drivers/net/bonding/bond_main.c  |    2 +-
 drivers/net/bonding/bond_sysfs.c |   10 ++++++++--
 drivers/net/bonding/bonding.h    |    1 +
 3 files changed, 10 insertions(+), 3 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:06:17.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-25 15:33:25.012883360 +0300
@@ -1255,7 +1255,7 @@ static int bond_compute_features(struct 
 	unsigned long features = BOND_INTERSECT_FEATURES;
 	struct slave *slave;
 	struct net_device *bond_dev = bond->dev;
-	unsigned short max_hard_header_len = ETH_HLEN;
+	u16 max_hard_header_len = max((u16)ETH_HLEN, bond_dev->hard_header_len);
 	int i;
 
 	bond_for_each_slave(bond, slave, i) {
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c	2007-07-25 15:06:17.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c	2007-07-25 15:20:10.224527636 +0300
@@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru
 	char command[IFNAMSIZ + 1] = { 0, };
 	char *ifname;
 	int i, res, found, ret = count;
+	u32 original_mtu;
 	struct slave *slave;
 	struct net_device *dev = NULL;
 	struct bonding *bond = to_bond(d);
@@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru
 		}
 
 		/* Set the slave's MTU to match the bond */
+		original_mtu = dev->mtu;
 		if (dev->mtu != bond->dev->mtu) {
 			if (dev->change_mtu) {
 				res = dev->change_mtu(dev,
@@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru
 		}
 		rtnl_lock();
 		res = bond_enslave(bond->dev, dev);
+		bond_for_each_slave(bond, slave, i)
+			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0)
+				slave->original_mtu = original_mtu;
 		rtnl_unlock();
 		if (res) {
 			ret = res;
@@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru
 		bond_for_each_slave(bond, slave, i)
 			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) {
 				dev = slave->dev;
+				original_mtu = slave->original_mtu;
 				break;
 			}
 		if (dev) {
@@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru
 			}
 			/* set the slave MTU to the default */
 			if (dev->change_mtu) {
-				dev->change_mtu(dev, 1500);
+				dev->change_mtu(dev, original_mtu);
 			} else {
-				dev->mtu = 1500;
+				dev->mtu = original_mtu;
 			}
 		}
 		else {
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-07-25 15:03:32.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h	2007-07-25 15:20:10.223527810 +0300
@@ -156,6 +156,7 @@ struct slave {
 	s8     link;    /* one of BOND_LINK_XXXX */
 	s8     state;   /* one of BOND_STATE_XXXX */
 	u32    original_flags;
+	u32    original_mtu;
 	u32    link_failure_count;
 	u16    speed;
 	u8     duplex;


From hnguyen at linux.vnet.ibm.com  Mon Jul 30 06:07:47 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 30 Jul 2007 15:07:47 +0200
Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings
	"externs should be avoided in .c files"
In-Reply-To: <adahcnnvpy5.fsf@cisco.com>
References: <200707271254.51055.hnguyen@linux.vnet.ibm.com>
	<adahcnnvpy5.fsf@cisco.com>
Message-ID: <200707301507.47575.hnguyen@linux.vnet.ibm.com>

Hi Roland!
> the patch looks fine except your mailer seems to have mangled
> it... can you resend so I can apply it?
Was going to recreate this patch, but then I saw that you
probably have incorporated it (manually) in your latest git.
Just want to make sure I'm seeing it right.
Anyway, appreciate your help!
Nam


From monis at voltaire.com  Mon Jul 30 05:56:06 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:56:06 +0300
Subject: [ofa-general] [PATCH V3 7/7] net/bonding: Delay sending of
 gratuitous ARP to avoid failure
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDFE6.9000609@voltaire.com>


Delay sending the gratuitous ARP while the __LINK_STATE_LINKWATCH_PENDING bit
in the dev->state field is set. This improves the chances that the ARP packet
is actually transmitted.
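
The deferral can be pictured as a flag that the fail-over path sets and the
periodic monitor clears. The sketch below is a simplified model under that
assumption; struct sketch_arp_bond and both functions are hypothetical, and
a counter stands in for the actual ARP transmission.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the bond state involved. */
struct sketch_arp_bond {
	bool linkwatch_pending;  /* stands in for __LINK_STATE_LINKWATCH_PENDING */
	bool send_grat_arp;      /* a gratuitous ARP is owed */
	int  arps_sent;          /* counts simulated transmissions */
};

/* Fail-over path: send now if possible, otherwise remember to send later. */
static void sketch_failover(struct sketch_arp_bond *b)
{
	if (b->linkwatch_pending)
		b->send_grat_arp = true;
	else
		b->arps_sent++;
}

/* Periodic monitor: flush the owed ARP once the link event has settled. */
static void sketch_monitor_tick(struct sketch_arp_bond *b)
{
	if (b->send_grat_arp && !b->linkwatch_pending) {
		b->arps_sent++;
		b->send_grat_arp = false;
	}
}
```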

Signed-off-by: Moni Shoua <monis at voltaire.com>
---
 drivers/net/bonding/bond_main.c |   25 +++++++++++++++++++++----
 drivers/net/bonding/bonding.h   |    1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:33:25.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-26 18:42:59.296296622 +0300
@@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon
 		if (new_active && !bond->do_set_mac_addr)
 			memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
 				new_active->dev->addr_len);
-
-		bond_send_gratuitous_arp(bond);
+		if (bond->curr_active_slave &&
+			test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)) {
+			dprintk("delaying gratuitous arp on %s\n", bond->curr_active_slave->dev->name);
+			bond->send_grat_arp = 1;
+		} else {
+			bond_send_gratuitous_arp(bond);
+		}
 	}
 }
 
@@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device 
 	 * program could monitor the link itself if needed.
 	 */
 
+	if (bond->send_grat_arp) {
+		if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state))
+			dprintk("Needs to send gratuitous arp but not yet\n");
+		else {
+			dprintk("sending delayed gratuitous arp on %s\n", bond->curr_active_slave ? bond->curr_active_slave->dev->name : "NULL");
+			bond_send_gratuitous_arp(bond);
+			bond->send_grat_arp = 0;
+		}
+	}
 	read_lock(&bond->curr_slave_lock);
 	oldcurrent = bond->curr_active_slave;
 	read_unlock(&bond->curr_slave_lock);
@@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str
 	struct slave *slave = bond->curr_active_slave;
 	struct vlan_entry *vlan;
 	struct net_device *vlan_dev;
+	int i;
 
 	dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name,
 				slave ? slave->dev->name : "NULL");
@@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str
 		return;
 
 	if (bond->master_ip) {
-		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
-				  bond->master_ip, 0);
+		for (i = 0; i < 3; i++)
+			bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
+					  bond->master_ip, 0);
 	}
 
 	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
@@ -4331,6 +4347,7 @@ static int bond_init(struct net_device *
 	bond->current_arp_slave = NULL;
 	bond->primary_slave = NULL;
 	bond->dev = bond_dev;
+	bond->send_grat_arp = 0;
 	INIT_LIST_HEAD(&bond->vlan_list);
 
 	/* Initialize the device entry points */
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-07-25 15:20:10.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h	2007-07-26 18:42:43.652087660 +0300
@@ -203,6 +203,7 @@ struct bonding {
 	struct   vlan_group *vlgrp;
 	struct   packet_type arp_mon_pt;
 	s8       do_set_mac_addr;
+	int	 send_grat_arp;
 };
 
 /**


From rdreier at cisco.com  Mon Jul 30 06:54:30 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 06:54:30 -0700
Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings
	"externs should be avoided in .c files"
In-Reply-To: <200707301507.47575.hnguyen@linux.vnet.ibm.com> (Hoang-Nam
	Nguyen's message of "Mon, 30 Jul 2007 15:07:47 +0200")
References: <200707271254.51055.hnguyen@linux.vnet.ibm.com>
	<adahcnnvpy5.fsf@cisco.com>
	<200707301507.47575.hnguyen@linux.vnet.ibm.com>
Message-ID: <adabqdut2sp.fsf@cisco.com>

 > Was going to recreate this patch, but then I saw that you
 > probably have incorporated it (manually) in your latest git.
 > Just want to make sure I'm seeing it right.

Yes, I ended up doing it by hand.  Thanks.


From tziporet at dev.mellanox.co.il  Mon Jul 30 06:58:23 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 30 Jul 2007 16:58:23 +0300
Subject: [ofa-general] reminder: OFED meeting today at 9am PST
Message-ID: <46ADEE7F.2000005@mellanox.co.il>

Hi All,

We will have our bi-weekly OFED meeting today at 9am PST

Agenda:
- Status update
- Bugzilla cleanup

If you have more agenda items please send them

Tziporet


From jim at mellanox.com  Mon Jul 30 07:08:26 2007
From: jim at mellanox.com (Jim Mott)
Date: Mon, 30 Jul 2007 07:08:26 -0700
Subject: [ofa-general] [PATCH V1 1/1] sdplib: fix error return information
In-Reply-To: <46ADDFE6.9000609@voltaire.com>
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com>
Message-ID: <F57121538EA0C94F86018DDD40ADA1D16A6801@mtiexch01.mti.com>

Hi,
  I am the new maintainer of SDP and have almost figured out what that
means.  It is time to start submitting code changes for public review.
Please send comments on both the content and the style of these notices.


Fix various improper error indications returned by libsdp.so.  Most of
the problems were found by unit tests and the rest by inspection looking
for similar coding issues.
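
The convention the fixes move towards is the usual one for socket calls:
return -1 on failure and report the reason in errno, rather than returning
an error code as the result. A minimal sketch of that convention follows;
sketch_check_addr() is a hypothetical helper and not part of libsdp.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Validate a socket address the way a libc-style call reports errors:
 * 0 on success, -1 on failure with errno describing the problem.
 */
static int sketch_check_addr(const struct sockaddr *addr, socklen_t addrlen)
{
	if (addr == NULL) {
		errno = EINVAL;
		return -1;
	}
	if (addrlen < sizeof(sa_family_t)) {
		errno = EINVAL;
		return -1;
	}
	if (addr->sa_family != AF_INET && addr->sa_family != AF_INET6) {
		errno = EAFNOSUPPORT;
		return -1;
	}
	return 0;
}
```

Callers can then test the result against -1 and read errno, which is what
applications written against the plain socket API already do.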

Diff from OFED 1.2:

Index: ofa_user/src/userspace/libsdp/src/port.c
===================================================================
--- ofa_user.orig/src/userspace/libsdp/src/port.c	2007-07-16 23:51:00.000000000 +0300
+++ ofa_user/src/userspace/libsdp/src/port.c	2007-07-18 23:43:08.000000000 +0300
@@ -418,11 +418,13 @@ __sdp_sockaddr_to_sdp(
 	if ( !addr_in ) {
 		__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 					  "provided NULL input pointer\n" );
+		errno = EINVAL;
 		return -1;
 	}
 	if ( !addr_out ) {
 		__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 					  "provided NULL output pointer\n" );
+		errno = EINVAL;
 		return -1;
 	}
 
@@ -432,6 +434,7 @@ __sdp_sockaddr_to_sdp(
 			__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 						  "provided address length:%d < IPv4 length %d\n",
 						  addrlen, sizeof( struct sockaddr_in ) );
+			errno = EINVAL;
 			return -1;
 		}
 
@@ -443,6 +446,7 @@ __sdp_sockaddr_to_sdp(
 			__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 						  "provided address length:%d < IPv6 length %d\n",
 						  addrlen, IPV6_ADDR_IN_MIN_LEN );
+			errno = EINVAL;
 			return -1;
 		}
 
@@ -450,6 +454,7 @@ __sdp_sockaddr_to_sdp(
 		if ( !is_ipv4_embedded_in_ipv6( sin6 ) ) {
 			__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 						  "Given IPv6 address not an embedded IPv4\n" );
+			errno = EINVAL;
 			return -1;
 		}
 		memset( addr_out, 0, sizeof( struct sockaddr_in ) );
@@ -490,7 +495,8 @@ __sdp_sockaddr_to_sdp(
 	} else {
 		__sdp_log( 9, "Error __sdp_sockaddr_to_sdp: "
 					  "address family <%d> is unknown\n", sin->sin_family );
-		return 1;
+		errno = EAFNOSUPPORT;
+		return -1;
 	}
 
 	return 0;
@@ -1270,7 +1276,7 @@ bind(
 		if ( __sdp_sockaddr_to_sdp( my_addr, addrlen, &sdp_addr, &was_ipv6 ) ) {
 			__sdp_log( 9, "Error bind: failed to convert address:%s for SDP\n",
 						  buf );
-			ret = EADDRNOTAVAIL;
+			ret = -1;
 			goto done;
 		}
 #ifndef SDP_SUPPORTS_IPv6
@@ -1305,6 +1311,7 @@ bind(
 				__sdp_log( 9, "BIND: Failed to find common free port\n" );
 				/* We cannot bind both tcp and sdp on the same port, we will close
 				 * the sdp and continue with tcp only */
+				goto done;
 			} else {
 				/* copy the port to the tmp address */
 				set_addr_port_num( ( struct sockaddr * )&tmp_my_addr, port );
@@ -1454,7 +1461,7 @@ connect(
 				__sdp_log( 9,
 							  "Error connect: "
 							  "failed to convert address:%s to SDP\n", buf );
-				ret = EADDRNOTAVAIL;
+				ret = -1;
 				goto done;
 			}
 #ifndef SDP_SUPPORTS_IPv6
@@ -1485,7 +1492,7 @@ connect(
 		if ( __sdp_sockaddr_to_sdp( serv_addr, addrlen, sdp_sin, &was_ipv6 ) ) {
 			__sdp_log( 9, "Error connect: "
 						  "failed to convert to shadow address:%s to SDP\n", buf );
-			ret = EADDRNOTAVAIL;
+			ret = -1;
 		} else {
 #ifndef SDP_SUPPORTS_IPv6
 			if ( was_ipv6 )
@@ -1590,7 +1597,8 @@ listen(
 		  getsockname( fd, ( struct sockaddr * )&tmp_sin, &tmp_sinlen ) < 0 ) {
 		__sdp_log( 9, "Error listen: getsockname return <%d> for TCP socket\n",
 					  errno );
-		sret = EADDRNOTAVAIL;
+		errno = EADDRNOTAVAIL;
+		sret = -1;
 		goto done;
 	}
 
@@ -1623,7 +1631,7 @@ listen(
 
tmp_sinlen, sdp_sin, &was_ipv6 ) ) {
 			__sdp_log( 9, "Error listen: "
 						  "failed to convert to address:%s to SDP\n", buf );
-			ret = EOPNOTSUPP;
+			ret = -1;
 		} else {
 #ifndef SDP_SUPPORTS_IPv6
 			if ( was_ipv6 )


From hnguyen at linux.vnet.ibm.com  Mon Jul 30 08:02:59 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 30 Jul 2007 17:02:59 +0200
Subject: [ofa-general] Re: [PATCH 2/5] ehca: Generate event when SRQ limit
	reached
In-Reply-To: <ada7iou75tn.fsf@cisco.com>
References: <200707201602.19142.hnguyen@linux.vnet.ibm.com>
	<ada7iou75tn.fsf@cisco.com>
Message-ID: <200707301703.00111.hnguyen@linux.vnet.ibm.com>

Hi,
> BTW, does your SRQ-capable hardware support generating the "last WQE
> reached" event?  There's not any reliable way to avoid problems when
> destroying QPs attached to an SRQ without it, and the IB spec requires
> CAs that support SRQs to generate it (o11-5.2.5 in chapter 11 of vol 1).
> 
> I don't see any code in ehca to generate the event, and IPoIB CM at
> least will be very unhappy when using SRQs if the event is not
> generated.
Thanks for this good catch. We're investigating how to implement this.
Will keep you updated.
Regards
Nam


From arthur.jones at qlogic.com  Mon Jul 30 08:06:00 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 30 Jul 2007 08:06:00 -0700
Subject: [ofa-general] [PATCH 1/4] IB/ipath - Remove unsafe fastrcvint code
	from interrupt handler
In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070730150600.19920.61255.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

The fastrcvint code's purpose was to avoid reading the interrupt status
if kernel packets were in the receive queue (to reduce overhead).  Because
intstatus was not read, we could miss the error interrupt bit indicating
freeze mode, since it only delivers a single interrupt, even if still
pending after intclear is written.

This patch removes that optimization.

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_common.h |    3 +--
 drivers/infiniband/hw/ipath/ipath_intr.c   |   31 ----------------------------
 2 files changed, 1 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h
index b4b786d..6ad822c 100644
--- a/drivers/infiniband/hw/ipath/ipath_common.h
+++ b/drivers/infiniband/hw/ipath/ipath_common.h
@@ -100,8 +100,7 @@ struct infinipath_stats {
 	__u64 sps_hwerrs;
 	/* number of times IB link changed state unexpectedly */
 	__u64 sps_iblink;
-	/* kernel receive interrupts that didn't read intstat */
-	__u64 sps_fastrcvint;
+	__u64 sps_unused; /* was fastrcvint, no longer implemented */
 	/* number of kernel (port0) packets received */
 	__u64 sps_port0pkts;
 	/* number of "ethernet" packets sent by driver */
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 1fd91c5..9b03154 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -1035,36 +1035,6 @@ irqreturn_t ipath_intr(int irq, void *data)
 		goto bail;
 	}
 
-	/*
-	 * We try to avoid reading the interrupt status register, since
-	 * that's a PIO read, and stalls the processor for up to about
-	 * ~0.25 usec. The idea is that if we processed a port0 packet,
-	 * we blindly clear the  port 0 receive interrupt bits, and nothing
-	 * else, then return.  If other interrupts are pending, the chip
-	 * will re-interrupt us as soon as we write the intclear register.
-	 * We then won't process any more kernel packets (if not the 2nd
-	 * time, then the 3rd or 4th) and we'll then handle the other
-	 * interrupts.   We clear the interrupts first so that we don't
-	 * lose intr for later packets that arrive while we are processing.
-	 */
-	oldhead = dd->ipath_port0head;
-	curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr);
-	if (oldhead != curtail) {
-		if (dd->ipath_flags & IPATH_GPIO_INTR) {
-			ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear,
-					 (u64) (1 << IPATH_GPIO_PORT0_BIT));
-			istat = port0rbits | INFINIPATH_I_GPIO;
-		}
-		else
-			istat = port0rbits;
-		ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat);
-		ipath_kreceive(dd);
-		if (oldhead != dd->ipath_port0head) {
-			ipath_stats.sps_fastrcvint++;
-			goto done;
-		}
-	}
-
 	istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus);
 
 	if (unlikely(!istat)) {
@@ -1225,7 +1195,6 @@ irqreturn_t ipath_intr(int irq, void *data)
 		handle_layer_pioavail(dd);
 	}
 
-done:
 	ret = IRQ_HANDLED;
 
 bail:


From arthur.jones at qlogic.com  Mon Jul 30 08:05:55 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 30 Jul 2007 08:05:55 -0700
Subject: [ofa-general] [PATCH] IB/ipath -- bug fixes in for-roland
Message-ID: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>

hi roland,  welcome back -- to help you feel wanted,
here are the latest set of fixes for 2.6.23.  these
changes are avail via git pull from:

git://git.qlogic.com/ipath-linux-2.6 for-roland

arthur


From arthur.jones at qlogic.com  Mon Jul 30 08:06:05 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 30 Jul 2007 08:06:05 -0700
Subject: [ofa-general] [PATCH 2/4] IB/ipath - use faster put_tid_2 routine
	after initialization
In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070730150605.19920.66997.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

At some point the ipath_minrev field was initialized prior to
the ipath_init_iba6120_funcs call, but that is no longer the
case, so the slower put_tid routine was always being used.
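
The ordering problem can be shown with a reduced model: the function table is
filled in before the chip revision is known, so the choice of the faster
routine has to be made later, once the revision has been read. All names
below (struct sketch_dev and the sketch_* functions) are hypothetical, and
the routines only return a marker value.

```c
/* Hypothetical, reduced stand-in for struct ipath_devdata. */
struct sketch_dev {
	unsigned int minrev;    /* chip minor revision, 0 until read */
	int (*put_tid)(void);   /* selected TID write routine */
};

static int sketch_put_tid(void)   { return 1; }  /* safe on every revision */
static int sketch_put_tid_2(void) { return 2; }  /* faster, needs minrev >= 2 */

/* Early init: the revision is not known yet, so start with the safe routine. */
static void sketch_init_funcs(struct sketch_dev *dd)
{
	dd->minrev = 0;
	dd->put_tid = sketch_put_tid;
}

/* Later: the revision has been read, upgrade the routine if the chip allows. */
static void sketch_set_revision(struct sketch_dev *dd, unsigned int minrev)
{
	dd->minrev = minrev;
	if (dd->minrev >= 2)
		dd->put_tid = sketch_put_tid_2;
}
```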

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_iba6120.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index 9868ccd..5b6ac9a 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -321,6 +321,8 @@ static const struct ipath_hwerror_msgs ipath_6120_hwerror_msgs[] = {
 		        << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT)
 
 static int ipath_pe_txe_recover(struct ipath_devdata *);
+static void ipath_pe_put_tid_2(struct ipath_devdata *, u64 __iomem *,
+			       u32, unsigned long);
 
 /**
  * ipath_pe_handle_hwerrors - display hardware errors.
@@ -555,8 +557,11 @@ static int ipath_pe_boardname(struct ipath_devdata *dd, char *name,
 		ipath_dev_err(dd, "Unsupported InfiniPath hardware revision %u.%u!\n",
 			      dd->ipath_majrev, dd->ipath_minrev);
 		ret = 1;
-	} else
+	} else {
 		ret = 0;
+		if (dd->ipath_minrev >= 2)
+			dd->ipath_f_put_tid = ipath_pe_put_tid_2;
+	}
 
 	return ret;
 }
@@ -1220,7 +1225,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port)
 		 port * dd->ipath_rcvtidcnt * sizeof(*tidbase));
 
 	for (i = 0; i < dd->ipath_rcvtidcnt; i++)
-		ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED,
+		dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED,
 				 tidinv);
 
 	tidbase = (u64 __iomem *)
@@ -1229,7 +1234,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port)
 		 port * dd->ipath_rcvegrcnt * sizeof(*tidbase));
 
 	for (i = 0; i < dd->ipath_rcvegrcnt; i++)
-		ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER,
+		dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER,
 				 tidinv);
 }
 
@@ -1395,10 +1400,11 @@ void ipath_init_iba6120_funcs(struct ipath_devdata *dd)
 	dd->ipath_f_quiet_serdes = ipath_pe_quiet_serdes;
 	dd->ipath_f_bringup_serdes = ipath_pe_bringup_serdes;
 	dd->ipath_f_clear_tids = ipath_pe_clear_tids;
-	if (dd->ipath_minrev >= 2)
-		dd->ipath_f_put_tid = ipath_pe_put_tid_2;
-	else
-		dd->ipath_f_put_tid = ipath_pe_put_tid;
+	/*
+	 * this may get changed after we read the chip revision,
+	 * but we start with the safe version for all revs
+	 */
+	dd->ipath_f_put_tid = ipath_pe_put_tid;
 	dd->ipath_f_cleanup = ipath_setup_pe_cleanup;
 	dd->ipath_f_setextled = ipath_setup_pe_setextled;
 	dd->ipath_f_get_base_info = ipath_pe_get_base_info;


From arthur.jones at qlogic.com  Mon Jul 30 08:06:10 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 30 Jul 2007 08:06:10 -0700
Subject: [ofa-general] [PATCH 3/4] IB/ipath - Fix some issues with buffer
	cancel and sendctrl register update
In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070730150610.19920.74815.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

The code confused INFINIPATH_S_PIOBUFAVAILUPD (the mask value) with
IPATH_S_PIOBUFAVAILUPD (the bit position). Also, some callers of
ipath_cancel_sends() need kr_sendctrl restored immediately, while
others restore it themselves later.
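
To illustrate the first problem, here is a minimal sketch of what goes
wrong when a bit position is used where a mask is expected. The names
and the bit number (2) are hypothetical and chosen only for this
example; they are not taken from the driver headers.

```c
#include <stdint.h>

/* Hypothetical names; bit number 2 is an arbitrary assumption. */
#define EXAMPLE_S_BUFAVAILUPD   2                                /* bit position */
#define EXAMPLE_V_BUFAVAILUPD   (1ULL << EXAMPLE_S_BUFAVAILUPD)  /* mask value */

/* Intended behavior: clear only the update-enable bit. */
uint64_t clear_with_mask(uint64_t sendctrl)
{
	return sendctrl & ~EXAMPLE_V_BUFAVAILUPD;
}

/*
 * Mistaken behavior: ~2 has every bit set except bit 1, so this
 * clears bit 1 and leaves the intended bit (bit 2) untouched.
 */
uint64_t clear_with_bit_position(uint64_t sendctrl)
{
	return sendctrl & ~(uint64_t)EXAMPLE_S_BUFAVAILUPD;
}
```

With sendctrl = 0xF, the mask version yields 0xB (bit 2 cleared), while
the bit-position version yields 0xD (bit 1 cleared, bit 2 still set).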

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_driver.c    |   11 +++++++----
 drivers/infiniband/hw/ipath/ipath_init_chip.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_intr.c      |    6 +++---
 drivers/infiniband/hw/ipath/ipath_kernel.h    |    2 +-
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 09c5fd8..6ccba36 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -740,7 +740,7 @@ void ipath_disarm_piobufs(struct ipath_devdata *dd, unsigned first,
 	 * pioavail updates to memory to stop.
 	 */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-			 sendorig & ~IPATH_S_PIOBUFAVAILUPD);
+			 sendorig & ~INFINIPATH_S_PIOBUFAVAILUPD);
 	sendorig = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
 			 dd->ipath_sendctrl);
@@ -1614,7 +1614,7 @@ int ipath_waitfor_mdio_cmdready(struct ipath_devdata *dd)
  * it's safer to always do it.
  * PIOAvail bits are updated by the chip as if normal send had happened.
  */
-void ipath_cancel_sends(struct ipath_devdata *dd)
+void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl)
 {
 	ipath_dbg("Cancelling all in-progress send buffers\n");
 	dd->ipath_lastcancel = jiffies+HZ/2; /* skip armlaunch errs a bit */
@@ -1627,6 +1627,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd)
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_disarm_piobufs(dd, 0,
 		(unsigned)(dd->ipath_piobcnt2k + dd->ipath_piobcnt4k));
+	if (restore_sendctrl) /* else done by caller later */
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
+				 dd->ipath_sendctrl);
 
 	/* and again, be sure all have hit the chip */
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
@@ -1655,7 +1658,7 @@ static void ipath_set_ib_lstate(struct ipath_devdata *dd, int which)
 	/* flush all queued sends when going to DOWN or INIT, to be sure that
 	 * they don't block MAD packets */
 	if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT)
-		ipath_cancel_sends(dd);
+		ipath_cancel_sends(dd, 1);
 
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl,
 			 dd->ipath_ibcctrl | which);
@@ -2000,7 +2003,7 @@ void ipath_shutdown_device(struct ipath_devdata *dd)
 
 	ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_DISABLE <<
 			    INFINIPATH_IBCC_LINKINITCMD_SHIFT);
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 
 	/* disable IBC */
 	dd->ipath_control &= ~INFINIPATH_C_LINKENABLE;
diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 49951d5..71e6c9d 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -782,7 +782,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 	 * Follows early_init because some chips have to initialize
 	 * PIO buffers in early_init to avoid false parity errors.
 	 */
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 
 	/* early_init sets rcvhdrentsize and rcvhdrsize, so this must be
 	 * done after early_init */
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 9b03154..a5b3e7e 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -303,7 +303,7 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd,
 		 * Flush all queued sends when link went to DOWN or INIT,
 		 * to be sure that they don't block SMA and other MAD packets
 		 */
-		ipath_cancel_sends(dd);
+		ipath_cancel_sends(dd, 1);
 	}
 	else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM ||
 	    lstate == IPATH_IBSTATE_ACTIVE) {
@@ -799,13 +799,13 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	 * therefore would not be sent, and eventually
 	 * might cause the process to run out of bufs
 	 */
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
 			 dd->ipath_control);
 
 	/* ensure pio avail updates continue */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-		 dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD);
+		 dd->ipath_sendctrl & ~INFINIPATH_S_PIOBUFAVAILUPD);
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
 		 dd->ipath_sendctrl);
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index ace63ef..ef77329 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -683,7 +683,7 @@ int ipath_unordered_wc(void);
 
 void ipath_disarm_piobufs(struct ipath_devdata *, unsigned first,
 			  unsigned cnt);
-void ipath_cancel_sends(struct ipath_devdata *);
+void ipath_cancel_sends(struct ipath_devdata *, int);
 
 int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *);
 void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *);


From arthur.jones at qlogic.com  Mon Jul 30 08:06:15 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Mon, 30 Jul 2007 08:06:15 -0700
Subject: [ofa-general] [PATCH 4/4] IB/ipath - Workaround problem of errormask
	register being overwritten
In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
Message-ID: <20070730150615.19920.44705.stgit@eng-46.internal.keyresearch.com>

From: Dave Olson <dave.olson at qlogic.com>

On some system hardware, we are seeing moderately common cases of the
chip errormask register being overwritten due to a chip bug in iba6120
that is triggered by a vendor specific PCIe broadcast message.  This
patch merely checks periodically, and corrects it if needed (the overwrite
can cause us to not get error and hardware error interrupts).  Also, make
dd->ipath_errormask the one, true canonical source for kr_errormask, and
remove references to ipath_ignorederrs as it is currently unused.
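
The shadow-and-repair idea can be sketched in isolation as below. This
is a simplified model, not the driver code: the struct, its field names
and the function name are hypothetical, and the "register" is an
ordinary variable standing in for the chip.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device: one register plus its shadow. */
struct example_dev {
	uint64_t hw_errormask;     /* simulated chip register */
	uint64_t shadow_errormask; /* value software last intended to write */
	unsigned int fixed;        /* number of times the register was repaired */
};

/*
 * Compare the register with its shadow and rewrite it if it drifted.
 * Returns true when a repair was needed.
 */
bool example_chk_errormask(struct example_dev *dev)
{
	if (!dev->shadow_errormask)
		return false;	/* nothing programmed yet, nothing to check */
	if (dev->hw_errormask == dev->shadow_errormask)
		return false;	/* register still holds the expected value */

	dev->hw_errormask = dev->shadow_errormask;
	dev->fixed++;
	return true;
}
```

Calling such a check from a periodic timer means an overwrite is
corrected within one timer period, provided every intentional write to
the register also updates the shadow.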

Signed-off-by: Dave Olson <dave.olson at qlogic.com>
Signed-off-by: John Gregor <john.gregor at qlogic.com>
---

 drivers/infiniband/hw/ipath/ipath_init_chip.c |    5 +-
 drivers/infiniband/hw/ipath/ipath_intr.c      |   25 ++++++-----
 drivers/infiniband/hw/ipath/ipath_kernel.h    |   11 +----
 drivers/infiniband/hw/ipath/ipath_stats.c     |   55 ++++++++++++++++++++++---
 4 files changed, 67 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 71e6c9d..9dd0bac 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -851,13 +851,14 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrmask,
 			 dd->ipath_hwerrmask);
 
-	dd->ipath_maskederrs = dd->ipath_ignorederrs;
 	/* clear all */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, -1LL);
 	/* enable errors that are masked, at least this first time. */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
 			 ~dd->ipath_maskederrs);
-	/* clear any interrups up to this point (ints still not enabled) */
+	dd->ipath_errormask = ipath_read_kreg64(dd,
+		dd->ipath_kregs->kr_errormask);
+	/* clear any interrupts up to this point (ints still not enabled) */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL);
 
 	/*
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index a5b3e7e..6480465 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -517,10 +517,7 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs)
 
 	supp_msgs = handle_frequent_errors(dd, errs, msg, &noprint);
 
-	/*
-	 * don't report errors that are masked (includes those always
-	 * ignored)
-	 */
+	/* don't report errors that are masked */
 	errs &= ~dd->ipath_maskederrs;
 
 	/* do these first, they are most important */
@@ -566,19 +563,19 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs)
 		 * ones on this particular interrupt, which also isn't great
 		 */
 		dd->ipath_maskederrs |= dd->ipath_lasterror | errs;
+		dd->ipath_errormask &= ~dd->ipath_maskederrs;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-				 ~dd->ipath_maskederrs);
+			dd->ipath_errormask);
 		s_iserr = ipath_decode_err(msg, sizeof msg,
-				 (dd->ipath_maskederrs & ~dd->
-				  ipath_ignorederrs));
+			dd->ipath_maskederrs);
 
-		if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) &
+		if (dd->ipath_maskederrs &
 			~(INFINIPATH_E_RRCVEGRFULL |
 			INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS))
 			ipath_dev_err(dd, "Temporarily disabling "
 			    "error(s) %llx reporting; too frequent (%s)\n",
-				(unsigned long long) (dd->ipath_maskederrs &
-				~dd->ipath_ignorederrs), msg);
+				(unsigned long long)dd->ipath_maskederrs,
+				msg);
 		else {
 			/*
 			 * rcvegrfull and rcvhdrqfull are "normal",
@@ -793,6 +790,9 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	/* disable error interrupts, to avoid confusion */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL);
 
+	/* also disable interrupts; errormask is sometimes overwritten */
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL);
+
 	/*
 	 * clear all sends, because they have may been
 	 * completed by usercode while in freeze mode, and
@@ -817,7 +817,7 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	for (i = 0; i < dd->ipath_pioavregs; i++) {
 		/* deal with 6110 chip bug */
 		im = i > 3 ? ((i&1) ? i-1 : i+1) : i;
-		val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64)));
+		val = ipath_read_kreg64(dd, (0x1000/sizeof(u64))+im);
 		dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i]
 			= le64_to_cpu(val);
 	}
@@ -832,7 +832,8 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear,
 		E_SPKT_ERRS_IGNORE);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-		~dd->ipath_maskederrs);
+		dd->ipath_errormask);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, -1LL);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL);
 }
 
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index ef77329..7a7966f 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -261,18 +261,10 @@ struct ipath_devdata {
 	 * limiting of hwerror reporting
 	 */
 	ipath_err_t ipath_lasthwerror;
-	/*
-	 * errors masked because they occur too fast, also includes errors
-	 * that are always ignored (ipath_ignorederrs)
-	 */
+	/* errors masked because they occur too fast */
 	ipath_err_t ipath_maskederrs;
 	/* time in jiffies at which to re-enable maskederrs */
 	unsigned long ipath_unmasktime;
-	/*
-	 * errors always ignored (masked), at least for a given
-	 * chip/device, because they are wrong or not useful
-	 */
-	ipath_err_t ipath_ignorederrs;
 	/* count of egrfull errors, combined for all ports */
 	u64 ipath_last_tidfull;
 	/* for ipath_qcheck() */
@@ -436,6 +428,7 @@ struct ipath_devdata {
 	u64 ipath_lastibcstat;
 	/* hwerrmask shadow */
 	ipath_err_t ipath_hwerrmask;
+	ipath_err_t ipath_errormask; /* errormask shadow */
 	/* interrupt config reg shadow */
 	u64 ipath_intconfig;
 	/* kr_sendpiobufbase value */
diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c
index 73ed17d..7338312 100644
--- a/drivers/infiniband/hw/ipath/ipath_stats.c
+++ b/drivers/infiniband/hw/ipath/ipath_stats.c
@@ -196,6 +196,46 @@ static void ipath_qcheck(struct ipath_devdata *dd)
 	}
 }
 
+
+static void ipath_chk_errormask(struct ipath_devdata *dd)
+{
+	static u32 fixed;
+	u32 ctrl;
+	unsigned long errormask;
+	unsigned long hwerrs;
+
+	if (!dd->ipath_errormask || !(dd->ipath_flags & IPATH_INITTED))
+		return;
+
+	errormask = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errormask);
+
+	if (errormask == dd->ipath_errormask)
+		return;
+	fixed++;
+
+	hwerrs = ipath_read_kreg64(dd, dd->ipath_kregs->kr_hwerrstatus);
+	ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control);
+
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
+		dd->ipath_errormask);
+
+	if ((hwerrs & dd->ipath_hwerrmask) ||
+		(ctrl & INFINIPATH_C_FREEZEMODE)) {
+		/* force re-interrupt of pending events, just in case */
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, 0ULL);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL);
+		dev_info(&dd->pcidev->dev,
+			"errormask fixed(%u) %lx -> %lx, ctrl %x hwerr %lx\n",
+			fixed, errormask, (unsigned long)dd->ipath_errormask,
+			ctrl, hwerrs);
+	} else
+		ipath_dbg("errormask fixed(%u) %lx -> %lx, no freeze\n",
+			fixed, errormask,
+			(unsigned long)dd->ipath_errormask);
+}
+
+
 /**
  * ipath_get_faststats - get word counters from chip before they overflow
  * @opaque - contains a pointer to the infinipath device ipath_devdata
@@ -251,14 +291,13 @@ void ipath_get_faststats(unsigned long opaque)
 		dd->ipath_lasterror = 0;
 	if (dd->ipath_lasthwerror)
 		dd->ipath_lasthwerror = 0;
-	if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs)
+	if (dd->ipath_maskederrs
 	    && time_after(jiffies, dd->ipath_unmasktime)) {
 		char ebuf[256];
 		int iserr;
 		iserr = ipath_decode_err(ebuf, sizeof ebuf,
-				 (dd->ipath_maskederrs & ~dd->
-				  ipath_ignorederrs));
-		if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) &
+			dd->ipath_maskederrs);
+		if (dd->ipath_maskederrs &
 				~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL |
 				INFINIPATH_E_PKTERRS ))
 			ipath_dev_err(dd, "Re-enabling masked errors "
@@ -278,9 +317,12 @@ void ipath_get_faststats(unsigned long opaque)
 				ipath_cdbg(ERRPKT, "Re-enabling packet"
 						" problem interrupt (%s)\n", ebuf);
 		}
-		dd->ipath_maskederrs = dd->ipath_ignorederrs;
+
+		/* re-enable masked errors */
+		dd->ipath_errormask |= dd->ipath_maskederrs;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-				 ~dd->ipath_maskederrs);
+			dd->ipath_errormask);
+		dd->ipath_maskederrs = 0;
 	}
 
 	/* limit qfull messages to ~one per minute per port */
@@ -294,6 +336,7 @@ void ipath_get_faststats(unsigned long opaque)
 		}
 	}
 
+	ipath_chk_errormask(dd);
 done:
 	mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5);
 }


From kenjeffries at storagegear.com  Mon Jul 30 08:30:18 2007
From: kenjeffries at storagegear.com (Ken Jeffries)
Date: Mon, 30 Jul 2007 10:30:18 -0500
Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance
	with Modified Write Protocol
Message-ID: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip>


We have been doing a fair amount of performance testing on our SRP target.
One thing we found early on was that client writes were considerably slower
than client reads. We addressed this by patching the SRP client code so
that it could include the client write data in the SRP CMD IU if it would
fit. This notion is in iSER but is not in standard SRP. Architecturally,
the capability is signaled using an additional data buffer format bit.
We find that client write performance is considerably improved by using
this capability. We are calling SRP spec compliant writes "standard
writes" and our modified writes "iu data writes".

We also implemented a similar capability for client reads but on our system
we did not see a performance improvement.

We would like to know if other SRP'rs would be interested in us making
the patch available for either inclusion or for discussion. Since we did
this without input from anyone else we are not going to claim that the
way we did it is necessarily the best way to do it.

Below are some of our performance numbers, preceded by a description of
our test setup.

The StorageGear SRP Solid State Disk System is an asymmetrical embedded system
based on proprietary firmware and a Supermicro X7DBi+ motherboard with two
2.00GHz Woodcrest processors (four cpus altogether). The system used in this
test includes two Mellanox sdr pci-e hcas in 8x slots. Four independent SSDs
(SRP0, SRP1, ...) are configured. SRP0 is made visible on the first hca port,
SRP1 is made visible on the second hca port and so on. Each hca is statically
associated with a cpu at boot time. The native block size of each ssd is 4KB.
The native block size can be configured to be from 512B to 64KB. We suspect
that 4KB is best for Linux applications.

"testy" is a small client program that uses Linux asynchronous i/o and O_DIRECT
to drive read and write requests as quickly as possible. It tries to keep a
specified number of reads or writes of specified size outstanding for a
specified time. testy was written because available tools were not able to
load the StorageGear target sufficiently.  All testy io is random. For an
SSD, random io performance should be the same as sequential so we don't look
at sequential performance at all.

The SRP clients, Tesla and Newton, used in the tests have Asus A8N32-SLI Deluxe
motherboards, each with an AMD 1.8GHz Dual Core Opteron 165 processor, 1GB ram,
2 Mellanox sdr pci-e 8x hcas in 16x slots running OFED-1.2 with SRP on SUSE
Linux Enterprise Server 10 (x86_64).  Tesla runs kernel 2.6.16.27-0.9-smp and
Newton runs kernel 2.6.16.21-0.8-smp.

Two Mellanox MTEK 43132 8-port 4x switches are used to implement two subnets.
SMs for each subnet are provided by separate systems.

For these tests, four testys are run, two per client, one per srp target. The
paths are arranged thru visibility and allow/deny configuration to use all
four client ports and all four srp target ports. We monitor our target cpu
utilization and we know that the maximum number of "small" iops for a
particular hca is reached when the cpu associated with the hca reaches 100%
utilization. All numbers are 90 second testy run averages.

4KB Random Standard Reads
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
newton.0    srp0    30636
newton.1    srp1    30682
                                    hca0        61318
tesla.0     srp2    30680
tesla.1     srp3    30710
                                    hca1        61390

4KB Random Standard Writes
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
newton.0    srp0    25201
newton.1    srp1    25291
                                    hca0        50492
tesla.0     srp2    25412
tesla.1     srp3    25441
                                    hca1        50853

4KB Random IU Data Writes
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
newton.0    srp0    31993
newton.1    srp1    32526
                                    hca0        64519
tesla.0     srp2    32172
tesla.1     srp3    32594
                                    hca1        64766

64KB Random Standard Reads
testy       target  target mbps     target hca  hca mbps
-----       ------  -----------     ----------  --------
newton.0    srp0    681.2
newton.1    srp1    681.2
                                    hca0        1362.4
tesla.0     srp2    680.1
tesla.1     srp3    680.2
                                    hca1        1360.3

128KB Random Standard Writes
testy       target  target mbps     target hca  hca mbps
-----       ------  -----------     ----------  --------
newton.0    srp0    747.8
newton.1    srp1    739.5
                                    hca0        1487.3
tesla.0     srp2    747.2
tesla.1     srp3    738.7
                                    hca1        1485.9

The following tests are one testy to one srp target.

4KB Random Reads
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
tesla       srp3    59289
                                    hca1        59289

4KB Random Standard Writes
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
tesla       srp3    43054
                                    hca1        43054

4KB Random IU Data Writes
testy       target  target iops     target hca  hca iops
-----       ------  -----------     ----------  --------
tesla       srp3    53839
                                    hca1        53839

128KB Random Standard Reads
testy       target  target mbps     target hca  hca mbps
-----       ------  -----------     ----------  --------
tesla       srp3    971.9
                                    hca1        971.9

128KB Random Standard Writes
testy       target  target mbps     target hca  hca mbps
-----       ------  -----------     ----------  --------
tesla       srp3    881.5
                                    hca1        881.5
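
As a rough cross-check of the 4KB tables above, the relative gain of
IU data writes over standard writes can be computed with a small
helper like the one below. The function name is hypothetical; the
inputs are the iops figures already listed in the tables.

```c
/* Percentage change going from a baseline rate to an improved rate. */
double percent_gain(double baseline, double improved)
{
	return (improved - baseline) / baseline * 100.0;
}
```

For the four-testy runs this gives about 27.8% on hca0 (50492 to 64519)
and about 27.4% on hca1 (50853 to 64766). For the single-testy run on
srp3 it gives about 25.0% (43054 to 53839).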


We have done some testing with directly connected DDR
hcas. The DDR hcas provide an iops boost in the range
of 10%.

Ken Jeffries
StorageGear


From hal.rosenstock at gmail.com  Mon Jul 30 09:04:40 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 12:04:40 -0400
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <f0e08f230707300904y28b39f43k2a9c506c73c6f342@mail.gmail.com>

Hi Yevgeny,

On 7/21/07, Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> Hi All
>
> Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
> Your comments are welcome.

A couple of quick questions:

How does this differ from the original RFC posted 5/30/06 ?

What I can see is the following:
1. Updated for not yet released IBTA QoS Annex
2. Use of plain text rather than XML based policy file for OpenSM
Anything else ?

Below, IPoIB is discussed in terms of UD. What about IPoIB-CM ? It
uses CM and has a service ID.

Also, have my specific comments to the patches originally submitted
been addressed ? (Do I need to dig them out again ?) Just wondering...

Thanks.

-- Hal


>
> -- Yevgeny
>
>               RFC: OpenFabrics Enhancements for QoS Support
>              ===============================================
>
> Authors: . Eitan Zahavi <eitan at mellanox.co.il>
> Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
> Date: .... Jul 2007.
> Revision:  0.2
>
> Table of contents:
> 1. Overview
> 2. Architecture
> 3. Supported Policy
> 4. CMA functionality
> 5. IPoIB functionality
> 6. SDP functionality
> 7. SRP functionality
> 8. iSER functionality
> 9. OpenSM functionality
>
> 1. Overview
> ------------
> Quality of Service requirements stem from the realization of I/O consolidation
> over IB network: As multiple applications and ULPs share the same fabric, means
> to control their use of the network resources are becoming a must. The basic
> need is to differentiate the service levels provided to different traffic flows,
> so that a policy can be enforced to control each flow's utilization of the
> fabric resources.
>
> The IBTA specification defines several hardware features and management interfaces
> to support QoS:
> * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
> * Arbitration between traffic of different VLs is performed by a 2 priority
>   levels weighted round robin arbiter. The arbiter is programmable with
>   a sequence of (VL, weight) pairs and maximal number of high priority credits
>   to be processed before low priority is served
> * Packets carry class of service marking in the range 0 to 15 in their
>   header SL field
> * Each switch can map the incoming packet by its SL to a particular output
>   VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> * The Subnet Administrator controls the parameters of each communication flow
>   by providing them as a response to Path Record (PR) or MultiPathRecord (MPR)
>   queries
>
> The IB QoS features provide the means to implement a DiffServ like architecture.
> The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in highly dynamic
> fabrics.
>
> This proposal provides the detailed functional definition for the various
> software elements that are required to enable a DiffServ like architecture over
> the OpenFabrics software stack.
>
>
>
> 2. Architecture
> ----------------
> This proposal splits the QoS functionality between the SM/SA, CMA and the various
> ULPs. We take the "chronology approach" to describe how the overall system
> works:
>
> 2.1. The network manager (human) provides a set of rules (policy) that defines
> how the network is being configured and how its resources are split to different
> QoS-Levels. The policy also defines how to decide which QoS-Level each
> application or ULP or service uses.
>
> 2.2. The SM analyzes the provided policy to see if it is realizable and performs
> the necessary fabric setup. The SM may continuously monitor the policy and adapt
> to changes in it. Part of this policy defines the default QoS-Level of each
> partition. The SA is being enhanced to match the requested Source, Destination,
> QoS-Class, Service-ID (and optionally SL and priority) against the policy. So
> clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also
> enhanced to support setting up partitions with appropriate IPoIB broadcast
> group. This broadcast group carries its QoS attributes: SL, MTU and
> RATE.
>
> 2.3. IPoIB is being set up. IPoIB uses the SL, MTU and RATE available on the
> multicast group which forms the broadcast group of this partition.
>
> 2.4. MPI which provides non IB based connection management should be configured
> to run using hard coded SLs. It uses these SLs for every QP being opened.
>
> 2.5. ULPs that use CM interface (like SRP) should have their own pre-assigned
> Service-ID and use it while obtaining PR/MPR for establishing connections.
> The SA receiving the PR/MPR should match it against the policy and return
> the appropriate PR/MPR including SL, MTU and RATE.
>
> 2.6. ULPs and programs using CMA to establish RC connection should provide the
> CMA the target IP and Service-ID. Some of the ULPs might also provide QoS-Class
> (E.g. for SDP sockets that are provided the TOS socket option). The CMA should
> then use the provided Service-ID and optional QoS-Class and pass them in the
> PR/MPR request. The resulting PR/MPR should be used for configuring the
> connection QP.
>
> PathRecord and MultiPathRecord enhancement for QoS:
> As mentioned above the PathRecord and MultiPathRecord attributes should be
> enhanced to carry the Service-ID which is a 64bit value, which has been
> standardized by the IBTA. A new field QoS-Class is also provided.
> A new capability bit should describe the SM QoS support in the SA class port
> info. This approach provides an easy migration path for existing access layer
> and ULPs by not introducing a new set of PR/MPR attributes.
>
>
> 3. Supported Policy
> --------------------
>
> The QoS policy supported by this proposal is divided into 4 sub sections:
>
> I) Port Group: a set of CAs, Routers or Switches that share the same settings.
> A port group might be a partition defined by the partition manager policy in
> terms of GUIDs. Future implementations might provide support for NodeDescription
> based definition of port groups.
>
> II) Fabric Setup:
> Defines how the SL2VL and VLArb tables should be set up. This policy definition
> assumes the computation of overall end to end network behavior should be performed
> outside of OpenSM.
>
> III) QoS-Levels Definition:
> This section defines the possible sets of parameters for QoS that a client
> might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
> Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
>
> IV) Matching Rules:
> A list of rules that match an incoming PR/MPR request to a QoS-Level. The
> rules are processed in order such that the first match is applied. Each rule is
> built out of a set of match expressions which should all match for the rule to
> apply. The matching expressions are defined for the following fields
> ** SRC and DST to lists of port groups
> ** Service-ID to a list of Service-ID or Service-ID ranges
> ** QoS-Class to a list of QoS-Class values or ranges
>
> QoS Policy file syntax
>
> * Empty lines are ignored
> * Leading and trailing blanks are also ignored, so the
>   indentation in the example is just for better readability
> * Comments are started with the pound sign (#) and terminated by EOL
> * Comments may appear only in a separate line
> * Keywords that denote section/subsection start have matching closing keywords
> * Any keyword should be the first non-blank in the line
>
> QoS Policy file example
>
>     # Port Groups define sets of ports to be used later in the settings
>     port-groups
>         # using port GUIDs
>         port-group
>             name: Storage
>             # "use" is just a description that is used for logging.
>             #  Other than that, it is just a commentary
>             use: our SRP storage targets
>             port-guid: 0x1000000000000001
>             port-guid: 0x1000000000000002
>         end-port-group
>
>         port-group
>             name: Virtual Servers
>             use: node desc and IB port num
>             # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
>             # "hostname" and "CA-num" are compared to the first 2 words of
>             # NodeDescription, and "Pnum" is a port number on that node.
>             port-name: vs1/HCA-1/P1
>             port-name: vs3/HCA-1/P1
>             port-name: vs3/HCA-2/P2
>         end-port-group
>
>         # using partitions defined in the partition policy
>         port-group
>             name: Group for Partition 1
>             use: default settings
>             partition: Part1
>         end-port-group
>
>         # using node types CA|ROUTER|SWITCH
>         port-group
>             name: Routers
>             use: all routers
>             node-type: ROUTER
>         end-port-group
>
>     end-port-groups
>
>     qos-setup
>
>         # define all types of VLArb tables. The length of the tables should
>         # match the physically supported tables by their target ports
>         vlarb-tables
>             # scope defines the exact ports the VLArb tables apply to
>             vlarb-scope
>                 # defining VLArb tables on all the ports that belong to
>                 # port group 'Storage', and on all the ports connected
>                 # to ports of port group 'Storage'
>                 group: Storage
>                 # "across" means all the ports that are connected to ports
>                 # that belong to the specified port group
>                 across: Storage
>                 # VLArb table holds VL and weight pairs
>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                 vl-high-limit: 10
>             end-vlarb-scope
>             # There can be several scopes
>         end-vlarb-tables
>
>         sl2vl-tables
>             # Scope defines the exact devices and in/out ports tables apply to.
>             # Note: if the same port is matching several rules the *FIRST* one applies.
>             sl2vl-scope
>                 # SL2VL tables are organized as SL2VL(in-port,out-port)
>                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                 #
>                 # The following example specifies that all the SL2VL tables
>                 # entries should be defined for all the ports of group Part1:
>                 group: Part1
>                 from: *
>                 to: *
>                 # SL2VL table has to have 16 values at max - one for each SL.
>                 # If the user specifies less than 16 values, all the missing
>                 # VL values will be implicitly set to 0
>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>             end-sl2vl-scope
>
>             sl2vl-scope
>                 # "across-to" is a combination of "across" keyword (definition can be found
>                 # in VLArb tables section) and "to" keyword.
>                 # "across: PortGroupName" refers to all the ports that are connected
>                 # to ports that belong to PortGroupName.
>                 #
>                 # Example of "across-to" usage:
>                 #   A user has a set of 'special' nodes (e.g. storage nodes), and all
>                 #   the traffic to these nodes has to get specific VL.
>                 #   The solution is to define a port group (e.g. "Storage") that will
>                 #   include all the ports of these nodes, and then to configure SL2VL
>                 #   tables on all the switch ports that are connected to the Storage
>                 #   port group by specifying "across-to: Storage".
>                 #
>                 across-to: Storage2
>                 # Similar to "across-to", "across-from" is a combination of "across"
>                 # and "from" keywords
>                 across-from: Storage1
>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>             end-sl2vl-scope
>         end-sl2vl-tables
>
>     end-qos-setup
>
>
>     qos-levels
>
>         # the first one is just setting SL
>         qos-level
>             use: for the lowest priority communication
>             sl: 15
>             packet-life: 16
>         end-qos-level
>         # the second sets SL and QoS Class
>         qos-level
>             use: low latency best bandwidth
>             sl: 0
>         end-qos-level
>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
>         qos-level
>             use: just an example
>             sl: 0
>             mtu-limit: 1
>             rate-limit: 1
>             packet-life: 12
>             # Path Bits can be used e.g. to provide different routes through the
>             # subnet to a particular port
>             path-bits: 2,4,8-32
>         end-qos-level
>
>     end-qos-levels
>
>
>     # Match rules are scanned in a first-fit manner (like firewall rules table)
>     qos-match-rules
>
>         # matching by single criteria: class (list of values and ranges)
>         qos-match-rule
>             # just a description
>             use: low latency by class 7-9 or 11
>             qos-class: 7-9,11
>             # number of qos-level to apply to the matching PR/MPR
>             qos-level-sn: 1
>         end-qos-match-rule
>         # show matching by destination group AND service-ids
>         qos-match-rule
>             use: Storage targets connection
>             destination: Storage
>             service-id: 22,4719-5000
>             qos-level-sn: 2
>         end-qos-match-rule
>         # show matching by source group only
>         qos-match-rule
>             use: bla bla
>             source: Storage
>             qos-level-sn: 3
>         end-qos-match-rule
>
>     end-qos-match-rules
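The match rules above take lists of single values and ranges (for example
"qos-class: 7-9,11" and "service-id: 22,4719-5000"). A minimal sketch of how
such a list might be parsed and tested, assuming decimal values and inclusive
ranges. The function names are hypothetical and not taken from OpenSM.

```python
def parse_value_list(text):
    """Parse a string such as "7-9,11" into inclusive (low, high) tuples."""
    ranges = []
    for token in text.split(","):
        token = token.strip()
        if "-" in token:
            low, high = token.split("-", 1)
            ranges.append((int(low), int(high)))
        else:
            value = int(token)
            ranges.append((value, value))
    return ranges


def in_value_list(value, ranges):
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


qos_class = parse_value_list("7-9,11")
print(in_value_list(8, qos_class))   # covered by the range 7-9
print(in_value_list(10, qos_class))  # neither listed nor inside a range
```

Storing every entry as a (low, high) pair keeps the membership test uniform
for single values and ranges.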
>
>
> 4. IPoIB
> ---------
>
> IPoIB already queries the SA for its broadcast group information. The additional
> functionality required is for IPoIB to provide the broadcast group SL, MTU,
> and RATE in every following PathRecord query performed when a new UDAV is
> needed by IPoIB.
> We could assign a special Service-ID for IPoIB use but since all communication
> on the same IPoIB interface shares the same QoS-Level without the ability to
> differentiate it by target service we can ignore it for simplicity.
>
> 5. CMA features
> ----------------
>
> The CMA interface supports Service-ID through the notion of port space as a
> prefix to the port_num which is part of the sockaddr provided to
> rdma_resolve_addr(). What is missing is the explicit request for a QoS-Class that
> should allow the ULP (like SDP) to propagate a specific request for a class of
> service. A mechanism for providing the QoS-Class is available in the IPv6 address,
> so we could use that address field. Another option is to implement a special
> connection options API for CMA.
>
> Missing functionality by CMA is the usage of the provided QoS-Class and Service-ID
> in the sent PR/MPR. When a response is obtained it is an existing requirement for
> the CMA to use the PR/MPR from the response in setting up the QP address vector.
>
>
> 6. SDP
> -------
>
> SDP uses CMA for building its connections.
> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> holding the remote TCP/IP Port Number to connect to.
> SDP might be provided with SO_PRIORITY socket option. In that case the value
> provided should be sent to the CMA as the TClass option of that connection.
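The Service-ID layout quoted above (0x000000000001PPPP) can be illustrated
with a short sketch. The constant and function names are hypothetical; only
the bit layout comes from the text.

```python
# 0x000000000001PPPP with PPPP set to zero (layout taken from the RFC text).
SDP_SERVICE_ID_BASE = 0x0000000000010000


def sdp_service_id(tcp_port):
    """Return the SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


print(f"0x{sdp_service_id(22):016x}")  # prints 0x0000000000010016
```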
>
> 7. SRP
> -------
>
> Current SRP implementation uses its own CM callbacks (not CMA). So SRP should
> fill in the Service-ID in the PR/MPR by itself and use that information in
> setting up the QP. The T10 SRP standard defines the SRP Service-ID to be defined
> by the SRP target I/O Controller (but they should also comply with IBTA Service-
> ID rules). Anyway, the Service-ID is reported by the I/O Controller in the
> ServiceEntries DMA attribute and should be used in the PR/MPR if the SA
> reports its ability to handle QoS PR/MPRs.
>
> 8. iSER
> --------
> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> should be TBD.
>
>
> 9. OpenSM features
> -------------------
> The QoS related functionality to be provided by OpenSM can be split into two
> main parts:
>
> 9.1. Fabric Setup
> During fabric initialization the SM should parse the policy and apply its
> settings to the discovered fabric elements. The following actions should be
> performed:
> * Parsing of policy
> * Node Group identification. Warning should be provided for each node not
>   specified but found.
> * SL2VL settings should be validated:
>   + A warning will be provided if there are no matching targets for the SL2VL
>     setting statement.
>   + An error message will be printed to the log file if an invalid setting is
>     found. A setting is invalid if it refers to:
>     - Non existing port numbers of the target devices
>     - Unsupported VLs for the target device. In the latter case the map to non
>       existing VLs should be replaced with VL15, i.e. packets will be dropped.
> * SL2VL setting is to be performed
> * VL Arbitration table settings should be validated according to the following
>   rules:
>   + A warning will be provided if there are no matching targets for the setting
>     statement
>   + An error will be provided if the port number exceeds the target ports
>   + An error will be generated if the table length exceeds device capabilities
>   + A warning will be generated if the table quotes a VL that is not supported
>     by the target device
> * VL Arbitration tables will be set on the appropriate targets
>
> 9.2. PR/MPR query handling:
> OpenSM should be able to enforce the provided policy on client request.
> The overall flow for such requests is: first the request is matched against the
> defined match rules such that the target QoS-Level definition is found. Given
> the QoS-Level a path(s) search is performed with the given restrictions imposed
> by that level. The following two sections describe these steps.
>
> How Service-ID is carried in the PathRecord and MultiPathRecord attributes is
> now standardized by the IBTA.
>
>
> 9.2.1. Matching rule search:
> A rule is "matching" a PR/MPR request using the following criteria:
> * Matching rules provide values in a list of either single value, or range of
>   values. A PR/MPR field is "matching" the rule field if it is explicitly
>   noted in the list of values or is one of the values covered by a range
>   included in the field values list.
> * Only PR/MPR fields that have their component mask bit set should be
>   compared.
> * For a rule to be "matching" a PR/MPR request all the rule fields should be
>   "matching" their PR/MPR fields. Thus a PR/MPR request that does not have
>   a component mask bit set for one of the rule-defined fields cannot match
>   that rule.
> * A PR/MPR request that has a component mask bit set for one of the fields
>   that is not defined by the rule can match the rule.
>
> The algorithm to be used for searching for a rule match might be as simple as a
> sequential search through all rules or enhanced for better performance. The
> semantics of every rule field and its matching PR/MPR field are described
> below:
> * Source: the SGID or SLID should be part of this group
> * Destination: the DGID or DLID should be part of this group
> * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>   SM-Key field) matches any of this rule's Service-IDs
> * TClass: check if the PR/MPR TClass field is matching
>
> 9.2.2. PR/MPR response generation:
> The QoS-Level pointed to by the first rule that matches the PR/MPR request
> should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
> and QoS-Class. A default QoS-Level should be used if no rule is matching the query.
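The matching semantics and the default fallback described above can be
sketched as a first-fit search. This is an illustration under stated
assumptions, not OpenSM code: a request is modelled as a dict holding only
the fields whose component mask bit is set, every rule field is a list of
inclusive (low, high) ranges, and the destination group is approximated by
a DLID range. All names are hypothetical.

```python
def field_matches(value, allowed):
    """allowed is a list of inclusive (low, high) ranges."""
    return any(low <= value <= high for low, high in allowed)


def rule_matches(rule_fields, request):
    """request holds only the fields whose component mask bit is set."""
    for name, allowed in rule_fields.items():
        if name not in request:
            return False  # mask bit not set for a rule field: no match
        if not field_matches(request[name], allowed):
            return False
    return True  # request fields the rule does not define are ignored


def find_qos_level(rules, request, default_level):
    """First-fit search over the rules; fall back to the default level."""
    for rule_fields, level in rules:
        if rule_matches(rule_fields, request):
            return level
    return default_level


rules = [
    ({"qos_class": [(7, 9), (11, 11)]}, "level-1"),
    ({"dlid": [(1, 2)], "service_id": [(22, 22), (4719, 5000)]}, "level-2"),
]
print(find_qos_level(rules, {"qos_class": 8, "service_id": 22}, "default"))
print(find_qos_level(rules, {"service_id": 22}, "default"))
```

The first request matches the first rule even though it also carries a
Service-ID; the second matches nothing because neither rule has all of its
fields present in the request, so the default level is used.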
>
> The efficient algorithm for finding paths that meet the QoS-Level criteria is
> beyond the scope of this RFC and left for the implementer to provide. However
> the criteria by which the paths match the QoS-Level are described below:
>
> * SL: The paths found should all use the given SL. For that purpose the PR/MPR
>   algorithm should traverse the path from source to destination only through
>   ports that carry a valid VL (not VL15) by the SL2VL map (should consider input
>   and output ports and SL).
> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
> * Rate-Limit: The resulting paths RATE should not exceed the given RATE-Limit
>   (rate limit is given in units of link BW = Width*Speed according to IBTA
>   Specification Vol-1 table-205 p-901 l-24).
> * Path-Bits: define the target LID lowest bits (number of bits defined by the
>   target port PortInfo.LMC field). The path should traverse the LFT using the
>   target port LID with the path-bits set.
> * QoS-Class: should be returned in the result PR/MPR. When routing is going to
>   be supported by OpenSM we might use this field in selecting the target
>   router too in a TBD way.
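The Path-Bits criterion can be illustrated with a small sketch. It assumes
the port's base LID has its LMC low bits clear, so the port owns 2**LMC
consecutive LIDs. Skipping path-bits values that do not fit in LMC bits is
a choice made for this example only; the RFC does not say how such values
are handled. The function name is hypothetical.

```python
def lids_for_path_bits(base_lid, lmc, path_bits):
    """Return the destination LIDs selected by the given path-bits values."""
    span = 1 << lmc  # number of LIDs assigned to the port
    if base_lid % span != 0:
        raise ValueError("base LID must be aligned to 2**LMC")
    lids = []
    for bits in path_bits:
        if bits >= span:
            continue  # does not fit in LMC bits; skipped in this sketch
        lids.append(base_lid | bits)
    return lids


print(lids_for_path_bits(0x40, 3, [2, 4, 8]))  # prints [66, 68]
```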
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From swise at opengridcomputing.com  Mon Jul 30 09:55:03 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 30 Jul 2007 11:55:03 -0500
Subject: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <46ADEE7F.2000005@mellanox.co.il>
References: <46ADEE7F.2000005@mellanox.co.il>
Message-ID: <46AE17E7.3020305@opengridcomputing.com>

Am I missing the call info?  I tried an older conf id, and it didn't 
work.  Can you please post the conf call info along with the meeting 
notification?

Thanks,

Steve.


Tziporet Koren wrote:
> Hi All,
> 
> We will have our bi-weekly OFED meeting today at 9am PST
> 
> Agenda:
> - Status update
> - Bugzilla cleanup
> 
> If you have more agenda items please send them
> 
> Tziporet


From pauln at psc.edu  Mon Jul 30 09:56:31 2007
From: pauln at psc.edu (Paul Nowoczynski)
Date: Mon, 30 Jul 2007 12:56:31 -0400
Subject: [ofa-general] SDP kernel Oops.
Message-ID: <46AE183F.5090907@psc.edu>

Hi,
I am wondering if someone could shed some light on this problem?  I'm
trying to use SDP on a kernel socket with limited success.  Can someone
with working knowledge of SDP please give me some advice?  I'm running
OFED-1.1.  I've looked at diffs for 1.2 but didn't notice anything that
looked pertinent to this problem.  The bug appears when a second socket
instance is invoked.  My feeling is that the problem is related to
teardown.

thanks,
paul

----------- [cut here ] --------- [please bite here ] ---------
Jul 30 12:39:52 Kernel BUG at sdp_cma:372
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_sdp 
rdma_cm ib_addr iptable_filter ip_tables e1000 ib_srp ib_cm ib_ipoib 
ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md forcedeth
Pid: 2578, comm: ib_cm/0 Not tainted 
2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
RIP: 0010:[<ffffffffa00b253a>] 
<ffffffffa00b253a>{:ib_sdp:sdp_connect_handler+186}
RSP: 0018:000001015721bc08  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000010150467040 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000010150467028 RDI: ffffffffa00b8b40
RBP: ffffffffa00b86c0 R08: 00000101542faef8 R09: 00000101542faf08
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000010150467740
R13: 000001015721bd08 R14: 0000000000000000 R15: 0000010150496c00
FS:  0000002a9589db00(0000) GS:ffffffff805a30c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000005b3430 CR3: 0000000000101000 CR4: 00000000000006e0
Process ib_cm/0 (pid: 2578, threadinfo 000001015721a000, task 
00000101567a1030)
Stack: 000001015721bd08 00000101504679a0 0000010150467740 0000010150496c00
       000001015721bd08 0000000000000000 0000000000000000 ffffffffa00b2e80
       0000000000000000 0000000000000000
Call Trace:<ffffffffa00b2e80>{:ib_sdp:sdp_cma_handler+896} 
<ffffffffa0015a3f>{:ib_core:ib_find_cached_gid+239}
       <ffffffffa00a869e>{:rdma_cm:cma_notify_user+30} 
<ffffffffa00a8fd3>{:rdma_cm:cma_req_handler+851}
       <ffffffffa00713da>{:ib_cm:cm_process_work+26} 
<ffffffffa0071dc3>{:ib_cm:cm_req_handler+2307}
       <ffffffffa0072150>{:ib_cm:cm_work_handler+0} 
<ffffffffa0072192>{:ib_cm:cm_work_handler+66}
       <ffffffff80133133>{__wake_up+67} 
<ffffffffa0072150>{:ib_cm:cm_work_handler+0}
       <ffffffff80149110>{worker_thread+496} 
<ffffffff80133070>{default_wake_function+0}
       <ffffffff801330c0>{__wake_up_common+64} 
<ffffffff80133070>{default_wake_function+0}
       <ffffffff8014d630>{keventd_create_kthread+0} 
<ffffffff80148f20>{worker_thread+0}
       <ffffffff8014d630>{keventd_create_kthread+0} 
<ffffffff8014d5e9>{kthread+217}
       <ffffffff8011144b>{child_rip+8} 
<ffffffff8014d630>{keventd_create_kthread+0}
       <ffffffff8014d510>{kthread+0} <ffffffff80111443>{child_rip+0}
       

Code: 0f 0b 15 5e 0b a0 ff ff ff ff 74 01 65 8b 04 25 34 00 00 00
RIP <ffffffffa00b253a>{:ib_sdp:sdp_connect_handler+186} RSP 
<000001015721bc08>
 <0>Kernel panic - not syncing: Oops


From jsquyres at cisco.com  Mon Jul 30 10:03:32 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 30 Jul 2007 13:03:32 -0400
Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <46AE17E7.3020305@opengridcomputing.com>
References: <46ADEE7F.2000005@mellanox.co.il>
	<46AE17E7.3020305@opengridcomputing.com>
Message-ID: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>

Yes, you missed it; the call was over about half an hour ago.  I [re-] 
posted the dial-in info about 3 hours before the call this morning on  
the ewg list.


On Jul 30, 2007, at 12:55 PM, Steve Wise wrote:

> Am I missing the call info?  I tried an older conf id, and it  
> didn't work.  Can you please post the conf call info along with the  
> meeting notification?
>
> Thanks,
>
> Steve.
>
>
> Tziporet Koren wrote:
>> Hi All,
>> We will have our bi-weekly OFED meeting today at 9am PST
>> Agenda:
>> - Status update
>> - Bugzilla cleanup
>> If you have more agenda items please send them
>> Tziporet
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


-- 
Jeff Squyres
Cisco Systems


From hal.rosenstock at gmail.com  Mon Jul 30 10:54:25 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 13:54:25 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] ibdm/src/osm_check.cpp: Add missing
	include file
Message-ID: <f0e08f230707301054n50fe34adn68ddd6779d228d2c@mail.gmail.com>

ibdm/src/osm_check.cpp: Add missing include file

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp
index 49215c2..f24eec6 100644
--- a/ibdm/src/osm_check.cpp
+++ b/ibdm/src/osm_check.cpp
@@ -35,6 +35,7 @@
 #include "Fabric.h"
 #include "SubnMgt.h"
 #include "CredLoops.h"
+#include <unistd.h>
 #include <getopt.h>
 #include <fstream>


From pw at osc.edu  Mon Jul 30 11:23:40 2007
From: pw at osc.edu (Pete Wyckoff)
Date: Mon, 30 Jul 2007 14:23:40 -0400
Subject: [ofa-general] Announcing new open source iSER (iSCSI/RDMA) target
Message-ID: <20070730182340.GI12789@osc.edu>

We are releasing code to add support for iSCSI Extensions for RDMA
(iSER) to the existing STGT user space SCSI target.  It uses
OpenFabrics libraries and kernel drivers to act as a SCSI target
over RDMA-capable devices.  The code has been tested against
the existing Linux iSER initiator over InfiniBand cards, but
should be specification compliant and work generally.

A bit of documentation is included, and a short technical report is
available at http://www.osc.edu/~pw/papers/iser-techreport.pdf .
For performance, a single SCSI client using iSCSI over gigabit
ethernet does 100 MB/s.  iSCSI over IPoIB gets 200 MB/s, and iSER
over native IB sees 500 MB/s.

More information on STGT is available at http://stgt.berlios.de .

The seven iSER patches can be downloaded from:

	git://git.osc.edu/tgt

or browsed at:

	http://git.osc.edu/?p=tgt.git;a=summary

New and modified files are distributed under a GPLv2 license.  I'll
submit individual patches to stgt-devel for review and eventual
inclusion in STGT.

		-- Pete


From pradeeps at linux.vnet.ibm.com  Mon Jul 30 12:07:45 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 30 Jul 2007 12:07:45 -0700
Subject: NOSRQ QP implementation issues (was Re: [ofa-general] Merge window
	for 2.6.23 closed)
In-Reply-To: <adak5sr231k.fsf@cisco.com>
References: <adak5sr231k.fsf@cisco.com>
Message-ID: <46AE3701.40603@linux.vnet.ibm.com>

Roland Dreier wrote:

> 
>  - IPoIB CM without SRQ.  Pradeep, I'm sorry this missed the window
>    but the patch quality really doesn't look up to par to me, and
>    your being in a rush to get this merged I think has actually slowed
>    things up.  I think the basic idea is OK, but I have doubts about
>    a static array as a data structure, and MST's comments about not
>    dealing with remote implementations that send packets on passive
>    connections looks quite serious as well.  I would like to close
>    this for 2.6.24 so (as above) please let's keep working this and
>    not wait for the 2.6.24 merge window.
> 

For sending (both on the active and passive side) the skbs are associated 
with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side)
and WRs are posted to receive packets. An skb (for send) is not associated 
with SQ of the rx_qp. Therefore, no packets are expected to be sent through
the rx_qp.

In an erroneous case if packets do get sent to the wrong RQ, then they will
get dropped as no WQEs are posted. As discussed, an RNR will be returned as
expected and a new connection will get established. I still see no issues 
with this either.

If in the future, we do want to use the unused SQ and RQs, then we will have
to associate them with corresponding QP at the remote end. This will be work
for both the SRQ and non-SRQ case.

I do not see any issues. Can you please explain what is missing with this 
implementation?

Pradeep


From parks at lanl.gov  Mon Jul 30 12:18:43 2007
From: parks at lanl.gov (Parks Fields)
Date: Mon, 30 Jul 2007 13:18:43 -0600
Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
References: <46ADEE7F.2000005@mellanox.co.il>
	<46AE17E7.3020305@opengridcomputing.com>
	<0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
Message-ID: <7.0.1.0.2.20070730131818.02838a90@lanl.gov>

At 11:03 AM 7/30/2007, Jeff Squyres wrote:
>Yes, you missed it; the call was over about half an hour ago.  I 
>[re-] posted the dial-in info about 3 hours before the call this morning on
>the ewg list.


I am on the EWG list and didn't see it. :-(


>On Jul 30, 2007, at 12:55 PM, Steve Wise wrote:
>
>>Am I missing the call info?  I tried an older conf id, and it
>>didn't work.  Can you please post the conf call info along with the
>>meeting notification?
>>
>>Thanks,
>>
>>Steve.
>>
>>
>>Tziporet Koren wrote:
>>>Hi All,
>>>We will have our bi-weekly OFED meeting today at 9am PST
>>>Agenda:
>>>- Status update
>>>- Bugzilla cleanup
>>>If you have more agenda items please send them
>>>Tziporet
>
>
>--
>Jeff Squyres
>Cisco Systems
>

                    ***** Correspondence *****

This email contains no programmatic content that requires independent 
ADC review  


From jimmmott at austin.rr.com  Mon Jul 30 12:18:32 2007
From: jimmmott at austin.rr.com (Jim Mott)
Date: Mon, 30 Jul 2007 14:18:32 -0500
Subject: [ofa-general] SDP kernel Oops.
In-Reply-To: <46AE183F.5090907@psc.edu>
References: <46AE183F.5090907@psc.edu>
Message-ID: <004201c7d2de$6d1dca30$47595e90$@rr.com>

Hi,
  It appears that this is an illegal instruction (illegal operand) trap in a
modified Rhat4U4 kernel. I am not sure about the line number, but perhaps
sdp_cma_handler() is processing an RDMA_CM_EVENT_ROUTE_RESOLVED event. 


A few things might help:

1) Get some debug info
If the whole system does not crash, could you collect some debug information
from ib_sdp?
  - dmesg -c (to clear)
  - echo 1 > /sys/module/ib_sdp/debug_level
  - Run your app
  - dmesg > xxx
This will collect some flow information that can help.

2) OFED 1.2
I am the new guy here and have not worked with SDP from OFED 1.1.  I am
looking at the code now, but am much more familiar with 1.2.  This problem
is in the CM/CMA area and my understanding is that there were quite a few
fixes there.

Sorry to not be more helpful.
Jim



From jsquyres at cisco.com  Mon Jul 30 12:25:45 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 30 Jul 2007 15:25:45 -0400
Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <7.0.1.0.2.20070730131818.02838a90@lanl.gov>
References: <46ADEE7F.2000005@mellanox.co.il>
	<46AE17E7.3020305@opengridcomputing.com>
	<0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
	<7.0.1.0.2.20070730131818.02838a90@lanl.gov>
Message-ID: <37A4EC2D-8A1E-481A-A149-8ED10215AFB3@cisco.com>

[shrug]  I sent it today at 8:21am US Eastern time:

     http://lists.openfabrics.org/pipermail/ewg/2007-July/004075.html

Maybe check your spam folder?


On Jul 30, 2007, at 3:18 PM, Parks Fields wrote:

> At 11:03 AM 7/30/2007, Jeff Squyres wrote:
>> Yes, you missed it; the call was over about half an hour ago.  I  
>> [re-] posted the dial-in info about 3 hours before the call this  
>> morning on
>> the ewg list.
>
>
>
> I am on the EWG list and didn't see it. :-(
>
>
>
>
>> On Jul 30, 2007, at 12:55 PM, Steve Wise wrote:
>>
>>> Am I missing the call info?  I tried an older conf id, and it
>>> didn't work.  Can you please post the conf call info along with the
>>> meeting notification?
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>>
>>> Tziporet Koren wrote:
>>>> Hi All,
>>>> We will have our bi-weekly OFED meeting today at 9am PST
>>>> Agenda:
>>>> - Status update
>>>> - Bugzilla cleanup
>>>> If you have more agenda items please send them
>>>> Tziporet
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>


-- 
Jeff Squyres
Cisco Systems


From hal.rosenstock at gmail.com  Mon Jul 30 12:54:11 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 15:54:11 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Some
	comment fixes
Message-ID: <f0e08f230707301254r43d81d0cv577c583fb4a328e7@mail.gmail.com>

include/iba/ib_types.h: Some comment fixes

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index f341a37..358cd62 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -4931,7 +4931,7 @@ ib_port_info_get_mtu_cap(
 *              [in] Pointer to a PortInfo attribute.
 *
 * RETURN VALUES
-*      Returns the LMC value assigned to this port.
+*      Returns the encoded value for the maximum MTU supported by this port.
 *
 * NOTES
 *
@@ -4943,7 +4943,7 @@ ib_port_info_get_mtu_cap(
 *      ib_port_info_get_neighbor_mtu
 *
 * DESCRIPTION
-*      Returns the encoded value for the maximum MTU supported by this port.
+*      Returns the encoded value for the neighbor MTU supported by this port.
 *
 * SYNOPSIS
 */


From rdreier at cisco.com  Mon Jul 30 13:11:41 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:11:41 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipath -- bug fixes in for-roland
In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
	(Arthur Jones's message of "Mon, 30 Jul 2007 08:05:55 -0700")
References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com>
Message-ID: <adalkcxslc2.fsf@cisco.com>

thanks, applied all 4.


From rdreier at cisco.com  Mon Jul 30 13:18:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:18:42 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <adahcnlsl0d.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get some small fixes for 2.6.23:

Dave Olson (4):
      IB/ipath: Remove unsafe fastrcvint code from interrupt handler
      IB/ipath: Use faster put_tid_2 routine after initialization
      IB/ipath: Fix some issues with buffer cancel and sendctrl register update
      IB/ipath: Workaround problem of errormask register being overwritten

Hoang-Nam Nguyen (2):
      IB/ehca: Fix include order to better match kernel style
      IB/ehca: Move extern declarations from .c files to .h files

Jack Morgenstein (1):
      mlx4_core: Remove kfree() in mlx4_mr_alloc() error flow

Roland Dreier (1):
      IB/mlx4: Whitespace fix

Tom Tucker (1):
      RDMA/amso1100: Initialize the wait_queue_head_t in the c2_qp structure

 drivers/infiniband/hw/amso1100/c2_qp.c        |    1 +
 drivers/infiniband/hw/ehca/ehca_classes.h     |    1 +
 drivers/infiniband/hw/ehca/ehca_mrmw.c        |    6 +--
 drivers/infiniband/hw/ehca/ehca_pd.c          |    1 -
 drivers/infiniband/hw/ehca/hcp_if.c           |    1 -
 drivers/infiniband/hw/ehca/ipz_pt_fn.h        |    2 +
 drivers/infiniband/hw/ipath/ipath_common.h    |    3 +-
 drivers/infiniband/hw/ipath/ipath_driver.c    |   11 +++--
 drivers/infiniband/hw/ipath/ipath_iba6120.c   |   20 +++++---
 drivers/infiniband/hw/ipath/ipath_init_chip.c |    7 ++-
 drivers/infiniband/hw/ipath/ipath_intr.c      |   63 ++++++------------------
 drivers/infiniband/hw/ipath/ipath_kernel.h    |   13 +----
 drivers/infiniband/hw/ipath/ipath_stats.c     |   54 +++++++++++++++++++--
 drivers/infiniband/hw/mlx4/qp.c               |    1 -
 drivers/net/mlx4/mr.c                         |   15 +-----
 15 files changed, 101 insertions(+), 98 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
 	qp->send_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge;
+	init_waitqueue_head(&qp->wait);
 
 	/* Initialize the SQ MQ */
 	q_size = be32_to_cpu(reply->sq_depth);
diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 3725aa8..b5e9603 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -322,6 +322,7 @@ extern int ehca_static_rate;
 extern int ehca_port_act_time;
 extern int ehca_use_hp_mr;
 extern int ehca_scaling_code;
+extern int ehca_mr_largepage;
 
 struct ipzu_queue_resp {
 	u32 qe_size;      /* queue entry size */
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index c1b868b..d97eda3 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -40,10 +40,10 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <rdma/ib_umem.h>
-
 #include <asm/current.h>
 
+#include <rdma/ib_umem.h>
+
 #include "ehca_iverbs.h"
 #include "ehca_mrmw.h"
 #include "hcp_if.h"
@@ -64,8 +64,6 @@ enum ehca_mr_pgsize {
 	EHCA_MR_PGSIZE16M = 0x1000000L
 };
 
-extern int ehca_mr_largepage;
-
 static u32 ehca_encode_hwpage_size(u32 pgsize)
 {
 	u32 idx = 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c
index 3dafd7f..43bcf08 100644
--- a/drivers/infiniband/hw/ehca/ehca_pd.c
+++ b/drivers/infiniband/hw/ehca/ehca_pd.c
@@ -88,7 +88,6 @@ int ehca_dealloc_pd(struct ib_pd *pd)
 	u32 cur_pid = current->tgid;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
 	int i, leftovers = 0;
-	extern struct kmem_cache *small_qp_cache;
 	struct ipz_small_queue_page *page, *tmp;
 
 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c
index fdbfebe..24f4541 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -758,7 +758,6 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle,
 			     const u64 logical_address_of_page,
 			     const u64 count)
 {
-	extern int ehca_debug_level;
 	u64 ret;
 
 	if (unlikely(ehca_debug_level >= 2)) {
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
index c6937a0..a801274 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h
@@ -54,6 +54,8 @@
 struct ehca_pd;
 struct ipz_small_queue_page;
 
+extern struct kmem_cache *small_qp_cache;
+
 /* struct generic ehca page */
 struct ipz_page {
 	u8 entries[EHCA_PAGESIZE];
diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h
index b4b786d..6ad822c 100644
--- a/drivers/infiniband/hw/ipath/ipath_common.h
+++ b/drivers/infiniband/hw/ipath/ipath_common.h
@@ -100,8 +100,7 @@ struct infinipath_stats {
 	__u64 sps_hwerrs;
 	/* number of times IB link changed state unexpectedly */
 	__u64 sps_iblink;
-	/* kernel receive interrupts that didn't read intstat */
-	__u64 sps_fastrcvint;
+	__u64 sps_unused; /* was fastrcvint, no longer implemented */
 	/* number of kernel (port0) packets received */
 	__u64 sps_port0pkts;
 	/* number of "ethernet" packets sent by driver */
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 09c5fd8..6ccba36 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -740,7 +740,7 @@ void ipath_disarm_piobufs(struct ipath_devdata *dd, unsigned first,
 	 * pioavail updates to memory to stop.
 	 */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-			 sendorig & ~IPATH_S_PIOBUFAVAILUPD);
+			 sendorig & ~INFINIPATH_S_PIOBUFAVAILUPD);
 	sendorig = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
 			 dd->ipath_sendctrl);
@@ -1614,7 +1614,7 @@ int ipath_waitfor_mdio_cmdready(struct ipath_devdata *dd)
  * it's safer to always do it.
  * PIOAvail bits are updated by the chip as if normal send had happened.
  */
-void ipath_cancel_sends(struct ipath_devdata *dd)
+void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl)
 {
 	ipath_dbg("Cancelling all in-progress send buffers\n");
 	dd->ipath_lastcancel = jiffies+HZ/2; /* skip armlaunch errs a bit */
@@ -1627,6 +1627,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd)
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_disarm_piobufs(dd, 0,
 		(unsigned)(dd->ipath_piobcnt2k + dd->ipath_piobcnt4k));
+	if (restore_sendctrl) /* else done by caller later */
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
+				 dd->ipath_sendctrl);
 
 	/* and again, be sure all have hit the chip */
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
@@ -1655,7 +1658,7 @@ static void ipath_set_ib_lstate(struct ipath_devdata *dd, int which)
 	/* flush all queued sends when going to DOWN or INIT, to be sure that
 	 * they don't block MAD packets */
 	if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT)
-		ipath_cancel_sends(dd);
+		ipath_cancel_sends(dd, 1);
 
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl,
 			 dd->ipath_ibcctrl | which);
@@ -2000,7 +2003,7 @@ void ipath_shutdown_device(struct ipath_devdata *dd)
 
 	ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_DISABLE <<
 			    INFINIPATH_IBCC_LINKINITCMD_SHIFT);
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 
 	/* disable IBC */
 	dd->ipath_control &= ~INFINIPATH_C_LINKENABLE;
diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index 9868ccd..5b6ac9a 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -321,6 +321,8 @@ static const struct ipath_hwerror_msgs ipath_6120_hwerror_msgs[] = {
 		        << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT)
 
 static int ipath_pe_txe_recover(struct ipath_devdata *);
+static void ipath_pe_put_tid_2(struct ipath_devdata *, u64 __iomem *,
+			       u32, unsigned long);
 
 /**
  * ipath_pe_handle_hwerrors - display hardware errors.
@@ -555,8 +557,11 @@ static int ipath_pe_boardname(struct ipath_devdata *dd, char *name,
 		ipath_dev_err(dd, "Unsupported InfiniPath hardware revision %u.%u!\n",
 			      dd->ipath_majrev, dd->ipath_minrev);
 		ret = 1;
-	} else
+	} else {
 		ret = 0;
+		if (dd->ipath_minrev >= 2)
+			dd->ipath_f_put_tid = ipath_pe_put_tid_2;
+	}
 
 	return ret;
 }
@@ -1220,7 +1225,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port)
 		 port * dd->ipath_rcvtidcnt * sizeof(*tidbase));
 
 	for (i = 0; i < dd->ipath_rcvtidcnt; i++)
-		ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED,
+		dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED,
 				 tidinv);
 
 	tidbase = (u64 __iomem *)
@@ -1229,7 +1234,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port)
 		 port * dd->ipath_rcvegrcnt * sizeof(*tidbase));
 
 	for (i = 0; i < dd->ipath_rcvegrcnt; i++)
-		ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER,
+		dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER,
 				 tidinv);
 }
 
@@ -1395,10 +1400,11 @@ void ipath_init_iba6120_funcs(struct ipath_devdata *dd)
 	dd->ipath_f_quiet_serdes = ipath_pe_quiet_serdes;
 	dd->ipath_f_bringup_serdes = ipath_pe_bringup_serdes;
 	dd->ipath_f_clear_tids = ipath_pe_clear_tids;
-	if (dd->ipath_minrev >= 2)
-		dd->ipath_f_put_tid = ipath_pe_put_tid_2;
-	else
-		dd->ipath_f_put_tid = ipath_pe_put_tid;
+	/*
+	 * this may get changed after we read the chip revision,
+	 * but we start with the safe version for all revs
+	 */
+	dd->ipath_f_put_tid = ipath_pe_put_tid;
 	dd->ipath_f_cleanup = ipath_setup_pe_cleanup;
 	dd->ipath_f_setextled = ipath_setup_pe_setextled;
 	dd->ipath_f_get_base_info = ipath_pe_get_base_info;
diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index 49951d5..9dd0bac 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -782,7 +782,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 	 * Follows early_init because some chips have to initialize
 	 * PIO buffers in early_init to avoid false parity errors.
 	 */
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 
 	/* early_init sets rcvhdrentsize and rcvhdrsize, so this must be
 	 * done after early_init */
@@ -851,13 +851,14 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrmask,
 			 dd->ipath_hwerrmask);
 
-	dd->ipath_maskederrs = dd->ipath_ignorederrs;
 	/* clear all */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, -1LL);
 	/* enable errors that are masked, at least this first time. */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
 			 ~dd->ipath_maskederrs);
-	/* clear any interrups up to this point (ints still not enabled) */
+	dd->ipath_errormask = ipath_read_kreg64(dd,
+		dd->ipath_kregs->kr_errormask);
+	/* clear any interrupts up to this point (ints still not enabled) */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL);
 
 	/*
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 1fd91c5..b29fe7e 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -303,7 +303,7 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd,
 		 * Flush all queued sends when link went to DOWN or INIT,
 		 * to be sure that they don't block SMA and other MAD packets
 		 */
-		ipath_cancel_sends(dd);
+		ipath_cancel_sends(dd, 1);
 	}
 	else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM ||
 	    lstate == IPATH_IBSTATE_ACTIVE) {
@@ -517,10 +517,7 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs)
 
 	supp_msgs = handle_frequent_errors(dd, errs, msg, &noprint);
 
-	/*
-	 * don't report errors that are masked (includes those always
-	 * ignored)
-	 */
+	/* don't report errors that are masked */
 	errs &= ~dd->ipath_maskederrs;
 
 	/* do these first, they are most important */
@@ -566,19 +563,19 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs)
 		 * ones on this particular interrupt, which also isn't great
 		 */
 		dd->ipath_maskederrs |= dd->ipath_lasterror | errs;
+		dd->ipath_errormask &= ~dd->ipath_maskederrs;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-				 ~dd->ipath_maskederrs);
+			dd->ipath_errormask);
 		s_iserr = ipath_decode_err(msg, sizeof msg,
-				 (dd->ipath_maskederrs & ~dd->
-				  ipath_ignorederrs));
+			dd->ipath_maskederrs);
 
-		if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) &
+		if (dd->ipath_maskederrs &
 			~(INFINIPATH_E_RRCVEGRFULL |
 			INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS))
 			ipath_dev_err(dd, "Temporarily disabling "
 			    "error(s) %llx reporting; too frequent (%s)\n",
-				(unsigned long long) (dd->ipath_maskederrs &
-				~dd->ipath_ignorederrs), msg);
+				(unsigned long long)dd->ipath_maskederrs,
+				msg);
 		else {
 			/*
 			 * rcvegrfull and rcvhdrqfull are "normal",
@@ -793,19 +790,22 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	/* disable error interrupts, to avoid confusion */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL);
 
+	/* also disable interrupts; errormask is sometimes overwritten */
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL);
+
 	/*
 	 * clear all sends, because they have may been
 	 * completed by usercode while in freeze mode, and
 	 * therefore would not be sent, and eventually
 	 * might cause the process to run out of bufs
 	 */
-	ipath_cancel_sends(dd);
+	ipath_cancel_sends(dd, 0);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
 			 dd->ipath_control);
 
 	/* ensure pio avail updates continue */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
-		 dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD);
+		 dd->ipath_sendctrl & ~INFINIPATH_S_PIOBUFAVAILUPD);
 	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
 		 dd->ipath_sendctrl);
@@ -817,7 +817,7 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	for (i = 0; i < dd->ipath_pioavregs; i++) {
 		/* deal with 6110 chip bug */
 		im = i > 3 ? ((i&1) ? i-1 : i+1) : i;
-		val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64)));
+		val = ipath_read_kreg64(dd, (0x1000/sizeof(u64))+im);
 		dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i]
 			= le64_to_cpu(val);
 	}
@@ -832,7 +832,8 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear,
 		E_SPKT_ERRS_IGNORE);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-		~dd->ipath_maskederrs);
+		dd->ipath_errormask);
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, -1LL);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL);
 }
 
@@ -1002,7 +1003,6 @@ irqreturn_t ipath_intr(int irq, void *data)
 	u32 istat, chk0rcv = 0;
 	ipath_err_t estat = 0;
 	irqreturn_t ret;
-	u32 oldhead, curtail;
 	static unsigned unexpected = 0;
 	static const u32 port0rbits = (1U<<INFINIPATH_I_RCVAVAIL_SHIFT) |
 		 (1U<<INFINIPATH_I_RCVURG_SHIFT);
@@ -1035,36 +1035,6 @@ irqreturn_t ipath_intr(int irq, void *data)
 		goto bail;
 	}
 
-	/*
-	 * We try to avoid reading the interrupt status register, since
-	 * that's a PIO read, and stalls the processor for up to about
-	 * ~0.25 usec. The idea is that if we processed a port0 packet,
-	 * we blindly clear the  port 0 receive interrupt bits, and nothing
-	 * else, then return.  If other interrupts are pending, the chip
-	 * will re-interrupt us as soon as we write the intclear register.
-	 * We then won't process any more kernel packets (if not the 2nd
-	 * time, then the 3rd or 4th) and we'll then handle the other
-	 * interrupts.   We clear the interrupts first so that we don't
-	 * lose intr for later packets that arrive while we are processing.
-	 */
-	oldhead = dd->ipath_port0head;
-	curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr);
-	if (oldhead != curtail) {
-		if (dd->ipath_flags & IPATH_GPIO_INTR) {
-			ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear,
-					 (u64) (1 << IPATH_GPIO_PORT0_BIT));
-			istat = port0rbits | INFINIPATH_I_GPIO;
-		}
-		else
-			istat = port0rbits;
-		ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat);
-		ipath_kreceive(dd);
-		if (oldhead != dd->ipath_port0head) {
-			ipath_stats.sps_fastrcvint++;
-			goto done;
-		}
-	}
-
 	istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus);
 
 	if (unlikely(!istat)) {
@@ -1225,7 +1195,6 @@ irqreturn_t ipath_intr(int irq, void *data)
 		handle_layer_pioavail(dd);
 	}
 
-done:
 	ret = IRQ_HANDLED;
 
 bail:
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index ace63ef..7a7966f 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -261,18 +261,10 @@ struct ipath_devdata {
 	 * limiting of hwerror reporting
 	 */
 	ipath_err_t ipath_lasthwerror;
-	/*
-	 * errors masked because they occur too fast, also includes errors
-	 * that are always ignored (ipath_ignorederrs)
-	 */
+	/* errors masked because they occur too fast */
 	ipath_err_t ipath_maskederrs;
 	/* time in jiffies at which to re-enable maskederrs */
 	unsigned long ipath_unmasktime;
-	/*
-	 * errors always ignored (masked), at least for a given
-	 * chip/device, because they are wrong or not useful
-	 */
-	ipath_err_t ipath_ignorederrs;
 	/* count of egrfull errors, combined for all ports */
 	u64 ipath_last_tidfull;
 	/* for ipath_qcheck() */
@@ -436,6 +428,7 @@ struct ipath_devdata {
 	u64 ipath_lastibcstat;
 	/* hwerrmask shadow */
 	ipath_err_t ipath_hwerrmask;
+	ipath_err_t ipath_errormask; /* errormask shadow */
 	/* interrupt config reg shadow */
 	u64 ipath_intconfig;
 	/* kr_sendpiobufbase value */
@@ -683,7 +676,7 @@ int ipath_unordered_wc(void);
 
 void ipath_disarm_piobufs(struct ipath_devdata *, unsigned first,
 			  unsigned cnt);
-void ipath_cancel_sends(struct ipath_devdata *);
+void ipath_cancel_sends(struct ipath_devdata *, int);
 
 int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *);
 void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *);
diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c
index 73ed17d..bae4f56 100644
--- a/drivers/infiniband/hw/ipath/ipath_stats.c
+++ b/drivers/infiniband/hw/ipath/ipath_stats.c
@@ -196,6 +196,45 @@ static void ipath_qcheck(struct ipath_devdata *dd)
 	}
 }
 
+static void ipath_chk_errormask(struct ipath_devdata *dd)
+{
+	static u32 fixed;
+	u32 ctrl;
+	unsigned long errormask;
+	unsigned long hwerrs;
+
+	if (!dd->ipath_errormask || !(dd->ipath_flags & IPATH_INITTED))
+		return;
+
+	errormask = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errormask);
+
+	if (errormask == dd->ipath_errormask)
+		return;
+	fixed++;
+
+	hwerrs = ipath_read_kreg64(dd, dd->ipath_kregs->kr_hwerrstatus);
+	ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control);
+
+	ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
+		dd->ipath_errormask);
+
+	if ((hwerrs & dd->ipath_hwerrmask) ||
+		(ctrl & INFINIPATH_C_FREEZEMODE)) {
+		/* force re-interrupt of pending events, just in case */
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, 0ULL);
+		ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL);
+		dev_info(&dd->pcidev->dev,
+			"errormask fixed(%u) %lx -> %lx, ctrl %x hwerr %lx\n",
+			fixed, errormask, (unsigned long)dd->ipath_errormask,
+			ctrl, hwerrs);
+	} else
+		ipath_dbg("errormask fixed(%u) %lx -> %lx, no freeze\n",
+			fixed, errormask,
+			(unsigned long)dd->ipath_errormask);
+}
+
+
 /**
  * ipath_get_faststats - get word counters from chip before they overflow
  * @opaque - contains a pointer to the infinipath device ipath_devdata
@@ -251,14 +290,13 @@ void ipath_get_faststats(unsigned long opaque)
 		dd->ipath_lasterror = 0;
 	if (dd->ipath_lasthwerror)
 		dd->ipath_lasthwerror = 0;
-	if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs)
+	if (dd->ipath_maskederrs
 	    && time_after(jiffies, dd->ipath_unmasktime)) {
 		char ebuf[256];
 		int iserr;
 		iserr = ipath_decode_err(ebuf, sizeof ebuf,
-				 (dd->ipath_maskederrs & ~dd->
-				  ipath_ignorederrs));
-		if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) &
+			dd->ipath_maskederrs);
+		if (dd->ipath_maskederrs &
 				~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL |
 				INFINIPATH_E_PKTERRS ))
 			ipath_dev_err(dd, "Re-enabling masked errors "
@@ -278,9 +316,12 @@ void ipath_get_faststats(unsigned long opaque)
 				ipath_cdbg(ERRPKT, "Re-enabling packet"
 						" problem interrupt (%s)\n", ebuf);
 		}
-		dd->ipath_maskederrs = dd->ipath_ignorederrs;
+
+		/* re-enable masked errors */
+		dd->ipath_errormask |= dd->ipath_maskederrs;
 		ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask,
-				 ~dd->ipath_maskederrs);
+			dd->ipath_errormask);
+		dd->ipath_maskederrs = 0;
 	}
 
 	/* limit qfull messages to ~one per minute per port */
@@ -294,6 +335,7 @@ void ipath_get_faststats(unsigned long opaque)
 		}
 	}
 
+	ipath_chk_errormask(dd);
 done:
 	mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5);
 }
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index f6315df..ba0428d 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1209,7 +1209,6 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg,
 	memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av));
 	dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey);
-
 }
 
 static void set_data_seg(struct mlx4_wqe_data_seg *dseg,
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index d0808fa..5b87183 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -255,10 +255,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 	int err;
 
 	index = mlx4_bitmap_alloc(&priv->mr_table.mpt_bitmap);
-	if (index == -1) {
-		err = -ENOMEM;
-		goto err;
-	}
+	if (index == -1)
+		return -ENOMEM;
 
 	mr->iova       = iova;
 	mr->size       = size;
@@ -269,15 +267,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 
 	err = mlx4_mtt_init(dev, npages, page_shift, &mr->mtt);
 	if (err)
-		goto err_index;
-
-	return 0;
-
-err_index:
-	mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
+		mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
 
-err:
-	kfree(mr);
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_alloc);
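
The errormask part of the ipath changes above boils down to keeping a software shadow of the register and restoring the register from it whenever the two disagree. A rough user-space model of that idea follows. It is only a sketch: struct fake_dev and the three function names are hypothetical, and the hw_errormask field stands in for the real chip register.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the device state.  The field names echo the
 * patch, but this is a user-space model and not driver code.
 */
struct fake_dev {
	uint64_t hw_errormask;	/* models the chip's errormask register */
	uint64_t errormask;	/* software shadow of the intended value */
	uint64_t maskederrs;	/* errors temporarily masked as too frequent */
};

/* Temporarily mask noisy errors: clear them in the shadow, then write. */
static void mask_errors(struct fake_dev *d, uint64_t errs)
{
	d->maskederrs |= errs;
	d->errormask &= ~d->maskederrs;
	d->hw_errormask = d->errormask;
}

/* Re-enable everything that was masked and forget the masked set. */
static void unmask_errors(struct fake_dev *d)
{
	d->errormask |= d->maskederrs;
	d->hw_errormask = d->errormask;
	d->maskederrs = 0;
}

/*
 * Periodic check: if the register no longer matches the shadow, restore
 * it.  Returns 1 if a fix was needed, 0 otherwise.
 */
static int check_errormask(struct fake_dev *d)
{
	if (!d->errormask || d->hw_errormask == d->errormask)
		return 0;
	d->hw_errormask = d->errormask;
	return 1;
}
```

The point of the shadow is that the restore path never has to recompute the mask from other state; it only compares and rewrites.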


From rdreier at cisco.com  Mon Jul 30 13:20:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:20:21 -0700
Subject: [ofa-general] Re: [PATCH 2.6.23 1/2] Make the iw_cxgb3 module
	parameters writable.
In-Reply-To: <20070729201226.31659.85900.stgit@dell3.ogc.int> (Steve Wise's
	message of "Sun, 29 Jul 2007 15:12:26 -0500")
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
Message-ID: <adad4y9skxm.fsf@cisco.com>

ugh, missed these before my last merge...

anyway:

why do we want to make the parameters writable?  a good changelog tells me
what, why and how, and this changelog just covered the "what".  Also,
I assume you've checked that it's OK for these variables to change at
any time?


From rdreier at cisco.com  Mon Jul 30 13:22:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:22:01 -0700
Subject: [ofa-general] ipoib question
In-Reply-To: <1185172339.5513.11.camel@mtls03> (Eli Cohen's message of "Mon,
	23 Jul 2007 09:32:19 +0300")
References: <1185172339.5513.11.camel@mtls03>
Message-ID: <ada8x8xskuu.fsf@cisco.com>

 > Roland,
 > 
 > can you explain why you add 1 to the size of the CQ in
 > ipoib_transport_dev_init()?

Not really... I think it's lost in the depths of time, and probably
wrong too.


From rdreier at cisco.com  Mon Jul 30 13:23:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:23:36 -0700
Subject: [ofa-general] PATCH] IB/core: ignore membership bit when looking
	for a P_Key in the table
In-Reply-To: <46A453BE.3030408@gmail.com> (Moni Shoua's message of "Mon,
	23 Jul 2007 10:07:42 +0300")
References: <46A36E77.5020307@gmail.com> <46A453BE.3030408@gmail.com>
Message-ID: <ada4pjlsks7.fsf@cisco.com>

Looks OK I guess.  But it seems that we should fix up the code in
sa_query.c too, right?


From rdreier at cisco.com  Mon Jul 30 13:25:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:25:12 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A536EC.4060201@ichips.intel.com> (Arlin Davis's message of
	"Mon, 23 Jul 2007 16:17:00 -0700")
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	<adamyy27cxk.fsf@cisco.com> <46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
	<46A536EC.4060201@ichips.intel.com>
Message-ID: <adazm1dr653.fsf@cisco.com>

 > Maintainers: please review the following proposal regarding new public
 > download locations/website links and respond. This request originated
 > from xwg.
 > 
 > http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html

I guess it's OK, but what's the difference between a README and a
WEB_README?

Would it make sense to have just one file (maybe in a format that is
easily transformed to HTML, eg reStructuredText) for all purposes?

 - R.


From rdreier at cisco.com  Mon Jul 30 13:25:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:25:26 -0700
Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE
	buffer ownership relaxation]
In-Reply-To: <46A6F17C.8060404@voltaire.com> (Or Gerlitz's message of "Wed,
	25 Jul 2007 09:45:16 +0300")
References: <46A6F17C.8060404@voltaire.com>
Message-ID: <adavec1r64p.fsf@cisco.com>

 > It seems that you have missed this patch, can you have a look?

Sorry, I need to get to this...


From rdreier at cisco.com  Mon Jul 30 13:26:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:26:44 -0700
Subject: [ofa-general] Question on IPoIB start xmit
In-Reply-To: <OF6BB9265C.45F6C80B-ON65257323.00395244-65257323.0039FD17@in.ibm.com>
	(Krishna Kumar2's message of "Wed, 25 Jul 2007 16:03:23 +0530")
References: <OF6BB9265C.45F6C80B-ON65257323.00395244-65257323.0039FD17@in.ibm.com>
Message-ID: <adar6mpr62j.fsf@cisco.com>

 > Pathlookup skb2, Mcast send skb5, Unicast arp send8, Good skb1, Good skb3,
 >       Good skb4, Good skb6, Good skb7, Good skb9
 > 
 > Or is there any requirement or logic that will break unless skbs are sent
 > in the same order
 > that it was received from ULP ?

Just the requirement that the low-level driver not gratuitously
reorder skbs within a flow.  Some small amount of reordering is
probably acceptable if it helps performance a lot.
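
To make "within a flow" concrete, here is a small sketch of a checker for a given send order. It is an illustration only: struct sent_pkt, MAX_FLOWS and per_flow_order_kept are hypothetical names, and how packets are grouped into flows is an assumption left to the caller. Sequence numbers are assumed to be non-negative.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical record of one transmitted packet: the flow it belongs to
 * and its sequence number in the order the ULP handed it down.
 */
struct sent_pkt {
	int flow;
	int seq;
};

#define MAX_FLOWS 16

/*
 * Return 1 if, within every flow, packets went out in increasing seq
 * order.  Reordering between different flows is allowed.
 */
static int per_flow_order_kept(const struct sent_pkt *sent, size_t n)
{
	int last_seq[MAX_FLOWS];
	size_t i;

	for (i = 0; i < MAX_FLOWS; i++)
		last_seq[i] = -1;

	for (i = 0; i < n; i++) {
		int f = sent[i].flow;

		if (f < 0 || f >= MAX_FLOWS)
			return 0;	/* out of range for this sketch */
		if (sent[i].seq <= last_seq[f])
			return 0;	/* went backwards within a flow */
		last_seq[f] = sent[i].seq;
	}
	return 1;
}
```

If skb2, skb5 and skb8 from the example each belong to a different flow than skbs 1, 3, 4, 6, 7 and 9, the quoted send order passes this check. Sending skb3 ahead of skb1 in the same flow would not.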


From pauln at psc.edu  Mon Jul 30 13:26:49 2007
From: pauln at psc.edu (Paul Nowoczynski)
Date: Mon, 30 Jul 2007 16:26:49 -0400
Subject: [ofa-general] SDP kernel Oops.. (OFED-1.2)
Message-ID: <46AE4989.9010508@psc.edu>

Jim,
I just ran with 1.2 and hit the same bug.  I've included the debug msgs 
leading up to the oops (at the bottom).  I think the problem has to do 
with handling a connection request after a socket has been destroyed.  
The failed instance of sdp_connect_handler() doesn't appear to run 
sdp_init_qp() so I assume that it fails somewhere before that.

I wonder if event->param.conn.private_data is bogus?
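
One way to test that theory would be to validate the private data before dereferencing it. The following is a user-space sketch of such a check, not a patch. The struct fake_hh and struct fake_conn_param types and the check_hello function are hypothetical stand-ins, and it assumes the CM event carries a length alongside the private_data pointer:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, cut-down stand-ins for the SDP hello header and the CM
 * connection parameters.  The real layouts live in the kernel sources.
 */
struct fake_hh {
	uint8_t max_adverts;
};

struct fake_conn_param {
	const void *private_data;
	uint8_t private_data_len;
};

/* Return 0 if the hello header looks usable, -EINVAL otherwise. */
static int check_hello(const struct fake_conn_param *p)
{
	const struct fake_hh *h;

	/* Reject a missing or truncated buffer before touching it. */
	if (!p->private_data || p->private_data_len < sizeof(*h))
		return -EINVAL;

	h = p->private_data;
	if (!h->max_adverts)
		return -EINVAL;

	return 0;
}
```

If a check like this rejected the request in the failing case, that would point at the private data rather than at the socket teardown.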

Thanks for your help.
Paul


int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id,
                        struct rdma_cm_event *event)
{
        struct sockaddr_in *dst_addr;
        struct sock *child;
        const struct sdp_hh *h;
        int rc;

        sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id);

        h = event->param.conn.private_data;

        if (!h->max_adverts)
                return -EINVAL;

        child = sk_clone(sk, GFP_KERNEL);
        if (!child)
                return -ENOMEM;

        sdp_add_sock(sdp_sk(child));
        INIT_LIST_HEAD(&sdp_sk(child)->accept_queue);
        INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue);
        INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work,
                          sdp_time_wait_work);
        INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work);

        dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr;
        inet_sk(child)->dport = dst_addr->sin_port;
        inet_sk(child)->daddr = dst_addr->sin_addr.s_addr;

        bh_unlock_sock(child);
        __sock_put(child);

        rc = sdp_init_qp(child, id);
...

################# Console Msgs ###########################################

oss08p login:
Fedora Core release 3 (Heidelberg)
Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64

oss08p login: Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 id 000001015ed2b600
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): RDMA_CM_EVENT_CONNECT_REQUEST
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 000001015ed30c00 -> 000001015ed2b600
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp done
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connect_handler bufs 64 xmit_size_goal 32768
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 handled
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler event 9 id 000001015ed2b600
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): RDMA_CM_EVENT_ESTABLISHED
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connected_handler
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connected_handler child connection established
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler event 9 handled
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): event 9 done. status 0
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_accept: ib_req_notify_cq
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 sk 000001014e790780 newsk 000001014e790040
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_setsockopt

Fedora Core release 3 (Heidelberg)
Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64


oss08p login: Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
sdp_cma_handler event 4 id 000001015ee0c800
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
RDMA_CM_EVENT_CONNECT_REQUEST
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
000001015ed30c00 -> 000001015ee0c800
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp done
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connect_handler 
bufs 64 xmit_size_goal 32768
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
expected 10 *err -22
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 
handled
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
sk 000001014e790780 newsk 0000000000000000
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
expected 10 *err -22
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
sk 000001014e790780 newsk 0000000000000000
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
event 9 id 000001015ee0c800
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
RDMA_CM_EVENT_ESTABLISHED
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connected_handler
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connected_handler 
child connection established
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
event 9 handled
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 9 done. status 0
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
expected 10 *err -22
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_accept: 
ib_req_notify_cq
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 
sk 000001014e790780 newsk 0000010151c457c0
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
expected 10 *err -22
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
sk 000001014e790780 newsk 0000000000000000
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_setsockopt
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_fin
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: entering 
time wait refcnt 2
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: last 
socket put 2
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_unhash
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_handle_wc: 
destroy in time wait state
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
event 10 id 000001015ee0c800
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
RDMA_CM_EVENT_DISCONNECTED
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destroy_work: 
refcnt 1
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_disconnected_handler
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
event 10 handled
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_reset_sk
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 10 done. 
status -104
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk done; 
releasing sock
Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct done

Fedora Core release 3 (Heidelberg)
Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64

oss08p login:
Kernel BUG at sdp_cma:372
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib 
ib_srp ib_sdp rdma_cm ib_addr iw_cm ib_local_sa ib_cm iptable_filter 
ip_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md forcedeth
Pid: 2362, comm: ib_cm/1 Not tainted 
2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
RIP: 0010:[<ffffffffa00abbbf>] 
<ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
RSP: 0018:0000010155f21bc8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640
RBP: ffffffffa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08
R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780
R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4
FS:  0000002a9589db00(0000) GS:ffffffff805a3140(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0
Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task 
00000101565897e0)
Stack: 000001015ede2400 000001014e7909e0 000001014e790780 000001015ede2400
       0000010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581
       000001015ede2400 000001015ede2458
Call Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
<ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
       <ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
<ffffffffa006859a>{:ib_cm:cm_process_work+26}
       <ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
<ffffffffa0069490>{:ib_cm:cm_work_handler+0}
       <ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
<ffffffff80133133>{__wake_up+67}
       <ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
<ffffffff80149110>{worker_thread+496}
       <ffffffff80133070>{default_wake_function+0} 
<ffffffff801330c0>{__wake_up_common+64}
       <ffffffff80133070>{default_wake_function+0} 
<ffffffff8014d630>{keventd_create_kthread+0}
       <ffffffff80148f20>{worker_thread+0} 
<ffffffff8014d630>{keventd_create_kthread+0}
       <ffffffff8014d5e9>{kthread+217} <ffffffff8011144b>{child_rip+8}
       <ffffffff8014d630>{keventd_create_kthread+0} 
<ffffffff8014d510>{kthread+0}
       <ffffffff80111443>{child_rip+0}

Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04
RIP <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP 
<0000010155f21bc8>
 <0>Kernel panic - not syncing: Oops
Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
sdp_cma_handler event 4 id 000001015ede2400
Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
RDMA_CM_EVENT_CONNECT_REQUEST
Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
000001015ed30c00 -> 000001015ede2400
Jul 30 16:10:11 oss08p kernel: ----------- [cut here ] --------- [please 
bite here ] ---------
Jul 30 16:10:12 oss08p kernel: Kernel BUG at sdp_cma:372
Jul 30 16:10:12 oss08p kernel: invalid operand: 0000 [1] SMP
Jul 30 16:10:12 oss08p kernel: CPU 1
Jul 30 16:10:12 oss08p kernel: Modules linked in: ksocklnd ptlrpc 
obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm 
ib_local_sa ib_cm iptable_filter ip_tables e1000 ib_sa ib_uverbs ib_umad 
ib_mthca ib_mad ib_core md forcedeth
Jul 30 16:10:12 oss08p kernel: Pid: 2362, comm: ib_cm/1 Not tainted 
2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
Jul 30 16:10:12 oss08p kernel: RIP: 0010:[<ffffffffa00abbbf>] 
<ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
Jul 30 16:10:12 oss08p kernel: RSP: 0018:0000010155f21bc8  EFLAGS: 00010246
Jul 30 16:10:12 oss08p kernel: RAX: 0000000000000000 RBX: 
0000010150e45800 RCX: 0000000000000000
Jul 30 16:10:12 oss08p kernel: RDX: 00000000ffffffff RSI: 
0000010150e450a8 RDI: ffffffffa00b3640
Jul 30 16:10:12 oss08p kernel: RBP: ffffffffa00b3140 R08: 
0000010155ef8ef8 R09: 0000010155ef8f08
Jul 30 16:10:12 oss08p kernel: R10: 00000000ffffffff R11: 
0000000000000000 R12: 000001014e790780
Jul 30 16:10:12 oss08p kernel: R13: 0000010155f21cf8 R14: 
0000000000000000 R15: 000001015681bfa4
Jul 30 16:10:12 oss08p kernel: FS:  0000002a9589db00(0000) 
GS:ffffffff805a3140(0000) knlGS:0000000000000000
Jul 30 16:10:12 oss08p kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
000000008005003b
Jul 30 16:10:12 oss08p kernel: CR2: 0000002a986adad8 CR3: 
000000007ea38000 CR4: 00000000000006e0
Jul 30 16:10:12 oss08p kernel: Process ib_cm/1 (pid: 2362, threadinfo 
0000010155f20000, task 00000101565897e0)
Jul 30 16:10:12 oss08p kernel: Stack: 000001015ede2400 000001014e7909e0 
000001014e790780 000001015ede2400
Jul 30 16:10:12 oss08p kernel:        0000010155f21cf8 0000000000000000 
0000000000000000 ffffffffa00ac581
Jul 30 16:10:12 oss08p kernel:        000001015ede2400 000001015ede2458
Jul 30 16:10:12 oss08p kernel: Call 
Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
<ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
Jul 30 16:10:12 oss08p kernel:        
<ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
<ffffffffa006859a>{:ib_cm:cm_process_work+26}
Jul 30 16:10:12 oss08p kernel:        
<ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
<ffffffffa0069490>{:ib_cm:cm_work_handler+0}
Jul 30 16:10:12 oss08p kernel:        
<ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
<ffffffff80133133>{__wake_up+67}
Jul 30 16:10:12 oss08p kernel:        
<ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
<ffffffff80149110>{worker_thread+496}
Jul 30 16:10:12 oss08p kernel:        
<ffffffff80133070>{default_wake_function+0} 
<ffffffff801330c0>{__wake_up_common+64}
Jul 30 16:10:12 oss08p kernel:        
<ffffffff80133070>{default_wake_function+0} 
<ffffffff8014d630>{keventd_create_kthread+0}
Jul 30 16:10:12 oss08p kernel:        
<ffffffff80148f20>{worker_thread+0} 
<ffffffff8014d630>{keventd_create_kthread+0}
Jul 30 16:10:12 oss08p kernel:        <ffffffff8014d5e9>{kthread+217} 
<ffffffff8011144b>{child_rip+8}
Jul 30 16:10:12 oss08p kernel:        
<ffffffff8014d630>{keventd_create_kthread+0} <ffffffff8014d510>{kthread+0}
Jul 30 16:10:12 oss08p kernel:        <ffffffff80111443>{child_rip+0}
Jul 30 16:10:12 oss08p kernel:
Jul 30 16:10:12 oss08p kernel: Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 
66 66 90 66 90 65 8b 04
Jul 30 16:10:12 oss08p kernel: RIP 
<ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP <0000010155f21bc8>
Jul 30 16:10:12 oss08p kernel:  <0>Kernel panic - not syncing: Oops


From fubar at us.ibm.com  Mon Jul 30 13:29:44 2007
From: fubar at us.ibm.com (Jay Vosburgh)
Date: Mon, 30 Jul 2007 13:29:44 -0700
Subject: [ofa-general] Re: [PATCH V3 7/7] net/bonding: Delay sending of
	gratuitous ARP to avoid failure 
In-Reply-To: <46ADDFE6.9000609@voltaire.com> 
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com>
Message-ID: <19319.1185827384@death>


Moni Shoua <monis at voltaire.com> wrote:

>Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit
>in dev->state field is on. This improves the chances for the arp packet to
>be transmitted.

	Under what circumstances were you seeing problems that delaying
the gratuitous ARP until linkwatch is done improves things?  Is this
really an IB thing, or did you experience problems here over regular
ethernet?

>Signed-off-by: Moni Shoua <monis at voltaire.com>
>---
> drivers/net/bonding/bond_main.c |   25 +++++++++++++++++++++----
> drivers/net/bonding/bonding.h   |    1 +
> 2 files changed, 22 insertions(+), 4 deletions(-)
>
>Index: net-2.6/drivers/net/bonding/bond_main.c
>===================================================================
>--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:33:25.000000000 +0300
>+++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-26 18:42:59.296296622 +0300
>@@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon
> 		if (new_active && !bond->do_set_mac_addr)
> 			memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
> 				new_active->dev->addr_len);
>-
>-		bond_send_gratuitous_arp(bond);
>+		if (bond->curr_active_slave &&
>+			test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)){
>+			dprintk("delaying gratuitous arp on %s\n",bond->curr_active_slave->dev->name);
>+			bond->send_grat_arp=1;
>+		}else{
>+			bond_send_gratuitous_arp(bond);
>+		}

	Style issues throughout the patch series: many lines are too
long, many things are all smashed together, e.g., "}else{" instead of
"} else {", "send_grat_arp=1" instead of "send_grat_arp = 1", and so on.
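	To make the style point concrete, here is a rough user-space
sketch of the same decision laid out the way the kernel coding style
expects (spaces around "=", "} else {", short lines).  The mock_* types
and functions are hypothetical stand-ins, not bonding driver code, and
the boolean field only models the __LINK_STATE_LINKWATCH_PENDING bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel objects touched by the patch. */
struct mock_slave {
	const char *name;
	bool linkwatch_pending;	/* models __LINK_STATE_LINKWATCH_PENDING */
};

struct mock_bond {
	struct mock_slave *curr_active_slave;
	int send_grat_arp;	/* set when the ARP has been deferred */
	int arps_sent;		/* counts calls to the mock send routine */
};

static void mock_send_gratuitous_arp(struct mock_bond *bond)
{
	bond->arps_sent++;
}

/* Same decision as the quoted hunk, formatted in kernel coding style. */
static void mock_change_active_slave(struct mock_bond *bond)
{
	struct mock_slave *slave;

	assert(bond != NULL);
	slave = bond->curr_active_slave;

	if (slave && slave->linkwatch_pending) {
		printf("delaying gratuitous arp on %s\n", slave->name);
		bond->send_grat_arp = 1;
	} else {
		mock_send_gratuitous_arp(bond);
	}
}
```

	Pulling the slave pointer into a local variable is one way to
keep the test under 80 columns; it is a suggestion only.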

> 	}
> }
>
>@@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device 
> 	 * program could monitor the link itself if needed.
> 	 */
>
>+	if (bond->send_grat_arp) {
>+		if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state))
>+			dprintk("Needs to send gratuitous arp but not yet\n",__FUNCTION__);
>+		else {
>+			dprintk("sending delayed gratuitous arp on ond->curr_active_slave->dev->name\n");
>+			bond_send_gratuitous_arp(bond);
>+			bond->send_grat_arp=0;
>+		}
>+	}


> 	read_lock(&bond->curr_slave_lock);
> 	oldcurrent = bond->curr_active_slave;
> 	read_unlock(&bond->curr_slave_lock);
>@@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str
> 	struct slave *slave = bond->curr_active_slave;
> 	struct vlan_entry *vlan;
> 	struct net_device *vlan_dev;
>+	int i;
>
> 	dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name,
> 				slave ? slave->dev->name : "NULL");
>@@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str
> 		return;
>
> 	if (bond->master_ip) {
>-		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>-				  bond->master_ip, 0);
>+		for (i=0;i<3;i++)
>+			bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>+					  bond->master_ip, 0);
> 	}

	If you delay the grat ARP until linkwatch is done, why is it
also necessary to shotgun several ARPs instead of one?  Why are the ARPs
sent for VLANs not also shotgunned in a similar fashion?

	If shotgunning like this really is useful, would it not make
more sense to space them out a bit, e.g., over successive monitor
passes?
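	As a sketch of what spacing them out could look like, assuming
the flag becomes a small counter that the monitor drains one ARP per
pass: everything below is a user-space mock with made-up names
(mock_bond, grat_arps_left, GRAT_ARP_REPEATS), not a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

#define GRAT_ARP_REPEATS 3	/* hypothetical repeat count */

/* Hypothetical model of the bond state relevant to deferred ARPs. */
struct mock_bond {
	bool linkwatch_pending;	/* models the linkwatch-pending bit */
	int grat_arps_left;	/* counter replacing the 0/1 flag */
	int arps_sent;		/* how many ARPs the mock has "sent" */
};

/* Called on failover: request a burst, but do not send anything yet. */
static void mock_schedule_grat_arp(struct mock_bond *bond)
{
	bond->grat_arps_left = GRAT_ARP_REPEATS;
}

/*
 * One monitor pass: send at most one ARP, and none at all while
 * linkwatch is still pending on the active slave.
 */
static void mock_mii_monitor_pass(struct mock_bond *bond)
{
	assert(bond->grat_arps_left >= 0);

	if (bond->grat_arps_left == 0 || bond->linkwatch_pending)
		return;

	bond->arps_sent++;
	bond->grat_arps_left--;
}
```

	With this shape the ARPs are separated by the monitor interval
rather than sent back to back, and the same counter could cover the
VLAN ARPs as well.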

> 	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>@@ -4331,6 +4347,7 @@ static int bond_init(struct net_device *
> 	bond->current_arp_slave = NULL;
> 	bond->primary_slave = NULL;
> 	bond->dev = bond_dev;
>+	bond->send_grat_arp=0;
> 	INIT_LIST_HEAD(&bond->vlan_list);
>
> 	/* Initialize the device entry points */
>Index: net-2.6/drivers/net/bonding/bonding.h
>===================================================================
>--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-07-25 15:20:10.000000000 +0300
>+++ net-2.6/drivers/net/bonding/bonding.h	2007-07-26 18:42:43.652087660 +0300
>@@ -203,6 +203,7 @@ struct bonding {
> 	struct   vlan_group *vlgrp;
> 	struct   packet_type arp_mon_pt;
> 	s8       do_set_mac_addr;
>+	int	 send_grat_arp;

	This need not be a full int, and (this applies to
do_set_mac_addr, also) could probably be squeezed into gaps already
existing within the struct bonding somewhere.
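	A quick way to see the effect of filling existing gaps is to
compare two layouts with sizeof and offsetof.  The structs below are
invented for illustration and do not reproduce the real struct bonding;
only the two flag names are borrowed from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flags scattered between pointers: each one drags padding along. */
struct loose_layout {
	void *vlgrp;
	int8_t do_set_mac_addr;
	void *other_ptr;	/* hypothetical neighbouring member */
	int send_grat_arp;
	void *another_ptr;	/* hypothetical neighbouring member */
};

/* Both flags are one byte wide and share a single padding gap. */
struct tight_layout {
	void *vlgrp;
	int8_t do_set_mac_addr;
	int8_t send_grat_arp;
	void *other_ptr;
	void *another_ptr;
};

/* Packing the flags together must never make the struct larger. */
static_assert(sizeof(struct tight_layout) <= sizeof(struct loose_layout),
	      "tight layout should not be larger than loose layout");

static void print_layouts(void)
{
	printf("loose: size %zu, send_grat_arp at offset %zu\n",
	       sizeof(struct loose_layout),
	       offsetof(struct loose_layout, send_grat_arp));
	printf("tight: size %zu, send_grat_arp at offset %zu\n",
	       sizeof(struct tight_layout),
	       offsetof(struct tight_layout, send_grat_arp));
}
```

	On a typical LP64 ABI the first layout should come out at 40
bytes and the second at 32, but the exact numbers depend on the
compiler and architecture.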

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com


From rdreier at cisco.com  Mon Jul 30 13:30:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:30:40 -0700
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell
	writes
In-Reply-To: <20070726014931.GL10235@sgi.com> (akepner@sgi.com's message of
	"Wed, 25 Jul 2007 18:49:31 -0700")
References: <20070726014931.GL10235@sgi.com>
Message-ID: <adamyxdr5vz.fsf@cisco.com>

 > +union mthca_doorbell {
 > +	__be64 val64;
 > +	__be32 val32[2];
 > +} __attribute__ ((aligned (sizeof(__be64))));

Would we get the same effect from just adding the
__attribute__((aligned(...))) to the declarations of the doorbell arrays?

I wonder how it would affect the generated code on various platforms
if we just made the doorbell values be computed as __be64 and then
passed that in to the write64 function...

 - R.


From akepner at sgi.com  Mon Jul 30 13:43:45 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Mon, 30 Jul 2007 13:43:45 -0700
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell
	writes
In-Reply-To: <adamyxdr5vz.fsf@cisco.com>
References: <20070726014931.GL10235@sgi.com> <adamyxdr5vz.fsf@cisco.com>
Message-ID: <20070730204345.GI10032@sgi.com>

On Mon, Jul 30, 2007 at 01:30:40PM -0700, Roland Dreier wrote:
>  > +union mthca_doorbell {
>  > +	__be64 val64;
>  > +	__be32 val32[2];
>  > +} __attribute__ ((aligned (sizeof(__be64))));
> 
> Would we get the same effect from just adding the
> __attribute__((aligned(...))) to the declarations of the doorbell arrays?

Yes. 

(And of course using "((aligned (sizeof(__be64))))" with a union 
containing a __be64 member is silly anyway....)
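For anyone who wants to check this on their own platform, here is a
small user-space sketch.  It uses uint64_t/uint32_t in place of the
kernel's __be64/__be32, and doorbell_sketch and doorbell_words are
names made up for the example.  It reports the natural alignment of
the union and shows the attribute applied to an array declaration
instead (GCC/Clang syntax).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

/* User-space stand-in for the union in the patch, without the attribute. */
union doorbell_sketch {
	uint64_t val64;
	uint32_t val32[2];
};

/* The union can never be less aligned than its 32-bit member. */
static_assert(alignof(union doorbell_sketch) >= alignof(uint32_t),
	      "union alignment is at least that of its members");

/* Alternative from the thread: align the array declaration itself. */
static uint32_t doorbell_words[2] __attribute__((aligned(8)));

/* Returns 1 if p sits on an 8-byte boundary, 0 otherwise. */
static int is_aligned_8(const void *p)
{
	return ((uintptr_t)p % 8) == 0;
}

static void print_alignment(void)
{
	printf("alignof(union doorbell_sketch) = %zu\n",
	       alignof(union doorbell_sketch));
	printf("doorbell_words 8-byte aligned: %d\n",
	       is_aligned_8(doorbell_words));
}
```

The first line of output is the interesting one: it is whatever the
ABI gives a union holding a 64-bit member, which is the value the
explicit attribute on the union would otherwise be asking for.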

> 
> I wonder how it would affect the generated code on various platforms
> if we just made the doorbell values be computed as __be64 and then
> passed that in to the write64 function...
> 

That'd work fine for ia64 :-) 

For other platforms I can't answer...

-- 
Arthur


From rdreier at cisco.com  Mon Jul 30 13:45:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:45:00 -0700
Subject: [ofa-general] Re: [PATCH] mad.c: Fix memory leak in switch handling
	and improve error handling
In-Reply-To: <f0e08f230707290427x4ab37716t76c7f9692eed5b1c@mail.gmail.com>
	(Hal Rosenstock's message of "Sun, 29 Jul 2007 07:27:31 -0400")
References: <f0e08f230707290427x4ab37716t76c7f9692eed5b1c@mail.gmail.com>
Message-ID: <adair81r583.fsf@cisco.com>

I'm having a hard time seeing what this does exactly.  It seems that
the current code

		} else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
			/* forward case for switches */
			memcpy(response, recv, sizeof(*response));

will blindly dereference response even if the allocation failed, so
the first chunk, which bails out if allocating response fails, seems to
be fixing this.  Anyway this seems like an unrelated change to the rest
of the patch.
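To spell out the pattern I mean, here is a user-space sketch with
invented names (mock_mad, mock_forward_copy, failing_alloc); it is not
the mad.c code, and the allocator is passed in only so the failure
path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the MAD buffer being forwarded. */
struct mock_mad {
	unsigned char data[256];
};

/*
 * Copy recv into a freshly allocated response.  Returns NULL when the
 * allocation fails, so response is never dereferenced in that case.
 * The caller owns (and must free) the returned buffer.
 */
static struct mock_mad *mock_forward_copy(const struct mock_mad *recv,
					  void *(*alloc)(size_t))
{
	struct mock_mad *response;

	assert(recv != NULL);

	response = alloc(sizeof(*response));
	if (!response)
		return NULL;	/* bail out before touching response */

	memcpy(response, recv, sizeof(*response));
	return response;
}

/* Allocator that always fails, for exercising the error path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Calling mock_forward_copy(&recv, failing_alloc) returns NULL without
touching memory, while passing malloc yields a byte-for-byte copy.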

I guess the leak fix is:

 > -                       if (!agent_send_response(&response->mad.mad,

 > +                       agent_send_response(&response->mad.mad,

but now you're ignoring the return value of that function.
Hmm... seems that the only other caller ignores the return
value too.  Should agent_send_response() just become a void function,
since it doesn't seem as if there's anything useful to do with the
return value anyway?

 - R.


From hal.rosenstock at gmail.com  Mon Jul 30 13:57:24 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 16:57:24 -0400
Subject: [ofa-general] Re: [PATCH] mad.c: Fix memory leak in switch handling
	and improve error handling
In-Reply-To: <adair81r583.fsf@cisco.com>
References: <f0e08f230707290427x4ab37716t76c7f9692eed5b1c@mail.gmail.com>
	<adair81r583.fsf@cisco.com>
Message-ID: <f0e08f230707301357k128a03caib649270bad983bf7@mail.gmail.com>

On 7/30/07, Roland Dreier <rdreier at cisco.com> wrote:
> I'm having a hard time seeing what this does exactly.  It seems that
> the current code
>
>                } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
>                        /* forward case for switches */
>                        memcpy(response, recv, sizeof(*response));
>
> will blindly dereference response even if the allocation failed, so
> the first chunk, which bails out if allocating response fails, seems to
> be fixing this.

Yes.

> Anyway this seems like an unrelated change to the rest
> of the patch.

Do these need to be 2 separate patches?

> I guess the leak fix is:
>
>  > -                       if (!agent_send_response(&response->mad.mad,
>
>  > +                       agent_send_response(&response->mad.mad,

Yes (no longer doing the "goto out" there is the leak fix).

> but now you're ignoring the return value of that function.
> Hmm... seems that the only other caller ignores the return
> value too.  Should agent_send_response() just become a void function,
> since it doesn't seem as if there's anything useful to do with the
> return value anyway?

It could. Do you want an updated patch (or patches)? Let me know.

-- Hal

>  - R.
>


From rdreier at cisco.com  Mon Jul 30 14:20:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 14:20:22 -0700
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for
	the bonding driver
In-Reply-To: <46ADDB89.5030601@voltaire.com> (Moni Shoua's message of "Mon,
	30 Jul 2007 15:37:29 +0300")
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <adaejipr3l5.fsf@cisco.com>

 > 1. When bonding enslaves an IPoIB device the bonding neighbor holds a 
 > reference to a cleanup function in the IPoIB driver. This makes it unsafe to 
 > unload the IPoIB module if there are bonding neighbors in the air. So, to 
 > avoid this race one must unload bonding before unloading IPoIB. 

I think we really want to resolve this somehow.  Getting an oops by
doing "modprobe -r ipoib" isn't that friendly.

Also, what happened to the problem of having an address handle
belonging to the wrong device on bond failover?  Did you figure out a
way to fix that one?

 - R.


From ardavis at ichips.intel.com  Mon Jul 30 14:32:49 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 30 Jul 2007 14:32:49 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <adazm1dr653.fsf@cisco.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>	<adamyy27cxk.fsf@cisco.com>
	<46956FF9.50102@ichips.intel.com>	<46968448.2000401@ichips.intel.com>	<46A536EC.4060201@ichips.intel.com>
	<adazm1dr653.fsf@cisco.com>
Message-ID: <46AE5901.7010307@ichips.intel.com>

Roland Dreier wrote:

> > Maintainers: please review the following proposal regarding new public
> > download locations/website links and respond. This request originated
> > from xwg.
> > 
> > http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html
>
>I guess it's OK, but what's the difference between a README and a
>WEB_README?
>
>Would it make sense to have just one file (maybe in a format that is
>easily transformed to HTML, eg reStructuredText) for all purposes?
>  
>

That works for me. I was waiting to hear back from Jeff regarding a 
filename and content.

Jeff, can you comment? What format will work best for you?

-arlin


From rdreier at cisco.com  Mon Jul 30 14:33:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 14:33:10 -0700
Subject: [ofa-general] Re: NOSRQ QP implementation issues
In-Reply-To: <46AE3701.40603@linux.vnet.ibm.com> (Pradeep Satyanarayana's
	message of "Mon, 30 Jul 2007 12:07:45 -0700")
References: <adak5sr231k.fsf@cisco.com> <46AE3701.40603@linux.vnet.ibm.com>
Message-ID: <adaabtdr2zt.fsf@cisco.com>

 > For sending (both on the active and passive side) the skbs are associated 
 > with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side)
 > and WRs are posted to receive packets. An skb (for send) is not associated 
 > with SQ of the rx_qp. Therefore, no packets are expected to be sent through
 > the rx_qp.
 > 
 > In an erroneous case if packets do get sent to the wrong RQ, then they will
 > get dropped as no WQEs are posted. As discussed, an RNR will be returned as
 > expected and a new connection will get established. I still see no issues 
 > with this either.
 > 
 > If in the future, we do want to use the unused SQ and RQs, then we will have
 > to associate them with corresponding QP at the remote end. This will be work
 > for both the SRQ and non-SRQ case.
 > 
 > I do not see any issues. Can you please explain what is missing with this 
 > implementation?

I think what you are missing is that Linux is not necessarily the only
IPoIB CM implementation.  The Linux IPoIB driver needs to be able to
talk to any other implementation that follows the RFCs, in particular
RFC 4755 for connected mode.  And according to my reading of the RFC
at least, it is OK for a system to accept an IPoIB CM connection and
then use that connection to send packets back to the system that
originated the connection.  There is no requirement that a new
connection be opened for traffic in the other direction.

And killing the connection as soon as a packet is sent in the wrong
direction seems pretty wrong to me.  The current SRQ code actually
handles it fine, because all the QPs, no matter which direction they
were opened, are attached to the SRQ and hence have receives available.

One possibility would be to set the maximum receive MTU to 0 for
connections initiated in the no-SRQ case.  However I'm not sure
whether that is within the spirit of the RFC, and it might really
confuse other systems that do want to send on that QP.  Another
possibility would be to post one receive to all no-SRQ QPs, and if
that receive is consumed then post more.
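To make the second possibility a bit more concrete, here is a toy
user-space model of it.  Nothing here is IPoIB or verbs code: mock_qp,
REFILL_BATCH and the refill policy are all assumptions made up for the
sketch.

```c
#include <assert.h>

#define REFILL_BATCH 8	/* hypothetical number of receives to post */

/* Toy model of the receive side of a no-SRQ QP. */
struct mock_qp {
	int posted_recvs;	/* receives currently posted */
	int delivered;		/* packets that found a posted receive */
	int dropped;		/* packets that arrived with none posted */
};

/* Start with a single "sentinel" receive posted. */
static void mock_qp_init(struct mock_qp *qp)
{
	qp->posted_recvs = 1;
	qp->delivered = 0;
	qp->dropped = 0;
}

/*
 * A packet arrives.  If it consumes the last posted receive, treat
 * that as a sign the peer wants to send on this QP and post a batch.
 */
static void mock_packet_arrives(struct mock_qp *qp)
{
	assert(qp->posted_recvs >= 0);

	if (qp->posted_recvs == 0) {
		qp->dropped++;
		return;
	}

	qp->posted_recvs--;
	qp->delivered++;

	if (qp->posted_recvs == 0)
		qp->posted_recvs = REFILL_BATCH;
}
```

In this model an idle QP costs one receive buffer, and the first
packet in the "wrong" direction is delivered rather than dropped.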

 - R.


From pauln at psc.edu  Mon Jul 30 14:36:41 2007
From: pauln at psc.edu (Paul Nowoczynski)
Date: Mon, 30 Jul 2007 17:36:41 -0400
Subject: [ofa-general] Re: SDP kernel Oops.. (OFED-1.2)
In-Reply-To: <46AE4989.9010508@psc.edu>
References: <46AE4989.9010508@psc.edu>
Message-ID: <46AE59E9.7070103@psc.edu>

I was running old firmware. Upgrading to 4.8.200 seems to have fixed 
the problem.
paul


Paul Nowoczynski wrote:
> Jim,
> I just ran with 1.2 and hit the same bug.  I've included the debug 
> msgs leading up to the oops (at the bottom).  I think the problem has 
> to do with handling a connection request after a socket has been 
> destroyed.  The failed instance of sdp_connect_handler() doesn't 
> appear to run sdp_init_qp() so I assume that it fails somewhere before 
> that.
>
> I wonder if event->param.conn.private_data is bogus?
>
> Thanks for your help.
> Paul
>
>
> int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id,
>                        struct rdma_cm_event *event)
> {
>        struct sockaddr_in *dst_addr;
>        struct sock *child;
>        const struct sdp_hh *h;
>        int rc;
>
>        sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id);
>
>        h = event->param.conn.private_data;
>
>        if (!h->max_adverts)
>                return -EINVAL;
>
>        child = sk_clone(sk, GFP_KERNEL);
>        if (!child)
>                return -ENOMEM;
>
>        sdp_add_sock(sdp_sk(child));
>        INIT_LIST_HEAD(&sdp_sk(child)->accept_queue);
>        INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue);
>        INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work, 
> sdp_time_wait_work);
>        INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work);
>
>        dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr;
>        inet_sk(child)->dport = dst_addr->sin_port;
>        inet_sk(child)->daddr = dst_addr->sin_addr.s_addr;
>
>        bh_unlock_sock(child);
>        __sock_put(child);
>
>        rc = sdp_init_qp(child, id);
> ...
>
> ################# Console Msgs 
> ###########################################
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login: Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp done
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connect_handler 
> bufs 64 xmit_size_goal 32768
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 
> 4 handled
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler 
> event 9 id 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): 
> RDMA_CM_EVENT_ESTABLISHED
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connected_handler
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connected_handler 
> child connection established
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler 
> event 9 handled
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): event 9 done. status 0
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_accept: 
> ib_req_notify_cq
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 
> sk 000001014e790780 newsk 000001014e790040
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_setsockopt
>
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
>
> oss08p login: Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp done
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> sdp_connect_handler bufs 64 xmit_size_goal 32768
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 
> 4 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 9 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> RDMA_CM_EVENT_ESTABLISHED
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connected_handler
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connected_handler 
> child connection established
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 9 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 9 done. 
> status 0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_accept: 
> ib_req_notify_cq
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 
> sk 000001014e790780 newsk 0000010151c457c0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_setsockopt
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_fin
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: 
> entering time wait refcnt 2
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: last 
> socket put 2
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_unhash
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_handle_wc: 
> destroy in time wait state
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 10 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> RDMA_CM_EVENT_DISCONNECTED
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destroy_work: 
> refcnt 1
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> sdp_disconnected_handler
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 10 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_reset_sk
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 10 done. 
> status -104
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk done; 
> releasing sock
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct done
>
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Kernel BUG at sdp_cma:372
> invalid operand: 0000 [1] SMP
> CPU 1
> Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib 
> ib_srp ib_sdp rdma_cm ib_addr iw_cm ib_local_sa ib_cm iptable_filter i
> p_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md 
> forcedeth
> Pid: 2362, comm: ib_cm/1 Not tainted 
> 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
> RIP: 0010:[<ffffffffa00abbbf>] 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
> RSP: 0018:0000010155f21bc8  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000
> RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640
> RBP: ffffffffa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08
> R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780
> R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4
> FS:  0000002a9589db00(0000) GS:ffffffff805a3140(0000) 
> knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0
> Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task 
> 00000101565897e0)
> Stack: 000001015ede2400 000001014e7909e0 000001014e790780 
> 000001015ede2400
>       0000010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581
>       000001015ede2400 000001015ede2458
> Call Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
> <ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
>       <ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
> <ffffffffa006859a>{:ib_cm:cm_process_work+26}
>       <ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0}
>       <ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
> <ffffffff80133133>{__wake_up+67}
>       <ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
> <ffffffff80149110>{worker_thread+496}
>       <ffffffff80133070>{default_wake_function+0} 
> <ffffffff801330c0>{__wake_up_common+64}
>       <ffffffff80133070>{default_wake_function+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
>       <ffffffff80148f20>{worker_thread+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
>       <ffffffff8014d5e9>{kthread+217} <ffffffff8011144b>{child_rip+8}
>       <ffffffff8014d630>{keventd_create_kthread+0} 
> <ffffffff8014d510>{kthread+0}
>       <ffffffff80111443>{child_rip+0}
>
> Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04
> RIP <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP 
> <0000010155f21bc8>
> <0>Kernel panic - not syncing: Oops
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ede2400
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ede2400
> Jul 30 16:10:11 oss08p kernel: ----------- [cut here ] --------- 
> [please bite here ] ---------
> Jul 30 16:10:12 oss08p kernel: Kernel BUG at sdp_cma:372
> Jul 30 16:10:12 oss08p kernel: invalid operand: 0000 [1] SMP
> Jul 30 16:10:12 oss08p kernel: CPU 1
> Jul 30 16:10:12 oss08p kernel: Modules linked in: ksocklnd ptlrpc 
> obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm 
> ib_local_sa ib_cm iptable_filter ip_tables e1000 ib_sa ib_uverbs ib_umad 
> ib_mthca ib_mad ib_core md forcedeth
> Jul 30 16:10:12 oss08p kernel: Pid: 2362, comm: ib_cm/1 Not tainted 
> 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
> Jul 30 16:10:12 oss08p kernel: RIP: 0010:[<ffffffffa00abbbf>] 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
> Jul 30 16:10:12 oss08p kernel: RSP: 0018:0000010155f21bc8  EFLAGS: 
> 00010246
> Jul 30 16:10:12 oss08p kernel: RAX: 0000000000000000 RBX: 
> 0000010150e45800 RCX: 0000000000000000
> Jul 30 16:10:12 oss08p kernel: RDX: 00000000ffffffff RSI: 
> 0000010150e450a8 RDI: ffffffffa00b3640
> Jul 30 16:10:12 oss08p kernel: RBP: ffffffffa00b3140 R08: 
> 0000010155ef8ef8 R09: 0000010155ef8f08
> Jul 30 16:10:12 oss08p kernel: R10: 00000000ffffffff R11: 
> 0000000000000000 R12: 000001014e790780
> Jul 30 16:10:12 oss08p kernel: R13: 0000010155f21cf8 R14: 
> 0000000000000000 R15: 000001015681bfa4
> Jul 30 16:10:12 oss08p kernel: FS:  0000002a9589db00(0000) 
> GS:ffffffff805a3140(0000) knlGS:0000000000000000
> Jul 30 16:10:12 oss08p kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
> 000000008005003b
> Jul 30 16:10:12 oss08p kernel: CR2: 0000002a986adad8 CR3: 
> 000000007ea38000 CR4: 00000000000006e0
> Jul 30 16:10:12 oss08p kernel: Process ib_cm/1 (pid: 2362, threadinfo 
> 0000010155f20000, task 00000101565897e0)
> Jul 30 16:10:12 oss08p kernel: Stack: 000001015ede2400 
> 000001014e7909e0 000001014e790780 000001015ede2400
> Jul 30 16:10:12 oss08p kernel:        0000010155f21cf8 
> 0000000000000000 0000000000000000 ffffffffa00ac581
> Jul 30 16:10:12 oss08p kernel:        000001015ede2400 000001015ede2458
> Jul 30 16:10:12 oss08p kernel: Call 
> Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
> <ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
> <ffffffffa006859a>{:ib_cm:cm_process_work+26}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
> <ffffffff80133133>{__wake_up+67}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
> <ffffffff80149110>{worker_thread+496}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80133070>{default_wake_function+0} 
> <ffffffff801330c0>{__wake_up_common+64}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80133070>{default_wake_function+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80148f20>{worker_thread+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
> Jul 30 16:10:12 oss08p kernel:        <ffffffff8014d5e9>{kthread+217} 
> <ffffffff8011144b>{child_rip+8}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff8014d630>{keventd_create_kthread+0} 
> <ffffffff8014d510>{kthread+0}
> Jul 30 16:10:12 oss08p kernel:        <ffffffff80111443>{child_rip+0}
> Jul 30 16:10:12 oss08p kernel:
> Jul 30 16:10:12 oss08p kernel: Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 
> 01 66 66 90 66 90 65 8b 04
> Jul 30 16:10:12 oss08p kernel: RIP 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP 
> <0000010155f21bc8>
> Jul 30 16:10:12 oss08p kernel:  <0>Kernel panic - not syncing: Oops
>
>


From rdreier at cisco.com  Mon Jul 30 14:37:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 14:37:38 -0700
Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance
	with Modified Write Protocol
In-Reply-To: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip> (Ken Jeffries's message
	of "Mon, 30 Jul 2007 10:30:18 -0500")
References: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip>
Message-ID: <ada6441r2sd.fsf@cisco.com>

 > We have been doing a fair amount of performance testing on our SRP target.
 > One thing we found early on was that client writes were considerably slower
 > than client reads. We addressed this by patching the SRP client code so
 > that it could include the client write data in the SRP CMD IU if it would
 > fit. This notion is in iSER but is not in standard SRP. Architecturally,
 > the capability is signaled using an additional data buffer format bit.
 > We find that client write performance is considerably improved by using
 > this capability. We are calling SRP spec compliant writes "standard
 > writes" and our modified writes "iu data writes".

I think this may make sense but you probably want to involve T10 to
get it standardized somehow.  Also, although I know having a big IOP
number is important for various non-technical reasons, are there any
realistic storage workloads that do lots of single-block writes?

Also I guess you need to use giant IUs to be able to hold at least one
block in the IU?

 - R.
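
A rough sketch of the sizing arithmetic behind the giant-IU question.
It assumes (this is not stated in the thread) that an SRP_CMD IU is
budgeted as a 48-byte command header, plus a 20-byte indirect descriptor
header, plus 16 bytes per scatter/gather entry, and that inline write
data would reuse the space after the 48-byte header. The function names
are made up for illustration.

```c
#include <stddef.h>

/* Assumed sizes (see the note above; not taken from this thread). */
enum {
    SRP_CMD_HDR_LEN      = 48,  /* fixed part of the SRP_CMD IU */
    SRP_INDIRECT_HDR_LEN = 20,  /* indirect buffer descriptor header */
    SRP_DIRECT_DESC_LEN  = 16   /* one direct descriptor (one S/G entry) */
};

/* Largest IU needed to describe sg_tablesize scatter/gather entries. */
size_t srp_max_iu_len(unsigned int sg_tablesize)
{
    return SRP_CMD_HDR_LEN + SRP_INDIRECT_HDR_LEN
         + (size_t)sg_tablesize * SRP_DIRECT_DESC_LEN;
}

/* Hypothetical check: would data_len bytes of write data fit in the IU
 * if they were placed directly behind the command header? */
int iu_can_hold_inline(size_t iu_len, size_t data_len)
{
    return iu_len >= SRP_CMD_HDR_LEN &&
           iu_len - SRP_CMD_HDR_LEN >= data_len;
}
```

Under those assumptions a 255-entry table gives a 4148-byte IU, which
leaves 4100 bytes behind the header, so one 4096-byte block would fit.
A 12-entry table gives a 260-byte IU, which could not hold even a
512-byte block.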


From jimmmott at austin.rr.com  Mon Jul 30 14:54:15 2007
From: jimmmott at austin.rr.com (Jim Mott)
Date: Mon, 30 Jul 2007 16:54:15 -0500
Subject: [ofa-general] Re: SDP kernel Oops.. (OFED-1.2)
In-Reply-To: <46AE59E9.7070103@psc.edu>
References: <46AE4989.9010508@psc.edu> <46AE59E9.7070103@psc.edu>
Message-ID: <004301c7d2f4$2d651180$882f3480$@rr.com>

Great!

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Paul Nowoczynski
Sent: Monday, July 30, 2007 4:37 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] Re: SDP kernel Oops.. (OFED-1.2)

I was running old firmware. Upgrading to 4.8.200 seems to have fixed 
the problem.
paul


Paul Nowoczynski wrote:
> Jim,
> I just ran with 1.2 and hit the same bug.  I've included the debug 
> msgs leading up to the oops (at the bottom).  I think the problem has 
> to do with handling a connection request after a socket has been 
> destroyed.  The failed instance of sdp_connect_handler() doesn't 
> appear to run sdp_init_qp() so I assume that it fails somewhere before 
> that.
>
> I wonder if event->param.conn.private_data is bogus?
>
> Thanks for your help.
> Paul
>
>
> int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id,
>                        struct rdma_cm_event *event)
> {
>        struct sockaddr_in *dst_addr;
>        struct sock *child;
>        const struct sdp_hh *h;
>        int rc;
>
>        sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id);
>
>        h = event->param.conn.private_data;
>
>        if (!h->max_adverts)
>                return -EINVAL;
>
>        child = sk_clone(sk, GFP_KERNEL);
>        if (!child)
>                return -ENOMEM;
>
>        sdp_add_sock(sdp_sk(child));
>        INIT_LIST_HEAD(&sdp_sk(child)->accept_queue);
>        INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue);
>        INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work, 
> sdp_time_wait_work);
>        INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work);
>
>        dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr;
>        inet_sk(child)->dport = dst_addr->sin_port;
>        inet_sk(child)->daddr = dst_addr->sin_addr.s_addr;
>
>        bh_unlock_sock(child);
>        __sock_put(child);
>
>        rc = sdp_init_qp(child, id);
> ...
>
> ################# Console Msgs 
> ###########################################
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login: Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp done
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connect_handler 
> bufs 64 xmit_size_goal 32768
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 
> 4 handled
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler 
> event 9 id 000001015ed2b600
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): 
> RDMA_CM_EVENT_ESTABLISHED
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connected_handler
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connected_handler 
> child connection established
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler 
> event 9 handled
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): event 9 done. status 0
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_accept: 
> ib_req_notify_cq
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 
> sk 000001014e790780 newsk 000001014e790040
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_setsockopt
>
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login: Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp done
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> sdp_connect_handler bufs 64 xmit_size_goal 32768
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 
> 4 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): event 4 done. status 0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 9 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> RDMA_CM_EVENT_ESTABLISHED
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connected_handler
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connected_handler 
> child connection established
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 9 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 9 done. 
> status 0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_accept: 
> ib_req_notify_cq
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 
> sk 000001014e790780 newsk 0000010151c457c0
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 
> expected 10 *err -22
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 
> sk 000001014e790780 newsk 0000000000000000
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_setsockopt
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_fin
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: 
> entering time wait refcnt 2
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: last 
> socket put 2
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_unhash
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_handle_wc: 
> destroy in time wait state
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 10 id 000001015ee0c800
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> RDMA_CM_EVENT_DISCONNECTED
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destroy_work: 
> refcnt 1
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): 
> sdp_disconnected_handler
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler 
> event 10 handled
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_reset_sk
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 10 done. 
> status -104
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk done; 
> releasing sock
> Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct done
>
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Fedora Core release 3 (Heidelberg)
> Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64
>
> oss08p login:
> Kernel BUG at sdp_cma:372
> invalid operand: 0000 [1] SMP
> CPU 1
> Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib 
> ib_srp ib_sdp rdma_cm ib_addr iw_cm ib_local_sa ib_cm iptable_filter i
> p_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md 
> forcedeth
> Pid: 2362, comm: ib_cm/1 Not tainted 
> 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
> RIP: 0010:[<ffffffffa00abbbf>] 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
> RSP: 0018:0000010155f21bc8  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000
> RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640
> RBP: ffffffffa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08
> R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780
> R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4
> FS:  0000002a9589db00(0000) GS:ffffffff805a3140(0000) 
> knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0
> Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task 
> 00000101565897e0)
> Stack: 000001015ede2400 000001014e7909e0 000001014e790780 
> 000001015ede2400
>       0000010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581
>       000001015ede2400 000001015ede2458
> Call Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
> <ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
>       <ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
> <ffffffffa006859a>{:ib_cm:cm_process_work+26}
>       <ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0}
>       <ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
> <ffffffff80133133>{__wake_up+67}
>       <ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
> <ffffffff80149110>{worker_thread+496}
>       <ffffffff80133070>{default_wake_function+0} 
> <ffffffff801330c0>{__wake_up_common+64}
>       <ffffffff80133070>{default_wake_function+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
>       <ffffffff80148f20>{worker_thread+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
>       <ffffffff8014d5e9>{kthread+217} <ffffffff8011144b>{child_rip+8}
>       <ffffffff8014d630>{keventd_create_kthread+0} 
> <ffffffff8014d510>{kthread+0}
>       <ffffffff80111443>{child_rip+0}
>
> Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04
> RIP <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP 
> <0000010155f21bc8>
> <0>Kernel panic - not syncing: Oops
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
> sdp_cma_handler event 4 id 000001015ede2400
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): 
> RDMA_CM_EVENT_CONNECT_REQUEST
> Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 
> 000001015ed30c00 -> 000001015ede2400
> Jul 30 16:10:11 oss08p kernel: ----------- [cut here ] --------- 
> [please bite here ] ---------
> Jul 30 16:10:12 oss08p kernel: Kernel BUG at sdp_cma:372
> Jul 30 16:10:12 oss08p kernel: invalid operand: 0000 [1] SMP
> Jul 30 16:10:12 oss08p kernel: CPU 1
> Jul 30 16:10:12 oss08p kernel: Modules linked in: ksocklnd ptlrpc 
> obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm 
> ib_local_sa ib_cm iptable_filter ip_tables e1000 ib_sa ib_uverbs ib_umad 
> ib_mthca ib_mad ib_core md forcedeth
> Jul 30 16:10:12 oss08p kernel: Pid: 2362, comm: ib_cm/1 Not tainted 
> 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4
> Jul 30 16:10:12 oss08p kernel: RIP: 0010:[<ffffffffa00abbbf>] 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207}
> Jul 30 16:10:12 oss08p kernel: RSP: 0018:0000010155f21bc8  EFLAGS: 
> 00010246
> Jul 30 16:10:12 oss08p kernel: RAX: 0000000000000000 RBX: 
> 0000010150e45800 RCX: 0000000000000000
> Jul 30 16:10:12 oss08p kernel: RDX: 00000000ffffffff RSI: 
> 0000010150e450a8 RDI: ffffffffa00b3640
> Jul 30 16:10:12 oss08p kernel: RBP: ffffffffa00b3140 R08: 
> 0000010155ef8ef8 R09: 0000010155ef8f08
> Jul 30 16:10:12 oss08p kernel: R10: 00000000ffffffff R11: 
> 0000000000000000 R12: 000001014e790780
> Jul 30 16:10:12 oss08p kernel: R13: 0000010155f21cf8 R14: 
> 0000000000000000 R15: 000001015681bfa4
> Jul 30 16:10:12 oss08p kernel: FS:  0000002a9589db00(0000) 
> GS:ffffffff805a3140(0000) knlGS:0000000000000000
> Jul 30 16:10:12 oss08p kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
> 000000008005003b
> Jul 30 16:10:12 oss08p kernel: CR2: 0000002a986adad8 CR3: 
> 000000007ea38000 CR4: 00000000000006e0
> Jul 30 16:10:12 oss08p kernel: Process ib_cm/1 (pid: 2362, threadinfo 
> 0000010155f20000, task 00000101565897e0)
> Jul 30 16:10:12 oss08p kernel: Stack: 000001015ede2400 
> 000001014e7909e0 000001014e790780 000001015ede2400
> Jul 30 16:10:12 oss08p kernel:        0000010155f21cf8 
> 0000000000000000 0000000000000000 ffffffffa00ac581
> Jul 30 16:10:12 oss08p kernel:        000001015ede2400 000001015ede2458
> Jul 30 16:10:12 oss08p kernel: Call 
> Trace:<ffffffffa00ac581>{:ib_sdp:sdp_cma_handler+945} 
> <ffffffffa00a0347>{:rdma_cm:cma_acquire_dev+359}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa00a1488>{:rdma_cm:cma_req_handler+1000} 
> <ffffffffa006859a>{:ib_cm:cm_process_work+26}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa006901f>{:ib_cm:cm_req_handler+2463} 
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa00694d2>{:ib_cm:cm_work_handler+66} 
> <ffffffff80133133>{__wake_up+67}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffffa0069490>{:ib_cm:cm_work_handler+0} 
> <ffffffff80149110>{worker_thread+496}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80133070>{default_wake_function+0} 
> <ffffffff801330c0>{__wake_up_common+64}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80133070>{default_wake_function+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff80148f20>{worker_thread+0} 
> <ffffffff8014d630>{keventd_create_kthread+0}
> Jul 30 16:10:12 oss08p kernel:        <ffffffff8014d5e9>{kthread+217} 
> <ffffffff8011144b>{child_rip+8}
> Jul 30 16:10:12 oss08p kernel:        
> <ffffffff8014d630>{keventd_create_kthread+0} 
> <ffffffff8014d510>{kthread+0}
> Jul 30 16:10:12 oss08p kernel:        <ffffffff80111443>{child_rip+0}
> Jul 30 16:10:12 oss08p kernel:
> Jul 30 16:10:12 oss08p kernel: Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 
> 01 66 66 90 66 90 65 8b 04
> Jul 30 16:10:12 oss08p kernel: RIP 
> <ffffffffa00abbbf>{:ib_sdp:sdp_connect_handler+207} RSP 
> <0000010155f21bc8>
> Jul 30 16:10:12 oss08p kernel:  <0>Kernel panic - not syncing: Oops
>
>
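
On the question in the quoted message of whether
event->param.conn.private_data might be bogus: the thread points to old
firmware as the cause, so that hypothesis was never confirmed. Still, a
defensive check before dereferencing the hello header is cheap. The
sketch below is a user-space model, assuming the event also carries a
length for the private data. struct sdp_hh_sketch and sdp_check_hello()
are hypothetical names, and the struct is a simplified stand-in for the
real struct sdp_hh.

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/* Hypothetical, simplified stand-in for the SDP hello header. The real
 * struct sdp_hh has more fields; only max_adverts matters here. */
struct sdp_hh_sketch {
    uint8_t bsdh[16];      /* placeholder for the common SDP header */
    uint8_t majv_minv;
    uint8_t ipv_cap;
    uint8_t rsvd1;
    uint8_t max_adverts;
};

/* Returns 0 if the private data looks usable, or a negative errno.
 * The pointer and length are checked before any field is read. */
int sdp_check_hello(const void *private_data, size_t private_data_len)
{
    const struct sdp_hh_sketch *h = private_data;

    if (!private_data)
        return -EINVAL;
    if (private_data_len < sizeof(*h))
        return -EINVAL;
    if (!h->max_adverts)       /* same test the quoted handler performs */
        return -EINVAL;
    return 0;
}
```

A connect handler could call such a check first and reject the request
on a non-zero return, instead of reading h->max_adverts unconditionally.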

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


From ycai at Brocade.COM  Mon Jul 30 15:25:44 2007
From: ycai at Brocade.COM (Ying Cai)
Date: Mon, 30 Jul 2007 15:25:44 -0700
Subject: [ofa-general] Event for active/passive connection
Message-ID: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com>

Hi,

After reading the OFED 1.2 code, I have a question.

In cma_iw_handler():

        case IW_CM_EVENT_CONNECT_REPLY:
                ...
                switch (iw_event->status) {
                case 0:
                        event.event = RDMA_CM_EVENT_ESTABLISHED;
                        break;
                ...
                }
                break;
        case IW_CM_EVENT_ESTABLISHED:
                event.event = RDMA_CM_EVENT_ESTABLISHED;
                break;

This could cause a problem in SDP, since in SDP RDMA_CM_EVENT_ESTABLISHED
is handled by sdp_connected_handler(), which can only handle the passive
connection case (it assumes the socket has a parent, which is only true
for listening sockets). Is the SDP over iWARP case tested, or did I miss
something?

It seems the correct event for SDP should be
RDMA_CM_EVENT_CONNECT_RESPONSE.

Thanks,

-Ying
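
A minimal user-space sketch of the concern above: an "established"
handler that works for both the active and the passive side by checking
for a parent before touching it. All names here (sock_sketch,
established_role, on_established) are hypothetical and only model the
idea; this is not the SDP code.

```c
#include <stddef.h>

/* Hypothetical model of a socket that may or may not have a parent. */
enum conn_role { ROLE_ACTIVE, ROLE_PASSIVE };

struct sock_sketch {
    struct sock_sketch *parent;  /* listening socket; NULL on the active side */
    int connected;               /* set once the connection is established */
    int accept_ready;            /* on a listener: children ready for accept() */
};

/* A socket with a parent was created for an incoming request (passive);
 * one without a parent initiated the connection itself (active). */
enum conn_role established_role(const struct sock_sketch *sk)
{
    return sk->parent ? ROLE_PASSIVE : ROLE_ACTIVE;
}

/* Handle an "established" event without assuming a parent exists. */
void on_established(struct sock_sketch *sk)
{
    sk->connected = 1;
    if (established_role(sk) == ROLE_PASSIVE)
        sk->parent->accept_ready++;  /* child can now be handed to accept() */
}
```

The alternative raised in the message, delivering a different event to
the active side, would avoid the need for this check in the consumer.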

 

From swise at opengridcomputing.com  Mon Jul 30 16:10:22 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 30 Jul 2007 18:10:22 -0500
Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
References: <46ADEE7F.2000005@mellanox.co.il>
	<46AE17E7.3020305@opengridcomputing.com>
	<0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
Message-ID: <46AE6FDE.6030604@opengridcomputing.com>

Jeff Squyres wrote:
> Yes, you missed it; the call was over about half an hour ago.  I 
> [re-]posted the dial-in info about 3 hours before the call this morning 
> on the ewg list.
> 

I see.  That's why I missed it. I'm not on the ewg list.

Are all attendees expected to be on the ewg list?


Steve.


> 
> On Jul 30, 2007, at 12:55 PM, Steve Wise wrote:
> 
>> Am I missing the call info?  I tried an older conf id, and it didn't 
>> work.  Can you please post the conf call info along with the meeting 
>> notification?
>>
>> Thanks,
>>
>> Steve.
>>
>>
>> Tziporet Koren wrote:
>>> Hi All,
>>> We will have our bi-weekly OFED meeting today at 9am PST
>>> Agenda:
>>> - Status update
>>> - Bugzilla cleanup
>>> If you have more agenda items please send them
>>> Tziporet
>>
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 
> 


From swise at opengridcomputing.com  Mon Jul 30 16:15:48 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 30 Jul 2007 18:15:48 -0500
Subject: [ofa-general] Event for active/passive connection
In-Reply-To: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com>
References: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com>
Message-ID: <46AE7124.7010903@opengridcomputing.com>

I've never tested SDP over iWARP...


Ying Cai wrote:
> Hi,
> 
> After reading the OFED 1.2 code, I have a question.
> 
> In cma_iw_handler():
> 
>         case IW_CM_EVENT_CONNECT_REPLY:
>                 …
>                 switch (iw_event->status) {
>                 case 0:
>                         event.event = RDMA_CM_EVENT_ESTABLISHED;
>                         break;
>                 …
>                 }
>                 break;
>         case IW_CM_EVENT_ESTABLISHED:
>                 event.event = RDMA_CM_EVENT_ESTABLISHED;
>                 break;
> 
> It could cause a problem in SDP, since in SDP RDMA_CM_EVENT_ESTABLISHED 
> is handled by sdp_connected_handler(), which can only handle passive 
> connection case (it assumes the socket has parent, which is only true 
> for listening sockets). Is the SDP over iWarp case tested, or did I miss 
> something?
> 
> Seems the correct event for SDP should be RDMA_CM_EVENT_CONNECT_RESPONSE.
> 
> Thanks,
> 
> -Ying


From jsquyres at cisco.com  Mon Jul 30 16:54:09 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 30 Jul 2007 19:54:09 -0400
Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST
In-Reply-To: <46AE6FDE.6030604@opengridcomputing.com>
References: <46ADEE7F.2000005@mellanox.co.il>
	<46AE17E7.3020305@opengridcomputing.com>
	<0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com>
	<46AE6FDE.6030604@opengridcomputing.com>
Message-ID: <03AEC794-684D-4FC3-BE77-024667268420@cisco.com>

On Jul 30, 2007, at 7:10 PM, Steve Wise wrote:

>> Yes, you missed it; the call was over about half an hour ago.  I  
>> [re-]posted the dial-in info about 3 hours before the call this  
>> morning on the ewg list.
>
> I see.  That's why I missed it. I'm not on the ewg list.
>
> Are all attendees expected to be on the ewg list?

It's an OFED-specific call, so I generally post the call info just to  
the EWG list (there's been some backlash before about posting OFED- 
specific stuff on the general list and/or not on the ewg list).

-- 
Jeff Squyres
Cisco Systems


From kenjeffries at storagegear.com  Mon Jul 30 17:00:18 2007
From: kenjeffries at storagegear.com (Ken Jeffries)
Date: Mon, 30 Jul 2007 19:00:18 -0500
Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance
	with Modified Write Protocol
In-Reply-To: <ada6441r2sd.fsf@cisco.com>
Message-ID: <02f601c7d305$c8b79480$0a97a8c0@blacktip>


Our implicit assumption has been that, since T10 abandoned SRP 2, the
T10/SRP community has little interest in SRP enhancements. If there is
IB/SRP community interest, we would certainly pursue a T10 project of
some sort. If StorageGear is the only interested party, then not so much.

Our general target is small clusters that benefit from SSDs. An SSD takes
advantage of IB much more fully than an IB RAID box does. Since we support
up to 4 HCAs, we envision clusters of up to 8 systems that use our system
without needing an IB switch. Enabling low-cost switchless clusters doesn't
sell many switches directly, but enabling low-cost IB does help the IB
market in general.

We think these clusters will want to do random "small" i/o's and that "small"
will almost always be larger than 512 bytes. 

Yes, we use giant IUs to be able to hold at least one block. "Giant" is
relative, though. Using srp_sg_tablesize=255 results in an IU of 4148
bytes, which is plenty to hold one 4096-byte block. Since our motherboard
supports up to 64GB, the overhead of the large IUs is a non-issue for
us. Of course, the client only transmits the used portion of the IU, so
standard (non-IU-data) writes remain small on the wire. The client-side
code simply passes an additional s/g entry to the IB layer, so no
client-side copy is done.
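
As a quick check of the 4148-byte figure, here is a small sketch of the
arithmetic. It assumes the usual SRP wire-format sizes (a 48-byte fixed
SRP_CMD header, a 20-byte indirect descriptor header and 16 bytes per
direct descriptor), none of which are stated above. The helper names are
made up for illustration, and "room after the header" is only a rough
measure, not a description of how the modified client lays out the data.

```c
#include <stddef.h>

/* Sizes below are assumptions based on the SRP wire format. */
enum {
	SRP_CMD_HDR_LEN      = 48, /* fixed part of an SRP_CMD IU */
	SRP_INDIRECT_HDR_LEN = 20, /* table descriptor (16) + total length (4) */
	SRP_DIRECT_DESC_LEN  = 16  /* address (8) + key (4) + length (4) */
};

/* Hypothetical helper: largest SRP_CMD IU for a given s/g table size. */
static size_t srp_max_iu_len(unsigned int sg_tablesize)
{
	return SRP_CMD_HDR_LEN + SRP_INDIRECT_HDR_LEN +
	       (size_t)sg_tablesize * SRP_DIRECT_DESC_LEN;
}

/* Hypothetical helper: bytes in the IU beyond the fixed command header. */
static size_t srp_iu_room_after_header(unsigned int sg_tablesize)
{
	return srp_max_iu_len(sg_tablesize) - SRP_CMD_HDR_LEN;
}
```

With srp_sg_tablesize=255 this gives 48 + 20 + 255 * 16 = 4148 bytes,
which leaves 4100 bytes after the fixed header, enough for one
4096-byte block.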

Ken

-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Monday, July 30, 2007 4:38 PM
To: Ken Jeffries
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] OFED SRP Client / StorageGear Target / Performance with Modified Write Protocol


 > We have been doing a fair amount of performance testing on our SRP target.
 > One thing we found early on was that client writes were considerably slower
 > than client reads. We addressed this by patching the SRP client code so
 > that it could include the client write data in the SRP CMD IU if it would
 > fit. This notion is in iSER but is not in standard SRP. Architecturally,
 > the capability is signaled using an additional data buffer format bit.
 > We find that client write performance is considerably improved by using
 > this capability. We are calling SRP spec compliant writes "standard
 > writes" and our modified writes "iu data writes".

I think this may make sense but you probably want to involve T10 to
get it standardized somehow.  Also, although I know having a big IOP
number is important for various non-technical reasons, are there any
realistic storage workloads that do lots of single-block writes?

Also I guess you need to use giant IUs to be able to hold at least one
block in the IU?

 - R.


From pradeeps at linux.vnet.ibm.com  Mon Jul 30 17:25:47 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 30 Jul 2007 17:25:47 -0700
Subject: [ofa-general] Re: NOSRQ QP implementation issues
In-Reply-To: <adaabtdr2zt.fsf@cisco.com>
References: <adak5sr231k.fsf@cisco.com> <46AE3701.40603@linux.vnet.ibm.com>
	<adaabtdr2zt.fsf@cisco.com>
Message-ID: <46AE818B.4080107@linux.vnet.ibm.com>

Roland Dreier wrote:
>  > For sending (both on the active and passive side) the skbs are associated 
>  > with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side)
>  > and WRs are posted to receive packets. An skb (for send) is not associated 
>  > with SQ of the rx_qp. Therefore, no packets are expected to be sent through
>  > the rx_qp.
>  > 
>  > In an erroneous case if packets do get sent to the wrong RQ, then they will
>  > get dropped as no WQEs are posted. As discussed, an RNR will be returned as
>  > expected and a new connection will get established. I still see no issues 
>  > with this either.
>  > 
>  > If in the future, we do want to use the unused SQ and RQs, then we will have
>  > to associate them with corresponding QP at the remote end. This will be work
>  > for both the SRQ and non-SRQ case.
>  > 
>  > I do not see any issues. Can you please explain what is missing with this 
>  > implementation?
> 
> I think what you are missing is that Linux is not necessarily the only
> IPoIB CM implementation.  The Linux IPoIB driver needs to be able to
> talk to any other implementation that follows the RFCs, in particular
> RFC 4755 for connected mode.  And according to my reading of the RFC
> at least, it is OK for a system to accept an IPoIB CM connection and
> then use that connection to send packets back to the system that
> originated the connection.  There is no requirement that a new
> connection be opened for traffic in the other direction.
> 
> And killing the connection as soon as a packet is sent in the wrong
> direction seems pretty wrong to me.  The current SRQ code actually
> handles it fine, because all the QPs, no matter which direction they
> were opened, are attached to the SRQ and hence have receives available.
> 
> One possibility would be to set the maximum receive MTU to 0 for
> connections initiated in the no-SRQ case.  However I'm not sure
> whether that is within the spirit of the RFC, and it might really
> confuse other systems that do want to send on that QP.  Another
> possibility would be to post one receive to all no-SRQ QPs, and if
> that receive is consumed then post more.
> 
>  - R.
> 
Thanks for pointing that out Roland. Yes, I was focussed on Linux and did not
consider other systems.

Michael, Thanks for catching this. Till I saw Roland's description I did not 
consider the other possibilities and did not see what you were alluding to.

What do you folks think about the following: in addition to posting 1 WR
suppose I create a separate CQ for the RQ (for tx_qp). There will be a small
completion handler that spits out a message that this request was received
from a non-Linux system, and then calls ipoib_ib_completion(). So, this way
we will not kill the connection, but the performance may be limited.
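
A toy model of the "post one receive, and post more once it is consumed"
idea may help the discussion. This is only a sketch under assumptions: the
struct, the function names and the ring depth of 64 are hypothetical, and
nothing here touches the real IPoIB or verbs code.

```c
#include <stdbool.h>

#define RX_RING_DEPTH 64	/* hypothetical ring size */

/* Toy model of the receive side of a QP that we opened for sending only. */
struct tx_qp_model {
	unsigned int rx_posted;	/* receive WRs currently posted */
	bool peer_sends_here;	/* set once the sentinel receive is consumed */
};

static void tx_qp_init(struct tx_qp_model *qp)
{
	qp->rx_posted = 1;	/* a single sentinel receive */
	qp->peer_sends_here = false;
}

/* Called for each receive completion on this QP. */
static void tx_qp_recv_done(struct tx_qp_model *qp)
{
	qp->rx_posted--;	/* the completed WR has been consumed */

	if (!qp->peer_sends_here) {
		/* First packet in the "wrong" direction: post a full ring. */
		qp->peer_sends_here = true;
		qp->rx_posted = RX_RING_DEPTH;
	} else {
		/* Steady state: replace the WR that was just consumed. */
		qp->rx_posted++;
	}
}
```

In this model the connection is never torn down: a peer that sends back
on the same connection pays one slow first packet, after which the QP
behaves like any other receive queue.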

Pradeep


From amar.mudrankit at gmail.com  Mon Jul 30 23:45:06 2007
From: amar.mudrankit at gmail.com (Amar Mudrankit)
Date: Tue, 31 Jul 2007 12:15:06 +0530
Subject: [ofa-general] IPoIB CM Connection establishment
Message-ID: <c8028d330707302345l13c24141ife0b8646930b58af@mail.gmail.com>

Hi all,

          While establishing a connection with the remote node, path is
resolved and REQ is sent by the requester. We get a REP from the peer
indicating that it is ready for this connection establishment.

           At the requester's end, the REP is handled by
ipoib_cm_rep_handler function in which the context of the path is recalled.
All the skbs are then first de-queued, their dev pointers are changed to the
device present in context(skb->dev = p->dev) and again queued for
transmission using dev_queue_xmit.

           Now, when I traced back the initialization of context from
requester point of view, I found it done in function ipoib_cm_create_tx in
which dev argument is the network device corresponding to ipoib interface.

           Hence, what is the difference between skb->dev and p->dev?

           Is p->dev a different network device because of the new
connection? What is the difference between the ipoib net_device and this new
p->dev?

           Precisely, I would like to understand the semantics of this
p->dev.

           Can anyone tell me whether this trace is right and point me to
correct trace if it is wrong?

Regards,
Amar

From vlad at lists.openfabrics.org  Tue Jul 31 01:40:15 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 31 Jul 2007 01:40:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070731-0100 daily build status
Message-ID: <20070731084015.B8378E6086F@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.15
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.19
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:


From sashak at voltaire.com  Tue Jul 31 02:41:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 12:41:36 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h:
	Some comment fixes
In-Reply-To: <f0e08f230707301254r43d81d0cv577c583fb4a328e7@mail.gmail.com>
References: <f0e08f230707301254r43d81d0cv577c583fb4a328e7@mail.gmail.com>
Message-ID: <20070731094136.GD13838@sashak.voltaire.com>

On 15:54 Mon 30 Jul     , Hal Rosenstock wrote:
> include/iba/ib_types.h: Some comment fixes
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From vlad at lists.openfabrics.org  Tue Jul 31 02:49:57 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Tue, 31 Jul 2007 02:49:57 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070731-0200 daily build status
Message-ID: <20070731094957.AE542E60805@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.20
Passed on powerpc with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:


From hal.rosenstock at gmail.com  Tue Jul 31 03:39:34 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 31 Jul 2007 06:39:34 -0400
Subject: [ofa-general] [PATCH] mad.c: Fix memory leak in switch handling and
	improve error handling in ib_mad_recv_done_handler
Message-ID: <f0e08f230707310339y2f8eda60yf35599cdabd7002c@mail.gmail.com>

mad.c: Fix memory leak in switch handling and improve error handling in
ib_mad_recv_done_handler. Also, eliminate no longer needed return value
in agent.c:agent_send_response.

Signed-off-by: Suresh Shelvapille <suri at baymicrosystems.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index db2633e..4c1a1ca 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -78,15 +78,14 @@ ib_get_agent_port(struct ib_device *device, int port_num)
       return entry;
 }

-int agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
-                       struct ib_wc *wc, struct ib_device *device,
-                       int port_num, int qpn)
+void agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
+                        struct ib_wc *wc, struct ib_device *device,
+                        int port_num, int qpn)
 {
       struct ib_agent_port_private *port_priv;
       struct ib_mad_agent *agent;
       struct ib_mad_send_buf *send_buf;
       struct ib_ah *ah;
-       int ret;
       struct ib_mad_send_wr_private *mad_send_wr;

       if (device->node_type == RDMA_NODE_IB_SWITCH)
@@ -96,23 +95,21 @@ int agent_send_response(struct ib_mad *mad, struct ib_grh *grh,

       if (!port_priv) {
               printk(KERN_ERR SPFX "Unable to find port agent\n");
-               return -ENODEV;
+               return;
       }

       agent = port_priv->agent[qpn];
       ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num);
       if (IS_ERR(ah)) {
-               ret = PTR_ERR(ah);
-               printk(KERN_ERR SPFX "ib_create_ah_from_wc error:%d\n", ret);
-               return ret;
+               printk(KERN_ERR SPFX "ib_create_ah_from_wc error\n");
+               return;
       }

       send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, 0,
                                     IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA,
                                     GFP_KERNEL);
       if (IS_ERR(send_buf)) {
-               ret = PTR_ERR(send_buf);
-               printk(KERN_ERR SPFX "ib_create_send_mad error:%d\n", ret);
+               printk(KERN_ERR SPFX "ib_create_send_mad error\n");
               goto err1;
       }

@@ -126,16 +123,16 @@ int agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
               mad_send_wr->send_wr.wr.ud.port_num = port_num;
       }

-       if ((ret = ib_post_send_mad(send_buf, NULL))) {
-               printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret);
+       if (ib_post_send_mad(send_buf, NULL)) {
+               printk(KERN_ERR SPFX "ib_post_send_mad error\n");
               goto err2;
       }
-       return 0;
+       return;
 err2:
       ib_free_send_mad(send_buf);
 err1:
       ib_destroy_ah(ah);
-       return ret;
+       return;
 }

 static void agent_send_handler(struct ib_mad_agent *mad_agent,
diff --git a/drivers/infiniband/core/agent.h b/drivers/infiniband/core/agent.h
index 86d72fa..fb9ed14 100644
--- a/drivers/infiniband/core/agent.h
+++ b/drivers/infiniband/core/agent.h
@@ -46,8 +46,8 @@ extern int ib_agent_port_open(struct ib_device *device, int port_num);

 extern int ib_agent_port_close(struct ib_device *device, int port_num);

-extern int agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
-                              struct ib_wc *wc, struct ib_device *device,
-                              int port_num, int qpn);
+extern void agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
+                               struct ib_wc *wc, struct ib_device *device,
+                               int port_num, int qpn);

 #endif /* __AGENT_H_ */
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index bc547f1..f82900d 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1842,16 +1842,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
 {
       struct ib_mad_qp_info *qp_info;
       struct ib_mad_private_header *mad_priv_hdr;
-       struct ib_mad_private *recv, *response;
+       struct ib_mad_private *recv, *response = NULL;
       struct ib_mad_list_head *mad_list;
       struct ib_mad_agent_private *mad_agent;
       int port_num;

-       response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
-       if (!response)
-               printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory "
-                      "for response buffer\n");
-
       mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id;
       qp_info = mad_list->mad_queue->qp_info;
       dequeue_mad(mad_list);
@@ -1879,6 +1874,13 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
       if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num))
               goto out;

+       response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL);
+       if (!response) {
+               printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory "
+                      "for response buffer\n");
+               goto out;
+       }
+
       if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)
               port_num = wc->port_num;
       else
@@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv,
                        response->header.recv_wc.recv_buf.mad = &response->mad.mad;
                       response->header.recv_wc.recv_buf.grh = &response->grh;

-                       if (!agent_send_response(&response->mad.mad,
-                                                &response->grh, wc,
-                                                port_priv->device,
-                                                smi_get_fwd_port(&recv->mad.smp),
-                                                qp_info->qp->qp_num))
-                               response = NULL;
+                       agent_send_response(&response->mad.mad,
+                                           &response->grh, wc,
+                                           port_priv->device,
+                                           smi_get_fwd_port(&recv->mad.smp),
+                                           qp_info->qp->qp_num);

                       goto out;
               }
@@ -1930,15 +1931,6 @@ local:
       if (port_priv->device->process_mad) {
               int ret;

-               if (!response) {
-                       printk(KERN_ERR PFX "No memory for response MAD\n");
-                       /*
-                        * Is it better to assume that
-                        * it wouldn't be processed ?
-                        */
-                       goto out;
-               }
-
               ret = port_priv->device->process_mad(port_priv->device, 0,
                                                    port_priv->port_num,
                                                    wc, &recv->grh,


From glebn at voltaire.com  Tue Jul 31 04:56:05 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Tue, 31 Jul 2007 14:56:05 +0300
Subject: [ofa-general] Re: Scalable reliable connection
In-Reply-To: <20070730125054.GO9963@mellanox.co.il>
References: <20070730125054.GO9963@mellanox.co.il>
Message-ID: <20070731115605.GU4434@minantech.com>

On Mon, Jul 30, 2007 at 03:50:54PM +0300, Michael S. Tsirkin wrote:
> With SRC:
> 		O(N ^ 2 * J)
> 
> 	This is achieved by using a single send queue (per job, out of O(N * J) jobs)
> 	to send data to all J jobs running on a specific node (out of O(N) nodes).
> 	Hardware uses new "SRQ number" field in packet header to
> 	multiplex receive WRs and WCs to private memory of each job.
> 
But since the send queue cannot be used for receiving packets additional
receive QPs have to be created one per job so with SRC it is actually
    O(N ^ 2 * J + N * J)
unless I am missing something.

> This is a similar idea to IB RD.
Except that with RD there is no need to jump through hoops and create
separate QPs for sending and receiving packets in order to achieve
scalability.

> Q: Why not use RD then?
> A: Because no hardware supports it.
Wrong answer :) There was no HW for SRC either, but Mellanox decided to
implement SRC instead of RD. The reasons Dror gave for this:
a) RD is hard to do
 Not a very sound reason IMO. Not doing RD just pushes
 the complexity from HW to SW. And there are HW implementations of RD,
 though not for IB.
b) RD, as defined by the IB spec, will not achieve good performance
 This reason is serious, but can the spec be changed to allow for a
 high-performance implementation? Spec compliance is not something that
 has stopped Mellanox from doing things before :)

Thanks for protocol explanation.

--
			Gleb.


From mst at dev.mellanox.co.il  Tue Jul 31 05:07:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 15:07:06 +0300
Subject: [ofa-general] Re: Scalable reliable connection
In-Reply-To: <20070731115605.GU4434@minantech.com>
References: <20070730125054.GO9963@mellanox.co.il>
	<20070731115605.GU4434@minantech.com>
Message-ID: <20070731120706.GC9087@mellanox.co.il>

> Quoting Gleb Natapov <glebn at voltaire.com>:
> Subject: Re: Scalable reliable connection
> 
> On Mon, Jul 30, 2007 at 03:50:54PM +0300, Michael S. Tsirkin wrote:
> > With SRC:
> > 		O(N ^ 2 * J)
> > 
> > 	This is achieved by using a single send queue (per job, out of O(N * J) jobs)
> > 	to send data to all J jobs running on a specific node (out of O(N) nodes).
> > 	Hardware uses new "SRQ number" field in packet header to
> > 	multiplex receive WRs and WCs to private memory of each job.
> > 
> But since the send queue cannot be used for receiving packets additional
> receive QPs have to be created one per job so with SRC it is actually
>     O(N ^ 2 * J + N * J)
> unless I am missing something.

Yes but since N >= 1, N ^ 2 >= N and so O(N ^ 2 * J + N * J) == O(N ^ 2 * J).
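
Spelled out, the step is just a two-sided bound (a short derivation using
the same N and J as above; nothing beyond N >= 1 is assumed):

```latex
\[
  N^2 J \;\le\; N^2 J + N J \;\le\; N^2 J + N^2 J \;=\; 2\,N^2 J
  \qquad \text{for } N \ge 1,
\]
\[
  \text{hence}\quad N^2 J + N J = \Theta(N^2 J).
\]
```

So the extra receive QPs at most double the count and do not change the
asymptotic order.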

-- 
MST


From jim at mellanox.com  Tue Jul 31 05:07:03 2007
From: jim at mellanox.com (Jim Mott)
Date: Tue, 31 Jul 2007 05:07:03 -0700
Subject: [ofa-general] [PATCH V1 2/2] sdplib: add KEEPALIVE support
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> 
Message-ID: <F57121538EA0C94F86018DDD40ADA1D16A6891@mtiexch01.mti.com>

Hi,
  This is the user space part of an OFED 1.3 patch to add keepalive
support to SDP.

Diff from OFED 1.2


Index: ofa_user/src/userspace/libsdp/src/port.c
===================================================================
--- ofa_user.orig/src/userspace/libsdp/src/port.c	2007-06-27 15:56:21.000000000 +0300
+++ ofa_user/src/userspace/libsdp/src/port.c	2007-07-03 20:16:47.000000000 +0300
@@ -793,8 +793,21 @@ setsockopt(
 	__sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> level <%d> name <%d>\n",
 				  program_invocation_short_name, fd, shadow_fd, level, optname );
 
+	if (level == SOL_SOCKET && optname == SO_KEEPALIVE && get_is_sdp_socket(fd)) {
+		level = AF_INET_SDP;
+		__sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> substitute level %d\n",
+			   program_invocation_short_name, fd, shadow_fd, level );
+	}
+
 	ret = _socket_funcs.setsockopt( fd, level, optname, optval, optlen );
 	if ( ( ret >= 0 ) && ( shadow_fd != -1 ) ) {
+		if (level == SOL_SOCKET && optname == SO_KEEPALIVE &&
+		    get_is_sdp_socket(shadow_fd)) {
+			level = AF_INET_SDP;
+			__sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> substitute level %d\n",
+				   program_invocation_short_name, fd, shadow_fd, level );
+		}
+
 		sret = _socket_funcs.setsockopt( shadow_fd, level,
 
optname, optval, optlen );
 		if ( sret < 0 ) {


From jim at mellanox.com  Tue Jul 31 05:07:00 2007
From: jim at mellanox.com (Jim Mott)
Date: Tue, 31 Jul 2007 05:07:00 -0700
Subject: [ofa-general] [PATCH V1 1/2] sdp: add KEEPALIVE support
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> 
Message-ID: <F57121538EA0C94F86018DDD40ADA1D16A6892@mtiexch01.mti.com>

Hi,
  This is the kernel part of an OFED 1.3 patch to add keepalive support to
SDP.  There are a couple of things to highlight.

1) No specific 'active' bit
  Instead of setting or clearing some bit on every send or receive, this
code just remembers the TX and RX heads every time the keepalive timer
pops.  If they are the same this pop as last pop, then the probe is
sent.  

2) Counter of all keepalives sent
  The keepalive probe itself is a zero-byte RDMA write (as per the spec).  It does
not generate a CQ entry unless there is a problem.  Since unlike TCP
there is nothing that 'tcpdump' or a sniffer could see on the wire, it
is hard to test that keepalives are being sent in the absence of
problems.
  In order to create an automated test, there is a /sys counter that
gets incremented every time a keepalive is sent.  An argument could be
made to add a counter to each socket, and add some options to get (and
reset) it.  I am open to doing it that way if people think it is better.
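
To make the two points concrete, here is a stand-alone sketch of the idle
check and of how a completion can be told apart by its wr_id. It is an
illustration only, not the patch code: struct ka_state, ka_timer_pop(),
enum wc_kind and classify_wr_id() are hypothetical names, and the order
of the tag checks is an assumption. Only the two SDP_OP_* values and the
fact that the probe is posted with wr_id 0 come from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define SDP_OP_RECV 0x800000000LL
#define SDP_OP_SEND 0x400000000LL

/* Only the fields needed for the sketch; not the real struct sdp_sock. */
struct ka_state {
	unsigned tx_head, rx_head;	/* advance on every send/receive */
	unsigned keepalive_tx_head;	/* snapshots taken at the last pop */
	unsigned keepalive_rx_head;
};

/* Timer pop: returns true when the connection was idle since the last pop. */
static bool ka_timer_pop(struct ka_state *s)
{
	bool idle = s->tx_head == s->keepalive_tx_head &&
		    s->rx_head == s->keepalive_rx_head;

	/* Remember the heads so the next pop can compare against them. */
	s->keepalive_tx_head = s->tx_head;
	s->keepalive_rx_head = s->rx_head;
	return idle;
}

enum wc_kind { WC_RECV, WC_SEND, WC_KEEPALIVE };

/* Classify a completion by the tag bits carried in its wr_id. */
static enum wc_kind classify_wr_id(uint64_t wr_id)
{
	if (wr_id & SDP_OP_RECV)
		return WC_RECV;
	if (wr_id & SDP_OP_SEND)
		return WC_SEND;
	return WC_KEEPALIVE;	/* the probe carries neither tag */
}
```

The point of the snapshot approach is that the data path pays nothing:
no flag is written per send or receive, and the comparison happens only
when the timer pops.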

Diff from OFED 1.2


Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp.h
===================================================================
--- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h	2007-07-16 19:42:32.000000000 +0300
+++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp.h	2007-07-21 03:05:29.000000000 +0300
@@ -42,6 +42,7 @@ extern int sdp_data_debug_level;
 #define SDP_RESOLVE_TIMEOUT 1000
 #define SDP_ROUTE_TIMEOUT 1000
 #define SDP_RETRY_COUNT 5
+#define SDP_KEEPALIVE_TIME (120 * 60)
 
 #define SDP_TX_SIZE 0x40
 #define SDP_RX_SIZE 0x40
@@ -51,6 +52,7 @@ extern int sdp_data_debug_level;
 #define SDP_NUM_WC 4
 
 #define SDP_OP_RECV 0x800000000LL
+#define SDP_OP_SEND 0x400000000LL
 
 enum sdp_mid {
 	SDP_MID_HELLO = 0x0,
@@ -115,6 +117,12 @@ struct sdp_sock {
 
 	int time_wait;
 
+	unsigned keepalive_time;
+
+	/* tx_head/rx_head when keepalive timer started */
+	unsigned keepalive_tx_head;
+	unsigned keepalive_rx_head;
+
 	/* Data below will be reset on error */
 	/* rdma specific */
 	struct rdma_cm_id *id;
@@ -221,5 +229,7 @@ void sdp_urg(struct sdp_sock *ssk, struc
 void sdp_add_sock(struct sdp_sock *ssk);
 void sdp_remove_sock(struct sdp_sock *ssk);
 void sdp_remove_large_sock(void);
+void sdp_post_keepalive(struct sdp_sock *ssk);
+void sdp_start_keepalive_timer(struct sock *sk);
 
 #endif
Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c
===================================================================
--- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c	2007-07-16 19:42:32.000000000 +0300
+++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c	2007-07-16 23:00:04.000000000 +0300
@@ -60,6 +60,12 @@ static int max_large_sockets = 1000;
 module_param_named(max_large_sockets, max_large_sockets, int, 0644);
 MODULE_PARM_DESC(max_large_sockets, "Max number of large sockets (32k buffers).");
 
+#define sdp_cnt(var) do { (var)++; } while (0)
+static unsigned sdp_keepalive_probes_sent = 0;
+
+module_param_named(sdp_keepalive_probes_sent, sdp_keepalive_probes_sent, uint, 0644);
+MODULE_PARM_DESC(sdp_keepalive_probes_sent, "Total number of keepalive probes sent.");
+
 static int curr_large_sockets = 0;
 atomic_t sdp_current_mem_usage;
 spinlock_t sdp_large_sockets_lock;
@@ -107,6 +113,31 @@ static void sdp_fin(struct sock *sk)
 	}
 }
 
+void sdp_post_keepalive(struct sdp_sock *ssk)
+{
+	int rc;
+	struct ib_send_wr wr, *bad_wr;
+
+	sdp_dbg(&ssk->isk.sk, "%s\n", __func__);
+
+	memset(&wr, 0, sizeof(wr));
+
+	wr.next    = NULL;
+	wr.wr_id   = 0;
+	wr.sg_list = NULL;
+	wr.num_sge = 0;
+	wr.opcode  = IB_WR_RDMA_WRITE;
+
+	rc = ib_post_send(ssk->qp, &wr, &bad_wr);
+	if (rc) {
+		sdp_dbg(&ssk->isk.sk, "ib_post_keepalive failed with status %d.\n", rc);
+		sdp_set_error(&ssk->isk.sk, -ECONNRESET);
+		wake_up(&ssk->wq);
+	}
+
+	sdp_cnt(sdp_keepalive_probes_sent);
+}
+
 void sdp_post_send(struct sdp_sock *ssk, struct sk_buff *skb, u8 mid)
 {
 	struct sdp_buf *tx_req;
@@ -158,7 +189,7 @@ void sdp_post_send(struct sdp_sock *ssk,
 	}
 
 	ssk->tx_wr.next = NULL;
-	ssk->tx_wr.wr_id = ssk->tx_head;
+	ssk->tx_wr.wr_id = ssk->tx_head | SDP_OP_SEND;
 	ssk->tx_wr.sg_list = ssk->ibsge;
 	ssk->tx_wr.num_sge = frags + 1;
 	ssk->tx_wr.opcode = IB_WR_SEND;
@@ -604,7 +635,7 @@ static void sdp_handle_wc(struct sdp_soc
 				__kfree_skb(skb);
 			}
 		}
-	} else {
+	} else if (likely(wc->wr_id & SDP_OP_SEND)) {
 		skb = sdp_send_completion(ssk, wc->wr_id);
 		if (unlikely(!skb))
 			return;
@@ -620,6 +651,22 @@ static void sdp_handle_wc(struct sdp_soc
 		}
 
 		sk_stream_write_space(&ssk->isk.sk);
+	} else {
+		sdp_cnt(sdp_keepalive_probes_sent);
+
+		if (likely(!wc->status))
+			return;
+
+		sdp_dbg(&ssk->isk.sk, " %s consumes KEEPALIVE status %d\n",
+		        __func__, wc->status);
+
+		if (wc->status == IB_WC_WR_FLUSH_ERR)
+			return;
+
+		sdp_set_error(&ssk->isk.sk, -ECONNRESET);
+		wake_up(&ssk->wq);
+
+		return;
 	}
 
 	if (likely(!wc->status)) {
Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c
===================================================================
--- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_cma.c	2007-07-16 19:42:32.000000000 +0300
+++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c	2007-07-16 23:00:04.000000000 +0300
@@ -270,8 +270,8 @@ static int sdp_response_handler(struct s
 
 	sk->sk_state = TCP_ESTABLISHED;
 
-	/* TODO: If SOCK_KEEPOPEN set, need to reset and start
-	   keepalive timer here */
+	if (sock_flag(sk, SOCK_KEEPOPEN))
+		sdp_start_keepalive_timer(sk);
 
 	if (sock_flag(sk, SOCK_DEAD))
 		return 0;
@@ -311,8 +311,8 @@ int sdp_connected_handler(struct sock *s
 
 	sk->sk_state = TCP_ESTABLISHED;
 
-	/* TODO: If SOCK_KEEPOPEN set, need to reset and start
-	   keepalive timer here */
+	if (sock_flag(sk, SOCK_KEEPOPEN))
+		sdp_start_keepalive_timer(sk);
 
 	if (sock_flag(sk, SOCK_DEAD))
 		return 0;
Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_main.c
===================================================================
--- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c	2007-07-16 19:42:38.000000000 +0300
+++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_main.c	2007-07-21 03:10:14.000000000 +0300
@@ -117,6 +117,11 @@ static int send_poll_thresh = 8192;
 module_param_named(send_poll_thresh, send_poll_thresh, int, 0644);
 MODULE_PARM_DESC(send_poll_thresh, "Send message size thresh hold over which to start polling.");
 
+static unsigned int sdp_keepalive_time = SDP_KEEPALIVE_TIME;
+
+module_param_named(sdp_keepalive_time, sdp_keepalive_time, uint, 0644);
+MODULE_PARM_DESC(sdp_keepalive_time, "Default idle time in seconds before keepalive probe sent.");
+
 struct workqueue_struct *sdp_workqueue;
 
 static struct list_head sock_list;
@@ -124,6 +129,11 @@ static spinlock_t sock_list_lock;
 
 DEFINE_RWLOCK(device_removal_lock);
 
+static inline unsigned int sdp_keepalive_time_when(const struct sdp_sock *ssk)
+{
+	return ssk->keepalive_time ? : sdp_keepalive_time * HZ;
+}
+
 inline void sdp_add_sock(struct sdp_sock *ssk)
 {
 	spin_lock_irq(&sock_list_lock);
@@ -221,6 +231,86 @@ static void sdp_destroy_qp(struct sdp_so
 	kfree(ssk->tx_ring);
 }
 
+
+static void sdp_reset_keepalive_timer(struct sock *sk, unsigned long len)
+{
+	struct sdp_sock *ssk = sdp_sk(sk);
+
+	sdp_dbg(sk, "%s\n", __func__);
+
+	ssk->keepalive_tx_head = ssk->tx_head;
+	ssk->keepalive_rx_head = ssk->rx_head;
+
+	sk_reset_timer(sk, &sk->sk_timer, jiffies + len);
+}
+
+static void sdp_delete_keepalive_timer(struct sock *sk)
+{
+	struct sdp_sock *ssk = sdp_sk(sk);
+
+	sdp_dbg(sk, "%s\n", __func__);
+
+	ssk->keepalive_tx_head = 0;
+	ssk->keepalive_rx_head = 0;
+
+	sk_stop_timer(sk, &sk->sk_timer);
+}
+
+static void sdp_keepalive_timer(unsigned long data)
+{
+	struct sock *sk = (struct sock *)data;
+	struct sdp_sock *ssk = sdp_sk(sk);
+
+	sdp_dbg(sk, "%s\n", __func__);
+
+	/* Only process if the socket is not in use */
+	bh_lock_sock(sk);
+	if (sock_owned_by_user(sk)) {
+		sdp_reset_keepalive_timer(sk, HZ / 20);
+		goto out;
+	}
+
+	if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_LISTEN ||
+	    sk->sk_state == TCP_CLOSE)
+		goto out;
+
+	if (ssk->keepalive_tx_head == ssk->tx_head &&
+	    ssk->keepalive_rx_head == ssk->rx_head)
+		sdp_post_keepalive(ssk);
+
+	sdp_reset_keepalive_timer(sk, sdp_keepalive_time_when(ssk));
+
+out:
+	bh_unlock_sock(sk);
+	sock_put(sk);
+}
+
+static void sdp_init_timer(struct sock *sk)
+{
+	init_timer(&sk->sk_timer);
+
+	sk->sk_timer.function = sdp_keepalive_timer;
+	sk->sk_timer.data = (unsigned long)sk;
+}
+
+static void sdp_set_keepalive(struct sock *sk, int val)
+{
+	sdp_dbg(sk, "%s %d\n", __func__, val);
+
+	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
+		return;
+
+	if (val && !sock_flag(sk, SOCK_KEEPOPEN))
+		sdp_start_keepalive_timer(sk);
+	else if (!val)
+		sdp_delete_keepalive_timer(sk);
+}
+
+void sdp_start_keepalive_timer(struct sock *sk)
+{
+	sdp_reset_keepalive_timer(sk, sdp_keepalive_time_when(sdp_sk(sk)));
+}
+
 void sdp_reset_sk(struct sock *sk, int rc)
 {
 	struct sdp_sock *ssk = sdp_sk(sk);
@@ -365,6 +455,8 @@ static void sdp_close(struct sock *sk, l
 
 	sdp_dbg(sk, "%s\n", __func__);
 
+	sdp_delete_keepalive_timer(sk);
+
 	sk->sk_shutdown = SHUTDOWN_MASK;
 	if (sk->sk_state == TCP_LISTEN || sk->sk_state == TCP_SYN_SENT) {
 		sdp_set_state(sk, TCP_CLOSE);
@@ -818,9 +910,6 @@ static int sdp_setsockopt(struct sock *s
 	int err = 0;
 
 	sdp_dbg(sk, "%s\n", __func__);
-	if (level != SOL_TCP)
-		return -ENOPROTOOPT;
-
 	if (optlen < sizeof(int))
 		return -EINVAL;
 
@@ -829,6 +918,28 @@ static int sdp_setsockopt(struct sock *s
 
 	lock_sock(sk);
 
+	/* SO_KEEPALIVE is really a SOL_SOCKET level option but there
+	 * is a problem handling it at that level.  In order to start
+	 * the keepalive timer on an SDP socket, we must call an SDP
+	 * specific routine.  Since sock_setsockopt() cannot be modified
+	 * to understand SDP, the application must pass that option
+	 * through to us.  Since SO_KEEPALIVE and TCP_DEFER_ACCEPT both
+	 * use the same optname, the level must not be SOL_TCP or
+	 * SOL_SOCKET.
+	 */
+	if (level == PF_INET_SDP && optname == SO_KEEPALIVE) {
+		sdp_set_keepalive(sk, val);
+		if (val)
+			sock_set_flag(sk, SOCK_KEEPOPEN);
+		else
+			sock_reset_flag(sk, SOCK_KEEPOPEN);
+		goto out;
+	}
+
+	if (level != SOL_TCP) {
+		err = -ENOPROTOOPT;
+		goto out;
+	}
+
 	switch (optname) {
 	case TCP_NODELAY:
 		if (val) {
@@ -867,11 +978,23 @@ static int sdp_setsockopt(struct sock *s
 			sdp_push_pending_frames(sk);
 		}
 		break;
+	case TCP_KEEPIDLE:
+		if (val < 1 || val > MAX_TCP_KEEPIDLE)
+			err = -EINVAL;
+		else {
+			ssk->keepalive_time = val * HZ;
+
+			if (sock_flag(sk, SOCK_KEEPOPEN) &&
+			    !((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
+				sdp_reset_keepalive_timer(sk, ssk->keepalive_time);
+		}
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
 	}
 
+out:
 	release_sock(sk);
 	return err;
 }
@@ -904,6 +1027,9 @@ static int sdp_getsockopt(struct sock *s
 	case TCP_CORK:
 		val = !!(ssk->nonagle&TCP_NAGLE_CORK);
 		break;
+	case TCP_KEEPIDLE:
+		val = ssk->keepalive_time ? ssk->keepalive_time / HZ : sdp_keepalive_time;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -1687,6 +1813,8 @@ static int sdp_create_socket(struct sock
 
 	sk->sk_destruct = sdp_destruct;
 
+	sdp_init_timer(sk);
+
 	sock->ops = &sdp_proto_ops;
 	sock->state = SS_UNCONNECTED;
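
The idle check in sdp_keepalive_timer() can be tried out in user space. The following is only a sketch of that logic, not the driver code: struct ka_model, MODEL_HZ and the 7200 second default are assumptions made for illustration, and the "probe" is just a counter.

```c
#include <assert.h>

#define MODEL_HZ 1000UL                 /* assumed tick rate for the model */
#define MODEL_DEFAULT_IDLE_SECS 7200UL  /* assumed default, not taken from the patch */

/* Hypothetical stand-in for the keepalive fields the patch adds to struct sdp_sock. */
struct ka_model {
	unsigned tx_head, rx_head;                      /* advance whenever data moves */
	unsigned keepalive_tx_head, keepalive_rx_head;  /* snapshot taken when the timer is armed */
	unsigned long keepalive_time;                   /* per-socket idle time in ticks, 0 = default */
	unsigned probes_sent;
};

/* Mirrors sdp_keepalive_time_when(): the per-socket value wins, else the default. */
unsigned long ka_time_when(const struct ka_model *s)
{
	return s->keepalive_time ? s->keepalive_time
				 : MODEL_DEFAULT_IDLE_SECS * MODEL_HZ;
}

/* Mirrors sdp_reset_keepalive_timer(): remember where both rings stood. */
void ka_arm(struct ka_model *s)
{
	s->keepalive_tx_head = s->tx_head;
	s->keepalive_rx_head = s->rx_head;
}

/* Mirrors the core of sdp_keepalive_timer(): probe only if nothing moved
 * since the timer was armed, then re-arm. Returns the next timeout in ticks. */
unsigned long ka_expire(struct ka_model *s)
{
	if (s->keepalive_tx_head == s->tx_head &&
	    s->keepalive_rx_head == s->rx_head)
		s->probes_sent++;
	ka_arm(s);
	return ka_time_when(s);
}

/* Walks one busy interval, then two idle ones; returns the probe count. */
unsigned ka_demo(void)
{
	struct ka_model s = { 0 };

	ka_arm(&s);
	s.tx_head++;                        /* traffic during the first interval */
	ka_expire(&s);
	assert(s.probes_sent == 0);         /* busy connection: no probe */

	ka_expire(&s);                      /* second interval was idle */
	assert(s.probes_sent == 1);

	s.keepalive_time = 30 * MODEL_HZ;   /* what TCP_KEEPIDLE = 30 would store */
	assert(ka_expire(&s) == 30 * MODEL_HZ);
	return s.probes_sent;
}
```

Calling ka_demo() walks through the three cases; the point of the head snapshots is that a connection carrying traffic never sends a probe, whatever the idle time is set to.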


From monisonlists at gmail.com  Tue Jul 31 06:33:20 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Tue, 31 Jul 2007 16:33:20 +0300
Subject: [ofa-general] Re: [PATCH V3 7/7] net/bonding: Delay sending of
	gratuitous ARP to avoid failure
In-Reply-To: <19319.1185827384@death>
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com>
	<19319.1185827384@death>
Message-ID: <46AF3A20.8080700@gmail.com>

Jay Vosburgh wrote:
> Moni Shoua <monis at voltaire.com> wrote:
> 
>> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit
>> in dev->state field is on. This improves the chances for the arp packet to
>> be transmitted.
> 
> 	Under what circumstances were you seeing problems that delaying
> the gratuitous ARP until linkwatch is done improves things?  Is this
> really an IB thing, or did you experience problems here over regular
> ethernet?
> 

I tried to work out how the state/flags of the device differ between the cases
where sending the gratuitous ARP succeeds and where it fails. I found an exact
correlation with the LINK_STATE_LINKWATCH_PENDING bit being set.
I don't think this is an IB issue, but I can't be sure: I didn't run tests
over Ethernet.

>> Signed-off-by: Moni Shoua <monis at voltaire.com>
>> ---
>> drivers/net/bonding/bond_main.c |   25 +++++++++++++++++++++----
>> drivers/net/bonding/bonding.h   |    1 +
>> 2 files changed, 22 insertions(+), 4 deletions(-)
>>
>> Index: net-2.6/drivers/net/bonding/bond_main.c
>> ===================================================================
>> --- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-07-25 15:33:25.000000000 +0300
>> +++ net-2.6/drivers/net/bonding/bond_main.c	2007-07-26 18:42:59.296296622 +0300
>> @@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon
>> 		if (new_active && !bond->do_set_mac_addr)
>> 			memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
>> 				new_active->dev->addr_len);
>> -
>> -		bond_send_gratuitous_arp(bond);
>> +		if (bond->curr_active_slave &&
>> +			test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)){
>> +			dprintk("delaying gratuitous arp on %s\n",bond->curr_active_slave->dev->name);
>> +			bond->send_grat_arp=1;
>> +		}else{
>> +			bond_send_gratuitous_arp(bond);
>> +		}
> 
> 	Style issues throughout the patch series: many lines are too
> long, many things are all smashed together, e.g., "}else{" instead of
> "} else {", "send_grat_arp=1" instead of "send_grat_arp = 1", and so on.
> 
OK thanks. I'll fix and repost.
>> 	}
>> }
>>
>> @@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device 
>> 	 * program could monitor the link itself if needed.
>> 	 */
>>
>> +	if (bond->send_grat_arp) {
>> +		if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state))
>> +			dprintk("Needs to send gratuitous arp but not yet\n");
>> +		else {
>> +			dprintk("sending delayed gratuitous arp on %s\n", bond->curr_active_slave ? bond->curr_active_slave->dev->name : "NULL");
>> +			bond_send_gratuitous_arp(bond);
>> +			bond->send_grat_arp=0;
>> +		}
>> +	}
> 
> 
>> 	read_lock(&bond->curr_slave_lock);
>> 	oldcurrent = bond->curr_active_slave;
>> 	read_unlock(&bond->curr_slave_lock);
>> @@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str
>> 	struct slave *slave = bond->curr_active_slave;
>> 	struct vlan_entry *vlan;
>> 	struct net_device *vlan_dev;
>> +	int i;
>>
>> 	dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name,
>> 				slave ? slave->dev->name : "NULL");
>> @@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str
>> 		return;
>>
>> 	if (bond->master_ip) {
>> -		bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>> -				  bond->master_ip, 0);
>> +		for (i=0;i<3;i++)
>> +			bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>> +					  bond->master_ip, 0);
>> 	}
> 
> 	If you delay the grat ARP until linkwatch is done, why is it
> also necessary to shotgun several ARPs instead of one?  Why are the ARPs
> sent for VLANs not also shotgunned in a similar fashion?
Besides the linkwatch issue, I also noticed that on rare occasions gratuitous ARPs
that reached the slave's xmit function were not transmitted.
Sending three times is just another attempt to improve the chances.

I'd like to emphasize that with IB slaves, the gratuitous ARP is much more crucial
to a successful change of slaves, and that was my focus.

> 	If shotgunning like this really is useful, would it not make
> more sense to space them out a bit, e.g., over successive monitor
> passes?
> 
I guess you are right about that.
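
Spacing the ARPs over successive monitor passes could look roughly like the sketch below. It is a user-space model under assumptions, not a proposed patch: struct bond_model and its fields are hypothetical, grat_arp_left is a counter replacing the 0/1 send_grat_arp flag, and the actual send is reduced to incrementing arps_sent.

```c
#include <assert.h>

#define GRAT_ARP_COUNT 3   /* same count as the loop in the patch */

/* Hypothetical model of the bond state; field names are illustrative. */
struct bond_model {
	int linkwatch_pending;  /* stands in for __LINK_STATE_LINKWATCH_PENDING on the active slave */
	int grat_arp_left;      /* ARPs still owed after a failover */
	int arps_sent;
};

/* On failover: owe GRAT_ARP_COUNT ARPs instead of sending them back to back. */
void bond_model_failover(struct bond_model *b)
{
	b->grat_arp_left = GRAT_ARP_COUNT;
}

/* One monitor pass: send at most one ARP, and none while linkwatch is pending. */
void bond_model_monitor_pass(struct bond_model *b)
{
	if (!b->grat_arp_left || b->linkwatch_pending)
		return;
	b->arps_sent++;         /* bond_send_gratuitous_arp() would be called here */
	b->grat_arp_left--;
}

/* Failover while linkwatch is pending, then five monitor passes. */
int bond_model_demo(void)
{
	struct bond_model b = { .linkwatch_pending = 1 };
	int pass;

	bond_model_failover(&b);
	bond_model_monitor_pass(&b);    /* still pending: nothing is sent */
	assert(b.arps_sent == 0);

	b.linkwatch_pending = 0;
	for (pass = 0; pass < 5; pass++)
		bond_model_monitor_pass(&b);
	assert(b.grat_arp_left == 0);   /* all owed ARPs went out, one per pass */
	return b.arps_sent;
}
```

With a counter, the delay for linkwatch and the repetition use the same field, and the ARPs end up one monitor interval apart.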
>> 	list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>> @@ -4331,6 +4347,7 @@ static int bond_init(struct net_device *
>> 	bond->current_arp_slave = NULL;
>> 	bond->primary_slave = NULL;
>> 	bond->dev = bond_dev;
>> +	bond->send_grat_arp=0;
>> 	INIT_LIST_HEAD(&bond->vlan_list);
>>
>> 	/* Initialize the device entry points */
>> Index: net-2.6/drivers/net/bonding/bonding.h
>> ===================================================================
>> --- net-2.6.orig/drivers/net/bonding/bonding.h	2007-07-25 15:20:10.000000000 +0300
>> +++ net-2.6/drivers/net/bonding/bonding.h	2007-07-26 18:42:43.652087660 +0300
>> @@ -203,6 +203,7 @@ struct bonding {
>> 	struct   vlan_group *vlgrp;
>> 	struct   packet_type arp_mon_pt;
>> 	s8       do_set_mac_addr;
>> +	int	 send_grat_arp;
> 
> 	This need not be a full int, and (this applies to
> do_set_mac_addr, also) could probably be squeezed into gaps already
> existing within the struct bonding somewhere.
Thanks. Will be fixed.
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From dotanb at dev.mellanox.co.il  Tue Jul 31 06:37:11 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 16:37:11 +0300
Subject: [ofa-general] [PATCH] rdma/ib_mad.h: add include to linux/list.h
Message-ID: <200707311637.11327.dotanb@dev.mellanox.co.il>

ib_mad.h uses struct list_head, so while
linux/list.h seems to be pulled in indirectly
by one of the headers it includes, the right thing
is to include linux/list.h directly.

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 30712dd..8ec3799 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -39,6 +39,8 @@
 #if !defined( IB_MAD_H )
 #define IB_MAD_H
 
+#include <linux/list.h>
+
 #include <rdma/ib_verbs.h>
 
 /* Management base version */


From tziporet at mellanox.co.il  Tue Jul 31 06:40:06 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 31 Jul 2007 16:40:06 +0300
Subject: [ofa-general] OFED July 30 meeting minutes
Message-ID: <6C2C79E72C305246B504CBA17B5500C9015639DD@mtlexch01.mtl.com>

OFED July 30 meeting summary
============================

1. Decided to have only one release in August: OFED 1.2.c.
   The main reasons are to keep all companies focused and to save
   verification and QA effort.
2. Bugzilla: Everybody is requested to review the open bugs and
   decide what action is needed.

3. Status update: 
   a. OFED 1.2.c:
      - 1.2.c-10 will be available tomorrow (Aug-1).
      - 1.2.c release is targeted for Aug 8.
   b. OFED 1.3:
      - Kernel code base was changed to 2.6.23-rc1
      - new install scripts and spec files should be ready next week.
      - Other features - on track for now (no special updates).
      - Not clear if Open MPI will be ready with support for the new SRC
object.

Action Items:
=============
1. Chelsio and IBM (that requested 1.2.1 release) - make sure all your
changes are committed to 1.2.c branch

Reminder: Feature freeze for OFED 1.3 is targeted to Sep 4.

Tziporet

From monisonlists at gmail.com  Tue Jul 31 06:44:08 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Tue, 31 Jul 2007 16:44:08 +0300
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support
	for	the bonding driver
In-Reply-To: <adaejipr3l5.fsf@cisco.com>
References: <46ADDB89.5030601@voltaire.com> <adaejipr3l5.fsf@cisco.com>
Message-ID: <46AF3CA8.6050201@gmail.com>

Roland Dreier wrote:
>  > 1. When bonding enslaves an IPoIB device the bonding neighbor holds a 
>  > reference to a cleanup function in the IPoIB drives. This makes it unsafe to 
>  > unload the IPoIB module if there are bonding neighbors in the air. So, to 
>  > avoid this race one must unload bonding before unloading IPoIB. 
> 
> I think we really want to resolve this somehow.  Getting an oops by
> doing "modprobe -r ipoib" isn't that friendly.
> 
You are right, and we want to resolve that.
One way is to clear the neigh destructor function from all IPoIB neighs.
The other is to prevent ipoib from unloading while a device is a slave or is
referenced from somewhere else.

I would appreciate advice here.
> Also, what happened to the problem of having an address handle
> belonging to the wrong device on bond failover?  Did you figure out a
> way to fix that one?
This is what patch 2 handles.
> 
>  - R.


From dotanb at dev.mellanox.co.il  Tue Jul 31 06:49:15 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 16:49:15 +0300
Subject: [ofa-general] [PATCH] rdma/ib_verbs.h: add include to linux/list.h
	and linux/rwsem.h
Message-ID: <200707311649.15573.dotanb@dev.mellanox.co.il>

ib_verbs.h uses the structs list_head and rw_semaphore, so while
the files linux/list.h and linux/rwsem.h seem to be pulled in indirectly
by the other header files it includes, the right thing is to include those
files directly.

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0627a6a..7a99f11 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -46,6 +46,8 @@
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
 #include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/rwsem.h>
 
 #include <asm/atomic.h>
 #include <asm/scatterlist.h>


From erezz at voltaire.com  Tue Jul 31 06:51:31 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 31 Jul 2007 16:51:31 +0300
Subject: [ofa-general] OFED 1.2.c-9 is available
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
Message-ID: <46AF3E63.20204@voltaire.com>

Tziporet,


Does 1.2.c-9 also include FMR support? If not, should we wait for 1.2.c-10?


Thanks,

Erez


Tziporet Koren wrote:

> Hi All,
>
> OFED 1.2.c-9 is available now on the OFA server under:
> http://www.openfabrics.org/builds/connectx/release/
> Note: this release was tested with FW 2.1.000 that will soon be
> available on Mellanox web site for download.
>
> Supported Platforms and Operating Systems
> =================================
> o CPU architectures:
>         - x86_64
>         - x86
>         - ppc64
>         - ia64
>
> o Linux Operating Systems:
>         - RedHat EL4 up3: 2.6.9-34.ELsmp
>         - RedHat EL4 up4: 2.6.9-42.ELsmp
>         - RedHat EL4 up5: 2.6.9-55.ELsmp
>         - RedHat EL5: 2.6.18-8.el5
>         - SLES10: 2.6.16.21-0.8-smp
>         - kernel.org: 2.6.20.x
>         - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)
>
> Main changes from OFED 1.2.c-8:
> =========================
> 1. Kernel oops in IPoIB on restart of the driver.
> 2. IPoIB CM is now the default.
> 3. MPI with SRQ is supported.
> 4. Itanium is now supported.
>
> mlx4 Fixed Bugs and Enhancements
> ===========================
> - Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428.
> - Query QP and query SRQ are now supported.
> - Internal error flow was added.
> - Number of QPs that can be attached to the same multicast group was
>   increased to 56.
> - SRQ is now supported.
> - Fork is now supported.
>
> ConnectX specific known issues and limitations
> ===================================
> - The following commands and/or features are not supported:
>   o Resize CQ
>   o FMRs
>   o APM
>   o SQD
> - ibstat does not present all entries. Use ibv_devinfo instead.
> - To load the driver on machines with a 64KB default page size, the UAR BAR
>   must be enlarged. A 64KB page size is the default on PPC with RHEL5 and on
>   Itanium with 64KB page size enabled.
>   Perform the following three steps:
>   1. Add the following line in the firmware configuration (INI) file under
>      the [HCA] section:
>        log2_uar_bar_megabytes = 5
>   2. Burn a modified firmware image with the changed INI file
>   3. Reboot the system
>
>
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
>


From eitan at mellanox.co.il  Tue Jul 31 06:58:40 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 31 Jul 2007 16:58:40 +0300
Subject: [ofa-general] RE: [PATCH][TIRIVIAL] ibdm/src/osm_check.cpp: Add
	missing include file
In-Reply-To: <f0e08f230707301054n50fe34adn68ddd6779d228d2c@mail.gmail.com>
References: <f0e08f230707301054n50fe34adn68ddd6779d228d2c@mail.gmail.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF1344@mtlexch01.mtl.com>

Applied. Thanks Hal.

Eitan Zahavi
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
> Sent: Monday, July 30, 2007 8:54 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General
> Subject: [PATCH][TIRIVIAL] ibdm/src/osm_check.cpp: Add 
> missing include file
> 
> ibdm/src/osm_check.cpp: Add missing include file
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> 
> diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp 
> index 49215c2..f24eec6 100644
> --- a/ibdm/src/osm_check.cpp
> +++ b/ibdm/src/osm_check.cpp
> @@ -35,6 +35,7 @@
>  #include "Fabric.h"
>  #include "SubnMgt.h"
>  #include "CredLoops.h"
> +#include <unistd.h>
>  #include <getopt.h>
>  #include <fstream>
> 


From dotanb at dev.mellanox.co.il  Tue Jul 31 07:04:07 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 17:04:07 +0300
Subject: [ofa-general] should the macro IB_UMEM_MAX_PAGE_CHUNK be removed
	from ib_verbs?
Message-ID: <46AF4157.3090007@dev.mellanox.co.il>

Hi Roland.

Here is the code of the macro IB_UMEM_MAX_PAGE_CHUNK (from ib_verbs):

#define IB_UMEM_MAX_PAGE_CHUNK                                          \
        ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /      \
         ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -        \
          (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))


The problem is that struct ib_umem_chunk is no longer declared in this header
file (ib_verbs.h); its declaration is now in ib_umem.h.
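
To show what the macro computes and why it depends on the full structure definition, here is a user-space sketch. The layout below is hypothetical and only shaped like ib_umem_chunk (a header followed by an array of per-page entries); MODEL_PAGE_SIZE is an assumed value, and sizeof is used in place of the pointer subtraction.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumed; the kernel macro uses PAGE_SIZE */

/* Hypothetical per-page entry; stands in for the real page_list element type. */
struct model_entry {
	void *page;
	unsigned offset;
	unsigned length;
};

/* Hypothetical chunk: a small header followed by a flexible array of entries. */
struct model_chunk {
	struct model_chunk *next;   /* stands in for the list linkage */
	int nents;
	int nmap;
	struct model_entry page_list[];
};

/* Same idea as IB_UMEM_MAX_PAGE_CHUNK: how many entries fit in one page
 * after the header. Every term refers to the struct, so its complete
 * definition must be visible wherever the macro is expanded. */
#define MODEL_MAX_PAGE_CHUNK \
	((MODEL_PAGE_SIZE - offsetof(struct model_chunk, page_list)) / \
	 sizeof(((struct model_chunk *) 0)->page_list[0]))

/* Number of entries that fit in one page-sized chunk. */
size_t model_max_page_chunk(void)
{
	return MODEL_MAX_PAGE_CHUNK;
}

/* Bytes needed for a chunk holding n entries. */
size_t model_chunk_bytes(size_t n)
{
	return offsetof(struct model_chunk, page_list) +
	       n * sizeof(struct model_entry);
}
```

A chunk with model_max_page_chunk() entries fits in one page and one more entry does not, which is the property the allocation code relies on.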

This problem can be fixed in one of two ways:
1) include ib_umem.h from ib_verbs.h
2) move this macro to ib_umem.h

what do you think?

thanks
Dotan


From mst at dev.mellanox.co.il  Tue Jul 31 07:04:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 17:04:36 +0300
Subject: [ofa-general] Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support
	for	the bonding driver
In-Reply-To: <46AF3CA8.6050201@gmail.com>
References: <46ADDB89.5030601@voltaire.com> <adaejipr3l5.fsf@cisco.com>
	<46AF3CA8.6050201@gmail.com>
Message-ID: <20070731140436.GA16015@mellanox.co.il>

> Quoting Moni Shoua <monisonlists at gmail.com>:
> Subject: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver
> 
> Roland Dreier wrote:
> >  > 1. When bonding enslaves an IPoIB device the bonding neighbor holds a 
> >  > reference to a cleanup function in the IPoIB drives. This makes it unsafe to 
> >  > unload the IPoIB module if there are bonding neighbors in the air. So, to 
> >  > avoid this race one must unload bonding before unloading IPoIB. 
> > 
> > I think we really want to resolve this somehow.  Getting an oops by
> > doing "modprobe -r ipoib" isn't that friendly.
> > 
> You are right and we want to resolve that.
> One way is to clean the neigh destructor function from all IPoIB neighs.
> The other way is to prevent ipoib unload if device is a slave or is referenced from 
> somewhere else.
> 
> I guess I would like an advice here.

I had this idea:


Maybe we could use hard_header_cache/header_cache_update methods instead of
neighbour cleanup calls.

To do this, it is possible that we'll have to switch from storing pointers
inside the neighbour to keeping an index there, but I expect the
performance impact to be minimal.

As a result, bonding would not have to copy pointers into ipoib module
and module removal would get fixed.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 31 07:05:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 17:05:32 +0300
Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed
	from ib_verbs?
In-Reply-To: <46AF4157.3090007@dev.mellanox.co.il>
References: <46AF4157.3090007@dev.mellanox.co.il>
Message-ID: <20070731140532.GB16015@mellanox.co.il>

> Quoting Dotan Barak <dotanb at dev.mellanox.co.il>:
> Subject: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs?
> 
> Hi Roland.
> 
> Here is the code of the macro IB_UMEM_MAX_PAGE_CHUNK (from ib_verbs):
> 
> #define IB_UMEM_MAX_PAGE_CHUNK                                          \
>        ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /      \
>         ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -        \
>          (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
> 
> 
> 
> The problem is that the struct ib_umem_chunk is no longer in this header 
> file (ib_verbs.h), the declaration
> of this structure can be found in ib_umem.h.
> 
> This problem can be fixed by one of the following:
> 1) add include to ib_umem.h in ib_verbs.h
> 2) move this macro to ib_umem.h
> 
> what do you think?

2 I guess.

-- 
MST


From tziporet at dev.mellanox.co.il  Tue Jul 31 07:16:12 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 31 Jul 2007 17:16:12 +0300
Subject: [ewg] Re: [ofa-general] OFED 1.2.c-9 is available
In-Reply-To: <46AF3E63.20204@voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<46AF3E63.20204@voltaire.com>
Message-ID: <46AF442C.6020904@mellanox.co.il>

Erez Zilber wrote:
> Does 1.2.c-9 also include FMR support? Else, should we wait for 1.2.c-10?
>   
FMRs are not supported in 1.2.c-9.
1.2.c-10 supports FMRs and it will be available tomorrow

Tziporet


From ogerlitz at voltaire.com  Tue Jul 31 07:19:12 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 31 Jul 2007 17:19:12 +0300
Subject: [ofa-general] Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB
	support for	the bonding driver
In-Reply-To: <20070731140436.GA16015@mellanox.co.il>
References: <46ADDB89.5030601@voltaire.com>
	<adaejipr3l5.fsf@cisco.com>	<46AF3CA8.6050201@gmail.com>
	<20070731140436.GA16015@mellanox.co.il>
Message-ID: <46AF44E0.50700@voltaire.com>

Michael S. Tsirkin wrote:
> Maybe we could use hard_header_cache/header_cache_update methods instead of
> neighbour cleanup calls.

> To do this, it is possible that we'll have to switch from storing pointers
> inside the neighbour to keeping an index there, but I expect the
> performance impact to be minimal.
> 
> As a result, bonding would not have to copy pointers into ipoib module
> and module removal would get fixed.

To be precise, bonding will copy all the symbols it copies today from 
the slave module (ipoib), see bond_setup_by_slave() in patch 3/7, except 
for the neighbour cleanup callback, which is copied by copying the 
neigh_setup function.

Or.


From ogerlitz at voltaire.com  Tue Jul 31 07:21:47 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 31 Jul 2007 17:21:47 +0300
Subject: [ofa-general] Re: [ewg] mlx4/fmr support
In-Reply-To: <46AF442C.6020904@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>	<46AF3E63.20204@voltaire.com>
	<46AF442C.6020904@mellanox.co.il>
Message-ID: <46AF457B.9040107@voltaire.com>

Tziporet Koren wrote:
> 1.2.c-10 supports FMRs and it will be available tomorrow

That's very important progress for iSER support, since unlike SRP we can't 
work without FMR. When are you planning to send the mlx4 FMR code for 
review on the general list? I guess this code is a candidate for 2.6.24, 
correct?

Or.


From mst at dev.mellanox.co.il  Tue Jul 31 07:22:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 17:22:35 +0300
Subject: [ofa-general] Re: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB
	support for	the bonding driver
In-Reply-To: <46AF44E0.50700@voltaire.com>
References: <46ADDB89.5030601@voltaire.com> <adaejipr3l5.fsf@cisco.com>
	<46AF3CA8.6050201@gmail.com>
	<20070731140436.GA16015@mellanox.co.il>
	<46AF44E0.50700@voltaire.com>
Message-ID: <20070731142234.GC16015@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver
> 
> Michael S. Tsirkin wrote:
> >Maybe we could use hard_header_cache/header_cache_update methods instead of
> >neighbour cleanup calls.
> 
> >To do this, it is possible that we'll have to switch from storing pointers
> >inside the neighbour to keeping an index there, but I expect the
> >performance impact to be minimal.
> >
> >As a result, bonding would not have to copy pointers into ipoib module
> >and module removal would get fixed.
> 
> To be precise, bonding will copy all the symbols it copies today from 
> the slave module (ipoib),
> see bond_setup_by_slave() in patch 3/7, except 
> for the neighbour cleanup callback which is copied through copying the 
> neigh_setup function.

Not really.
This copying of symbols is something that you added, isn't it?
So with this approach, it won't be needed.

It's always wrong to copy symbols from another module without
referencing it.
-- 
MST


From dotanb at dev.mellanox.co.il  Tue Jul 31 07:30:51 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 17:30:51 +0300
Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed
	from ib_verbs?
In-Reply-To: <20070731140532.GB16015@mellanox.co.il>
References: <46AF4157.3090007@dev.mellanox.co.il>
	<20070731140532.GB16015@mellanox.co.il>
Message-ID: <46AF479B.2000505@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>> This problem can be fixed by one of the following:
>> 1) add include to ib_umem.h in ib_verbs.h
>> 2) move this macro to ib_umem.h
>>
>> what do you think?
>>     
>
> 2 I guess.
>   
I will send a patch with the suggested option in a few minutes.

Dotan


From rdreier at cisco.com  Tue Jul 31 07:29:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 07:29:36 -0700
Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be
	removed from ib_verbs?
In-Reply-To: <46AF479B.2000505@dev.mellanox.co.il> (Dotan Barak's message of
	"Tue, 31 Jul 2007 17:30:51 +0300")
References: <46AF4157.3090007@dev.mellanox.co.il>
	<20070731140532.GB16015@mellanox.co.il>
	<46AF479B.2000505@dev.mellanox.co.il>
Message-ID: <adazm1coddb.fsf@cisco.com>

 >>> 2) move this macro to ib_umem.h

 >> 2 I guess.

 > I will send a patch with the suggested options in a few minutes.

Actually just move the macro to umem.c since that's the only place it
is (or should be) used anyway.

 - R.


From dotanb at dev.mellanox.co.il  Tue Jul 31 07:34:23 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 17:34:23 +0300
Subject: [ofa-general] [PATCH] include/rdma: move the macro
	IB_UMEM_MAX_PAGE_CHUNK to ib_umem.h
Message-ID: <200707311734.24055.dotanb@dev.mellanox.co.il>

After moving the struct ib_umem_chunk from the file ib_verbs.h to ib_umem.h
there isn't any reason for the macro IB_UMEM_MAX_PAGE_CHUNK to stay in
ib_verbs.h.

Signed-off-by: Dotan Barak <dotanb at dev.mellanox.co.il>

---

diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index c533d6c..69dea83 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -37,6 +37,11 @@
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
 
+#define IB_UMEM_MAX_PAGE_CHUNK						\
+	((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /	\
+	 ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -	\
+	  (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
+
 struct ib_ucontext;
 
 struct ib_umem {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0627a6a..43b4c97 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -731,11 +731,6 @@ struct ib_udata {
 	size_t       outlen;
 };
 
-#define IB_UMEM_MAX_PAGE_CHUNK						\
-	((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /	\
-	 ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -	\
-	  (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
-
 struct ib_pd {
 	struct ib_device       *device;
 	struct ib_uobject      *uobject;


From ogerlitz at voltaire.com  Tue Jul 31 07:36:05 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 31 Jul 2007 17:36:05 +0300
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support
 for	the	bonding driver
In-Reply-To: <20070731142234.GC16015@mellanox.co.il>
References: <46ADDB89.5030601@voltaire.com>
	<adaejipr3l5.fsf@cisco.com>	<46AF3CA8.6050201@gmail.com>	<20070731140436.GA16015@mellanox.co.il>	<46AF44E0.50700@voltaire.com>
	<20070731142234.GC16015@mellanox.co.il>
Message-ID: <46AF48D5.9000502@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:

>> To be precise, bonding will copy all the symbols it copies today from 
>> the slave module (ipoib), see bond_setup_by_slave() in patch 3/7

> Not really.
> This copying of symbols is something that you added, isn't it?
> So with this approach, it won't be needed.

> It's always wrong to copy symbols from another module without
> referencing it.

It's the --first-- time you have made this comment. Please suggest a 
different approach; the relevant code is below.

> +static void bond_setup_by_slave(struct net_device *bond_dev,
> +				struct net_device *slave_dev)
> +{
> +	bond_dev->hard_header	        = slave_dev->hard_header;
> +	bond_dev->rebuild_header        = slave_dev->rebuild_header;
> +	bond_dev->hard_header_cache	= slave_dev->hard_header_cache;
> +	bond_dev->header_cache_update   = slave_dev->header_cache_update;
> +	bond_dev->hard_header_parse	= slave_dev->hard_header_parse;
> +
> +	bond_dev->neigh_setup           = slave_dev->neigh_setup;
> +
> +	bond_dev->type		    = slave_dev->type;
> +	bond_dev->hard_header_len   = slave_dev->hard_header_len;
> +	bond_dev->addr_len	    = slave_dev->addr_len;
> +
> +	memcpy(bond_dev->broadcast, slave_dev->broadcast,
> +		slave_dev->addr_len);
> +}
> +
>  /* enslave device <slave> to bond device <master> */
>  int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  {
> @@ -1351,6 +1371,24 @@ int bond_enslave(struct net_device *bond
>  		goto err_undo_flags;
>  	}
>  
> +	/* set bonding device ether type by slave - bonding netdevices are
> +	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
> +	 * there is a need to override some of the type dependent attribs/funcs.
> +	 *
> +	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
> +	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
> +	 */
> +	if (bond->slave_cnt == 0) {
> +		if (slave_dev->type != ARPHRD_ETHER)
> +			bond_setup_by_slave(bond_dev, slave_dev);
> +	} else if (bond_dev->type != slave_dev->type) {
> +		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from "
> +			"other slaves (%d), can not enslave it.\n", slave_dev->name,
> +			slave_dev->type, bond_dev->type);
> +			res = -EINVAL;
> +			goto err_undo_flags;
> +	}
> +


From tziporet at dev.mellanox.co.il  Tue Jul 31 07:49:58 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 31 Jul 2007 17:49:58 +0300
Subject: [ofa-general] Re: [ewg] mlx4/fmr support
In-Reply-To: <46AF457B.9040107@voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>	<46AF3E63.20204@voltaire.com>
	<46AF442C.6020904@mellanox.co.il> <46AF457B.9040107@voltaire.com>
Message-ID: <46AF4C16.802@mellanox.co.il>

Or Gerlitz wrote:
> Tziporet Koren wrote:
>> 1.2.c-10 supports FMRs and it will be available tomorrow
>
> thats very important progress re iser support since unlike srp we 
> can't work without fmr. When are you planning to send the mlx4 FMR 
> code to review on the general list? 
Jack will send it tomorrow.
> I guess this is code candidate for 2.6.24, correct?
yes 

Tziporet


From mst at dev.mellanox.co.il  Tue Jul 31 07:48:27 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 17:48:27 +0300
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support
	for	the	bonding driver
In-Reply-To: <46AF48D5.9000502@voltaire.com>
References: <46ADDB89.5030601@voltaire.com> <adaejipr3l5.fsf@cisco.com>
	<46AF3CA8.6050201@gmail.com>
	<20070731140436.GA16015@mellanox.co.il>
	<46AF44E0.50700@voltaire.com>
	<20070731142234.GC16015@mellanox.co.il>
	<46AF48D5.9000502@voltaire.com>
Message-ID: <20070731144827.GB17331@mellanox.co.il>

> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver
> 
> Michael S. Tsirkin wrote:
> >>Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> 
> >>To be precise, bonding will copy all the symbols it copies today from 
> >>the slave module (ipoib), see bond_setup_by_slave() in patch 3/7
> 
> >Not really.
> >This copying of symbols is something that you added, isn't it?
> >So with this approach, it won't be needed.
> 
> >It's always wrong to copy symbols from another module without
> >referencing it.
> 
> Its the --first-- time you make this comment,

It's really a well known fact. That's where the crash
with modprobe -r comes from, right?

> please suggest a different approach,

I don't know, really - if you want to access a module, you really must get
a reference to it, or to the device.
How about adding the module pointer to struct net_device?

>the relevant code is below.

>+static void bond_setup_by_slave(struct net_device *bond_dev,
>+				struct net_device *slave_dev)
>+{
>+	bond_dev->hard_header	        = slave_dev->hard_header;
>+	bond_dev->rebuild_header        = slave_dev->rebuild_header;
>+	bond_dev->hard_header_cache	= slave_dev->hard_header_cache;
>+	bond_dev->header_cache_update   = slave_dev->header_cache_update;
>+	bond_dev->hard_header_parse	= slave_dev->hard_header_parse;
>+
>+	bond_dev->neigh_setup           = slave_dev->neigh_setup;
>+
>+	bond_dev->type		    = slave_dev->type;
>+	bond_dev->hard_header_len   = slave_dev->hard_header_len;
>+	bond_dev->addr_len	    = slave_dev->addr_len;
>+
>+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
>+		slave_dev->addr_len);
>+}
>+

Hmm, it seems that switching to hard_header_cache as I suggested won't help at all.
I wonder: is bonding currently broken with devices that implement
hard_header_cache/header_cache_update?

-- 
MST


From ogerlitz at voltaire.com  Tue Jul 31 07:57:46 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 31 Jul 2007 17:57:46 +0300
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support
 for	the	bonding driver
In-Reply-To: <20070731144827.GB17331@mellanox.co.il>
References: <46ADDB89.5030601@voltaire.com>
	<adaejipr3l5.fsf@cisco.com>	<46AF3CA8.6050201@gmail.com>	<20070731140436.GA16015@mellanox.co.il>	<46AF44E0.50700@voltaire.com>	<20070731142234.GC16015@mellanox.co.il>	<46AF48D5.9000502@voltaire.com>
	<20070731144827.GB17331@mellanox.co.il>
Message-ID: <46AF4DEA.9050202@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz <ogerlitz at voltaire.com>:
>> Michael S. Tsirkin wrote:

>>> It's always wrong to copy symbols from another module without
>>> referencing it.
>> Its the --first-- time you make this comment,

> It's really a well known fact. That's where the crash
> with modprobe -r comes from, right?

No, the crash --only-- comes from the neighbour cleanup function being 
called after ipoib has been removed from the kernel. The other symbols 
are not problematic. I got positive feedback that this --is-- the 
problem in the previous posts and from Roland during my Sonoma presentation.

>> please suggest a different approach,

> I don't know, really - if you want to access a module, you really must get
> a reference to it, or to the device.
> How about adding the module pointer to struct net_device?

I think there used to be an owner field of type struct module there, and 
it was removed... we will check that.

>> the relevant code is below.
> 
>> +static void bond_setup_by_slave(struct net_device *bond_dev,
>> +				struct net_device *slave_dev)
>> +{
>> +	bond_dev->hard_header	        = slave_dev->hard_header;
>> +	bond_dev->rebuild_header        = slave_dev->rebuild_header;
>> +	bond_dev->hard_header_cache	= slave_dev->hard_header_cache;
>> +	bond_dev->header_cache_update   = slave_dev->header_cache_update;
>> +	bond_dev->hard_header_parse	= slave_dev->hard_header_parse;
>> +
>> +	bond_dev->neigh_setup           = slave_dev->neigh_setup;
>> +
>> +	bond_dev->type		    = slave_dev->type;
>> +	bond_dev->hard_header_len   = slave_dev->hard_header_len;
>> +	bond_dev->addr_len	    = slave_dev->addr_len;
>> +
>> +	memcpy(bond_dev->broadcast, slave_dev->broadcast,
>> +		slave_dev->addr_len);
>> +}
>> +
> 
> Hmm, it seems that switching to hard_header_cache as I suggested won't help at all.

why? please clarify.

> I wonder: is bonding currently broken with devices that implement
> hard_header_cache/header_cache_update?

I don't think so. Note that bond_setup_by_slave is only called for 
slaves whose ether type is --not-- Ethernet.
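
The rule from the patch can be sketched in userspace as below. This is only an illustration under assumptions: the sketch_* structs are simplified stand-ins for struct net_device and the bonding private data, and only the type, address length and broadcast address are copied, not the header function pointers.

```c
#include <string.h>

#define SKETCH_ARPHRD_ETHER       1   /* same value as ARPHRD_ETHER */
#define SKETCH_ARPHRD_INFINIBAND 32   /* same value as ARPHRD_INFINIBAND */
#define SKETCH_MAX_ADDR_LEN      32

struct sketch_dev {                   /* stand-in for struct net_device */
	unsigned short type;
	unsigned char  addr_len;
	unsigned char  broadcast[SKETCH_MAX_ADDR_LEN];
};

struct sketch_bond {                  /* stand-in for the bond state */
	struct sketch_dev dev;            /* starts out as an Ethernet device */
	int               slave_cnt;
};

/* Returns 0 on success, -1 when the slave type differs from the bond type. */
static int sketch_enslave(struct sketch_bond *bond,
			  const struct sketch_dev *slave)
{
	if (bond->slave_cnt == 0) {
		/* First slave: a non-Ethernet slave overrides the bond's type. */
		if (slave->type != SKETCH_ARPHRD_ETHER) {
			bond->dev.type     = slave->type;
			bond->dev.addr_len = slave->addr_len;
			memcpy(bond->dev.broadcast, slave->broadcast,
			       slave->addr_len);
		}
	} else if (bond->dev.type != slave->type) {
		/* Later slaves must match the type fixed by the first one. */
		return -1;
	}
	bond->slave_cnt++;
	return 0;
}
```

With an Ethernet first slave nothing is copied, so existing Ethernet bonds keep the attributes set at creation time.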

Or.


From swise at opengridcomputing.com  Tue Jul 31 08:16:27 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 31 Jul 2007 10:16:27 -0500
Subject: [ofa-general] ofed kernel git trees.
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com>
References: <46AC65E5.5050404@voltaire.com><20070730084616.GE9963@mellanox.co.il><46ADC0FF.2080000@voltaire.com>
	<20070730141155.GB7360@mellanox.co.il>
	<6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com>
Message-ID: <46AF524B.60603@opengridcomputing.com>

Vlad,

Which git tree should I base my work on for OFED 1.2 development?

I've always used

git://git.openfabrics.org/~vlad/ofed_1_2/.git

But there is also:

git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git://git.openfabrics.org/~vlad/ofed_kernel.git


Which should I use for 1.2 and 1.2.c?

Thanks,

Steve.


From vlad at mellanox.co.il  Tue Jul 31 08:21:35 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 31 Jul 2007 18:21:35 +0300
Subject: [ofa-general] RE: ofed kernel git trees.
In-Reply-To: <46AF524B.60603@opengridcomputing.com>
References: <46AC65E5.5050404@voltaire.com><20070730084616.GE9963@mellanox.co.il><46ADC0FF.2080000@voltaire.com>
	<20070730141155.GB7360@mellanox.co.il>
	<6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com>
	<46AF524B.60603@opengridcomputing.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF13C1@mtlexch01.mtl.com>

> Which git tree should I be based against for ofed 1.2 development?
> 
> I've always used
> 
> git://git.openfabrics.org/~vlad/ofed_1_2/.git
> 
> But there is also:
> 
> git://git.openfabrics.org/ofed_1_2/linux-2.6.git

~vlad/ofed_1_2/.git is a symbolic link to ofed_1_2/linux-2.6.git

> git://git.openfabrics.org/~vlad/ofed_kernel.git

~vlad/ofed_kernel.git will be used for OFED-1.3

> 
> Which should I use for 1.2 and 1.2.c?
> 

for 1.2 you should use git://git.openfabrics.org/ofed_1_2/linux-2.6.git
(branch ofed_1_2)
for 1.2.c you should use
git://git.openfabrics.org/ofed_1_2/linux-2.6.git (branch ofed_1_2_c)

Regards,
Vladimir


From sashak at voltaire.com  Tue Jul 31 09:02:23 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 19:02:23 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A89608.9010709@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<20070723002010.GU27878@sashak.voltaire.com>
	<46A89608.9010709@dev.mellanox.co.il>
Message-ID: <20070731160223.GF29844@sashak.voltaire.com>

Hi Yevgeny,

On 15:39 Thu 26 Jul     , Yevgeny Kliteynik wrote:
> >>
> >>  * Comments may appear only in a separate line
> > Why? What is wrong with:
> > 	port-name: vs1/HCA-1/P1   # my best port
> 
>  I can use this too, but then the pound sign, wherever it will
>  appear, would mean commentary start. No \# or something like this
>  to include it in some other place - I don't want to complicate the
>  syntax. Sounds OK?

Are we planning to use '#' somewhere?

Anyway this comment is minor.
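
If '#' starts a comment wherever it appears, with no escaping, the per-line handling stays trivial. A minimal sketch, assuming the parser works on one NUL-terminated line at a time (the function name is made up for illustration and is not OpenSM code):

```c
#include <string.h>

/*
 * Cut the line at the first '#', then trim trailing whitespace.
 * There is no escape syntax, so '#' can never be part of a value.
 */
static void sketch_strip_comment(char *line)
{
	char *hash = strchr(line, '#');
	size_t len;

	if (hash)
		*hash = '\0';

	len = strlen(line);
	while (len > 0 && (line[len - 1] == ' '  || line[len - 1] == '\t' ||
			   line[len - 1] == '\n' || line[len - 1] == '\r'))
		line[--len] = '\0';
}
```

A line that holds only a comment becomes empty and can be skipped by the caller.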

> >>      end-port-groups
> > I agree that proposed syntax has better for human readability than pure
> > XML, but isn't stuff like this will be more user-friendly?
> > Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ;
> > , or
> > Storage "Free Text description" { 0x10001, 0x10002, 0x10003 };
> > , or
> > Storage "Free Text description": ROUTERS, CAS ;
> 
>  GUID list is a good idea.
>  Not sure about the other stuff. A certain port group can be defined
>  both by guids and by node-types. How about this:
> 
>            port-group
>                name: routers_and_mgt_nodes
>                use: all routers and management nodes
>                node-type: ROUTER
>                port-guid: 0x10001, 0x10002, 0x10003
>            end-port-group

I think it is doable too, like: 0x10001, 0x10002, 0x10003, ROUTER
I guess it should be easy to parse GUIDs, names and special names (like
ROUTER) in one line. Not sure it must be so, just a thought...

> >>      qos-levels
> >>
> >>          # the first one is just setting SL
> >>          qos-level
> >>              use: for the lowest priority communication
> >>              sl: 15
> >>              packet-life: 16
> >>          end-qos-level
> >>          # the second sets SL and QoS Class
> >>          qos-level
> >>              use: low latency best bandwidth
> >>              sl: 0
> >>          end-qos-level
> >>          # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path 
> >>  Bits
> >>          qos-level
> >>              use: just an example
> >>              sl: 0
> >>              mtu-limit: 1
> >>              rate-limit: 1
> >>              packet-life: 12
> >>              # Path Bits can be used e.g. to provide a different routes  
> >> through the
> >>              # subnet to a particular port
> >>              path-bits: 2,4,8-32
> >>          end-qos-level
> >>
> >>      end-qos-levels
> >>
> >>
> >>      # Match rules are scanned in a first-fit manner (like firewall rules  
> >> table)
> >>      qos-match-rules
> >>
> >>          # matching by single criteria: class (list of values and ranges)
> >>          qos-match-rule
> >>              # just a description
> >>              use: low latency by class 7-9 or 11
> >>              qos-class: 7-9,11
> >>              # number of qos-level to apply to the matching PR/MPR
> >>              qos-level-sn: 1
> > Isn't it better and less error prone to match qos_level by name and not
> > by sequential number?
> 
>  qos-level can have name, and then qos-match-rule will refer to this name.
>  But matching qos-level by sequential number makes it really easy to locate
>  the referred qos-level, which is important, as every PR/MPR request would
>  go through this process, so saving some runtime in this area is important 
>  IMHO.

Sure, it is important, but I'm not talking about the internal data
representation; internally this should be a fast reference, by index or
directly by pointer. But in the file it would be easier for the user to
have names (numbers could be used as names too) instead of just serial
numbering, so the user will not need to count lines.
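
A rough sketch of what I mean: the name is resolved to an index once, when the policy file is parsed, and the PR/MPR path only ever uses the index. The struct and function names here are invented for the example, not existing OpenSM code.

```c
#include <string.h>

struct sketch_qos_level {
	const char *name;   /* name used in the policy file */
	int         sl;
};

/*
 * Called once per qos-match-rule at parse time. Returns the index of
 * the named level, or -1 if the file refers to an unknown level.
 */
static int sketch_resolve_level(const struct sketch_qos_level *levels,
				int n_levels, const char *name)
{
	int i;

	for (i = 0; i < n_levels; i++)
		if (strcmp(levels[i].name, name) == 0)
			return i;
	return -1;
}
```

The linear search cost is paid only while parsing, and an unknown name can be reported as a policy file error at that point.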


> >>  9. OpenSM features
> >>  -------------------
> >>  The QoS related functionality to be provided by OpenSM can be split into 
> >> two
> >>  main parts:
> >>
> >>  3.1. Fabric Setup
> >>  During fabric initialization the SM should parse the policy and apply its
> >>  settings to the discovered fabric elements. The following actions should 
> >> be
> >>  performed:
> >>  * Parsing of policy
> >>  * Node Group identification. Warning should be provided for each node not
> >>    specified but found.
> >>  * SL2VL settings validation should be checked:
> >>    + A warning will be provided if there are no matching targets for the  
> >> SL2VL
> >>      setting statement.
> >>    + An error message will be printed to the log file if an invalid 
> >> setting  is
> >>      found. A setting is invalid if it refers to:
> >>      - Non existing port numbers of the target devices
> >>      - Unsupported VLs for the target device. In the later case the map to 
> >>  non
> >>        existing VLs should be replaced to VL15 i.e. packets will be 
> >> dropped.
> > I'm not sure it is optimal. We could have well documented or even
> > configurable mapping rule instead, then this will not limit devices with
> > higher capabilities.
> 
>  I'm open for suggestions.

A rule like %(number of OpVLs)? Or even better, a configurable mapping
rule?
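
Something like this sketch, where the policy names and the function are hypothetical and op_vls is assumed to be the number of operational data VLs on the target port:

```c
enum sketch_vl_policy {
	SKETCH_VL_DROP,     /* RFC behaviour: unsupported VL maps to VL15 */
	SKETCH_VL_MODULO    /* alternative: wrap into the supported range */
};

#define SKETCH_VL15 15u

/* Map a requested VL onto a port that supports VLs 0 .. op_vls - 1. */
static unsigned int sketch_map_vl(unsigned int vl, unsigned int op_vls,
				  enum sketch_vl_policy policy)
{
	if (vl < op_vls)
		return vl;                  /* supported as requested */
	if (policy == SKETCH_VL_MODULO && op_vls > 0)
		return vl % op_vls;         /* fold onto a supported VL */
	return SKETCH_VL15;                 /* packets will be dropped */
}
```

With the modulo policy a device with fewer VLs still carries the traffic, and devices with more VLs are not limited by the weakest one.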

> >>  * Only PR/MPR fields that have their component mask bit set should be
> >>    compared.
> >>  * For a rule to be "matching" a PR/MPR request all the rule fields should 
> >> be
> >>    "matching" their PR/MPR fields. Such that a PR/MPR request that does
> >>    not have a component mask field set for one of the rule defined fields  
> >>  can
> >>    not match that rule.
> >>  * A PR/MPR request that have a component mask bit set for one of the 
> >> fields
> >>    that is not defined by the rule can match the rule.
> > Aren't last two too restrictive? SA can just to filter-out paths in
> > response to match rest of the rule. No?
> 
>  Not sure I'm following.
>  The last bullet is not restrictive at all

Right, but I mostly mean the previous bullet, where the client _must_ set
the component mask bits for all of the rule's fields in order to match.
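
To make the two bullets concrete, here is a simplified sketch of the matching as the RFC describes it. The field set and all names are assumptions for the example, and each field holds a single value, while real rules use lists and ranges.

```c
#include <stdint.h>

#define SKETCH_NFIELDS 3
enum {
	SKETCH_F_SERVICE_ID = 0,
	SKETCH_F_QOS_CLASS  = 1,
	SKETCH_F_PKEY       = 2
};

/* Used both for a rule and for a PR/MPR request. */
struct sketch_fields {
	uint32_t mask;                    /* bit i set: field i is defined */
	uint64_t value[SKETCH_NFIELDS];
};

/* Returns 1 if the request matches the rule, 0 otherwise. */
static int sketch_rule_matches(const struct sketch_fields *rule,
			       const struct sketch_fields *req)
{
	int i;

	for (i = 0; i < SKETCH_NFIELDS; i++) {
		if (!(rule->mask & (1u << i)))
			continue;   /* rule ignores this field */
		if (!(req->mask & (1u << i)))
			return 0;   /* request left a rule field unset */
		if (rule->value[i] != req->value[i])
			return 0;   /* both set, values differ */
	}
	return 1;
}
```

The second return is the restrictive part: a request that leaves qos-class unset can never match a rule that defines qos-class, even if the SA could have filtered the returned paths instead.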

Sasha


From changquing.tang at hp.com  Tue Jul 31 09:12:09 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Tue, 31 Jul 2007 16:12:09 -0000
Subject: [ofa-general] Scalable reliable connection
In-Reply-To: <20070730125054.GO9963@mellanox.co.il>
References: <20070730125054.GO9963@mellanox.co.il>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net>


A send queue can only serve at most J jobs within a node. Is it possible to
make a single send queue serve all jobs on all nodes?

--CQ 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Michael S. Tsirkin
> Sent: Monday, July 30, 2007 7:51 AM
> To: Gleb Natapov
> Cc: Pavel Shamis; ewg at lists.openfabrics.org; Michael S. 
> Tsirkin; general at lists.openfabrics.org; Ishai Rabinovitz
> Subject: [ofa-general] Scalable reliable connection
> 
> 
> Here's some background on what SRC is.  This is basically 
> slide 6 in Dror's talk, for those that missed the talk.
> 
>  * * *
> 
> SRC is an extension supported by recent Mellanox hardware 
> which is geared toward reducing the number of QPs required 
> for all-to-all communication on systems with a high number of 
> jobs per node.
> 
> ===================================================================
> Motivation:
> ===================================================================
> Given N nodes with J jobs per node, number of QPs required 
> for all-to-all communication is:
> 
> With RC:
> 		O((N * J) ^ 2)
> 
> 	Since each job out of O(N * J) jobs must create a single QP
> 	to communicate with each one of O(N * J) other jobs.
> 
> With SRC:
> 		O(N ^ 2 * J)
> 
> 	This is achived by using a single send queue (per job, 
> out of O(N * J) jobs)
> 	to send data to all J jobs running on a specific node 
> (out of O(N) nodes).
> 	Hardware uses new "SRQ number" field in packet header to
> 	multiplex receive WRs and WCs to private memory of each job.
> 
> This is similiar idea to IB RD.
> Q: Why not use RD then?
> A: Because no hardware supports it.
> 
> Details:
> 
> ===================================================================
> Verbs extension:
> ===================================================================
> 
> - There is a new transport/QP type "SRC".
> - There is a new object type "SRC domain"
> - Each SRQ gets new (optional) attributes:
>         SRC domain
> 	SRC SRQ number
>         SRC CQ
>   SRQ must have either all 3 of these or none of these attributes
> 
> - QPs of type SRC have all the same attributes as regular RC QPs
>   connected to SRQ, except that:
>   A. Each SRC QP has a new required attribute "SRC domain"
>   B. SRC QPs do *not* have "SRQ" attribute
>   	(do not have a specific SRQ associated with them)
> 
> ===================================================================
> Protocol extension:
> ===================================================================
> SRC QP behaviour: Requestor
> - Post send WR for this QP type is extended with SRQ number field
>   This number is sent as part of packet header
> - SRC Packets follow rules for RC packets on the wire, exactly
>   What is different is their handling at the responder side
> 
> SRC QP behaviour: Responder
> Each incoming packet passes transport checks with respect to 
> the SRC QP, following RC rules, exactly.
> 
> After this, SRQ number in packet header is used to look up a 
> specific SRQ. SRC domain of the resulting SRQ must be equal 
> to SRC domain of the QP, otherwise a NAK is sent, and QP 
> moves to error state.
> 
> If the SRC domains match, receive WR and receive WC 
> processing are as follows:
> 
> - RC Send
>   - Rather than using SRQ to which the QP is attached,
>     SRQ is looked up by SRQ number in the packet.
>     Receive WR is taken from this SRQ.
>   - Completions are generated on the CQ specified in the SRQ
> 
> - RDMA/Atomic
>   - Rather than using PD to which the QP is attached,
>     SRQ is looked up by SRQ number in the packet.
>     PD of this SRQ is used for protection checks.
> ===================================================================
>  
> --
> MST
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 
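
As a worked illustration of the Motivation section quoted above, the sketch below turns the two big-O figures into concrete cluster-wide counts. The exact formulas are an assumption made for the example (no connection to self, and with SRC one send queue per job per remote node); the quoted text only gives the orders of growth.

```c
#include <stdint.h>

/* RC: each of the N*J jobs needs a QP to every other job. */
static uint64_t sketch_rc_qps(uint64_t nodes, uint64_t jobs_per_node)
{
	uint64_t jobs = nodes * jobs_per_node;

	return jobs ? jobs * (jobs - 1) : 0;
}

/* SRC: each of the N*J jobs needs one send queue per remote node. */
static uint64_t sketch_src_qps(uint64_t nodes, uint64_t jobs_per_node)
{
	uint64_t jobs = nodes * jobs_per_node;

	return nodes ? jobs * (nodes - 1) : 0;
}
```

For 4 nodes with 8 jobs each this gives 32 * 31 = 992 QPs with RC and 32 * 3 = 96 with SRC, so the saving grows roughly with J.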


From mshefty at ichips.intel.com  Tue Jul 31 09:15:07 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 09:15:07 -0700
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46ADAA85.8070106@voltaire.com>
References: <46A46A1D.6040000@voltaire.com>
	<46A4EF00.9070305@ichips.intel.com>	<46A5C8E6.5020906@voltaire.com>
	<46A628D8.4050109@ichips.intel.com>	<46A6F50C.5000906@voltaire.com>
	<46A78146.1090304@ichips.intel.com>	<46A846FC.5040704@voltaire.com>
	<46A8D80C.1090305@ichips.intel.com>	<20070726181132.GO19768@obsidianresearch.com>	<46AC509B.6020206@voltaire.com>	<20070729173232.GA14867@obsidianresearch.com>
	<46ADAA85.8070106@voltaire.com>
Message-ID: <46AF600B.2040904@ichips.intel.com>

> Indeed. The argument I was trying to make is that arp cache invalidation 
>  requires IPoIB PR cache invalidation, this handles 100% of the cases, 
> including the 10% not covered by doing cache invalidation based only on 
>  IB events such as port up / sm lid change / sm reregister / etc

ARP cache invalidation does not require, nor does it actually do IPoIB 
PR cache invalidation.  We can argue whether or not it should, but the 
two are not linked together today.

The local SA updates paths either in response to an event: LID change, 
port state change, GID in/out of service, etc., or when refreshed via a 
module parameter.  A refresh can occur in response to an administrative 
event, when told to by a system administrator, before executing large 
jobs, periodically based on a timer, or whenever else.  That policy is 
outside the scope of the proposed patches, but covers all other 
potential cases where the cache must be updated.

I like the advantages of keeping the local SA entirely in user space, 
but there are issues that need to be worked through first.  And 
implementation wise, it's unlikely to give us anything that remains in 
sync any better than what's already been proposed without the use of 
non-standard extensions.

- Sean


From mst at dev.mellanox.co.il  Tue Jul 31 09:15:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 31 Jul 2007 19:15:52 +0300
Subject: [ofa-general] Re: Scalable reliable connection
In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net>
References: <20070730125054.GO9963@mellanox.co.il>
	<349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net>
Message-ID: <20070731161552.GB5743@mellanox.co.il>

> Quoting Tang, Changqing <changquing.tang at hp.com>:
> Subject: RE: Scalable reliable connection
> 
> 
> A send queue can only serve max J jobs within a node. Is it possible to
> make a single send queue to serve all jobs on all nodes ?

How do you propose to do this?

-- 
MST


From changquing.tang at hp.com  Tue Jul 31 09:21:13 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Tue, 31 Jul 2007 16:21:13 -0000
Subject: [ofa-general] RE: Scalable reliable connection
In-Reply-To: <20070731161552.GB5743@mellanox.co.il>
References: <20070730125054.GO9963@mellanox.co.il>
	<349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net>
	<20070731161552.GB5743@mellanox.co.il>
Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301EDB33D@G3W0634.americas.hpqcorp.net>

 
In this way, only one send queue is needed for each job (process), and we
don't need to track the location of each other job (which job is on which
node).
From a job's point of view, there is only itself and the others, and all
the others are "equal"...

--CQ

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
> Sent: Tuesday, July 31, 2007 11:16 AM
> To: Tang, Changqing
> Cc: Michael S. Tsirkin; Gleb Natapov; Pavel Shamis; 
> ewg at lists.openfabrics.org; general at lists.openfabrics.org; 
> Ishai Rabinovitz
> Subject: Re: Scalable reliable connection
> 
> > Quoting Tang, Changqing <changquing.tang at hp.com>:
> > Subject: RE: Scalable reliable connection
> > 
> > 
> > A send queue can only serve max J jobs within a node. Is it 
> possible 
> > to make a single send queue to serve all jobs on all nodes ?
> 
> How do you propose to do this?
> 
> --
> MST
> 


From mshefty at ichips.intel.com  Tue Jul 31 09:25:37 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 09:25:37 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A94657.1020101@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<46A94657.1020101@ichips.intel.com>
Message-ID: <46AF6281.10709@ichips.intel.com>

FYI - It is my intention to implement the host side portion of QoS 
support.  (It's one of my path forward objectives.)  I plan on 
implementing the host side as outlined below.  If anyone has any 
comments, I would like to get them as soon as possible.

- Sean

Sean Hefty wrote:
>> 2. Architecture ----------------
> 
> This is a higher level approach to the problem, but I came up with the
> following QoS relationship hierarchy, where '->' means 'maps to'.
> 
> Application Service -> Service ID (or range)
> Service ID -> desired QoS
> QoS, SGID, DGID, PKey -> SGID, DGID, TClass, FlowLabel, PKey
> SGID, DGID, TC, FL, PKey -> SLID, DLID, SL (set if crossing subnets)
> SLID, DLID, SL -> MTU, Rate, VL, PacketLifeTime
> 
> I use these relationships below:
> 
>> 4. IPoIB ---------
>>
>> IPoIB already query the SA for its broadcast group information. The 
>> additional functionality required is for IPoIB to provide the
>> broadcast group SL, MTU, and RATE in every following PathRecord query
>> performed when a new UDAV is needed by IPoIB. We could assign a
>> special Service-ID for IPoIB use but since all communication on the
>> same IPoIB interface shares the same QoS-Level without the ability to
>>  differentiate it by target service we can ignore it for simplicity.
> 
> Rather than IPoIB specifying SL, MTU, and rate with PR queries, it 
> should specify TClass and FlowLabel.  This is necessary for IPoIB to 
> span IB subnets.
> 
>> 5. CMA features ----------------
>>
>> The CMA interface supports Service-ID through the notion of port
>> space as a prefixes to the port_num which is part of the sockaddr
>> provided to rdma_resolve_add(). What is missing is the explicit
>> request for a QoS-Class that should allow the ULP (like SDP) to
>> propagate a specific request for a class of service. A mechanism for
>> providing the QoS-Class is available in the IPv6 address, so we could
>> use that address field. Another option is to implement a special 
>> connection options API for CMA.
>>
>> Missing functionality by CMA is the usage of the provided QoS-Class
>> and Service-ID in the sent PR/MPR. When a response is obtained it is
>> an existing requirement for the CMA to use the PR/MPR from the
>> response in setting up the QP address vector.
> 
> I think the RDMA CM needs two solutions, depending on which address 
> family is used.  For IPv6, the existing interface is sufficient, and 
> works for both IB and iWarp.  The RDMA CM only needs to include the TC 
> and FL as part of its PR query.  For IPv4, to remain transport neutral, 
> I think we should add an rdma_set_option() routine to specify the QoS 
> field.  The RDMA CM would include the QoS field for PR query under this 
> condition.
> 
> For IB, this requires changes to the ib_sa to support the new PR 
> extensions.  I don't think we gain anything having the RDMA CM include 
> service IDs as part of the query.
> 
>> 6. SDP -------
>>
>> SDP uses CMA for building its connections. The Service-ID for SDP is
>> 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote
>> TCP/IP Port Number to connect to. SDP might be provided with
>> SO_PRIORITY socket option. In that case the value provided should be
>> sent to the CMA as the TClass option of that connection.
> 
> SDP would use specify the QoS through the IPv6 address or 
> rdma_set_option() routine.
> 
>> 7. SRP -------
>>
>> Current SRP implementation uses its own CM callbacks (not CMA). So
>> SRP should fill in the Service-ID in the PR/MPR by itself and use
>> that information in setting up the QP. The T10 SRP standard defines
>> the SRP Service-ID to be defined by the SRP target I/O Controller
>> (but they should also comply with IBTA Service- ID rules). Anyway,
>> the Service-ID is reported by the I/O Controller in the ServiceEntries 
>> DMA attribute and should be used in the PR/MPR if the
>> SA reports its ability to handle QoS PR/MPRs.
> 
> I agree.
> 
>> 8. iSER -------- iSER uses CMA and thus should be very close to SDP.
>> The Service-ID for iSER should be TBD.
> 
> See RDMA CM and SDP.
> 
>> 3.2. PR/MPR query handling: OpenSM should be able to enforce the
>> provided policy on client request. The overall flow for such requests
>> is: first the request is matched against the defined match rules such
>> that the target QoS-Level definition is found. Given the QoS-Level a
>> path(s) search is performed with the given restrictions imposed by
>> that level. The following two sections describe these steps.
> 
> If we use the QoS hierarchy outlined above, I think we can construct 
> some fairly simple tables to guide our PR selection.  The SA may need to 
> construct the tables starting at the bottom and working up, but I 
> *think* it could be done.  And by distributing the tables, we can 
> support a more distributed (a la local SA) operation.
> 
> From an administration standpoint, I would be happier seeing something where
> the administrator defines a QoS level in terms of latency or bandwidth 
> requirements and relative priority.  Then, if desired, the administrator 
> could provide more details, such as indicating which nodes would use 
> which services, minimum required MTUs, etc.  It would then be up to the 
> SA to map these requirements to specific TC, FL, SL, VL values.
> 
> In general, though, I'm personally far less concerned with the QoS 
> specification interface to the SA than with the operation that takes place 
> on the hosts.
> 
> Comments on using this approach on the host side?


From hal.rosenstock at gmail.com  Tue Jul 31 09:27:58 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 31 Jul 2007 12:27:58 -0400
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
References: <AcfElXnvIIwJJbMaQjCvFz5bwcoaYw==>
	<6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
Message-ID: <f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>

On 7/12/07, Tziporet Koren <tziporet at mellanox.co.il> wrote:
>
>
> Hi All,
>
> OFED 1.2.c-9 is available now on the OFA server under:
> http://www.openfabrics.org/builds/connectx/release/
> Note: this release was tested with FW 2.1.000 that will soon be available on
> Mellanox web site for download.
>
> Supported Platforms and Operating Systems
> =================================
> o CPU architectures:
>         - x86_64
>         - x86
>         - ppc64
>         - ia64
>
> o Linux Operating Systems:
>         - RedHat EL4 up3: 2.6.9-34.ELsmp
>         - RedHat EL4 up4: 2.6.9-42.ELsmp
>         - RedHat EL4 up5: 2.6.9-55.ELsmp
>         - RedHat EL5: 2.6.18-8.el5
>         - SLES10: 2.6.16.21-0.8-smp
>         - kernel.org: 2.6.20.x
>         - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)
>
> Main changes from OFED 1.2.c-8:
> =========================
> 1. Kernel oops in IPoIB on restart of the driver.
> 2. IPoIB CM is now the default.
> 3. MPI with SRQ is supported.
> 4. Itanium is now supported.
>
> mlx4 Fixed Bugs and Enhancements
> ===========================
> - Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428.
> - Query QP and query SRQ are now supported.
> - Internal error flow was added.
> - Number of QPs that can be attached to the same multicast group was
> increased to 56.
> - SRQ is now supported.
> - Fork is now supported.
>
> ConnectX specific known issues and limitations
> ===================================
> - The following commands and/or features are not supported:
>   o Resize CQ
>   o FMRs
>   o APM
>   o SQD
> - ibstat does not present all entries. Use ibv_devinfo instead.

What is missing from ibstat for ConnectX? Which entries are missing?

-- Hal

> - To load the driver on machines with 64KB default page size UAR bar must be
>   enlarged. 64KB page size is the default of PPC with RHEL5 and Itanium with
>   64KB page size enabled.
>   Perform the following three steps:
>   1. Add the following line in the firmware configuration (INI) file under
>      the [HCA] section:
>        log2_uar_bar_megabytes = 5
>   2. Burn a modified firmware image with the changed INI file
>   3. Reboot the system
>
>
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>


From mshefty at ichips.intel.com  Tue Jul 31 09:54:04 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 09:54:04 -0700
Subject: [ofa-general] SDP kernel Oops.
In-Reply-To: <004201c7d2de$6d1dca30$47595e90$@rr.com>
References: <46AE183F.5090907@psc.edu> <004201c7d2de$6d1dca30$47595e90$@rr.com>
Message-ID: <46AF692C.4090302@ichips.intel.com>

>   It appears that this is an illegal instruction (illegal operand) trap in a
> modified Rhat4U4 kernel. I am not sure about the line number, but perhaps
> sdp_cma_handler() is processing an RDMA_CM_EVENT_ROUTE_RESOLVED event. 

Based on the backtrace, this should be an RDMA_CM_EVENT_CONNECT_REQUEST 
event.

I would verify that whatever structure is associated with a 
listening rdma_cm_id remains valid until after the listen has been 
destroyed.

- Sean


From sashak at voltaire.com  Tue Jul 31 10:26:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 20:26:36 +0300
Subject: [ofa-general] [PATCH] opensm: remove lft setup hack
Message-ID: <20070731172636.GH29844@sashak.voltaire.com>


This removes the hack where OpenSM's lfts were updated by ucast_mgr
and not only from the network. It was once needed for the dumping functions,
which use the data from the lft, but the dumping has now moved to the end of
the sweep, when all lfts are up to date.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_mgr.c |    7 -------
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index cfe1a58..b90509a 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -508,13 +508,6 @@ osm_ucast_mgr_set_fwd_table(
     else
     {
       p_mgr->any_change = TRUE;
-      /*
-        HACK: for now we will assume we succeeded to send
-        and set the local DB based on it. This should allow
-        us to immediatly dump out our routing.
-      */
-      osm_switch_set_ft_block(
-        p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho );
     }
   }
 
-- 
1.5.3.rc2.29.gc4640f


From sashak at voltaire.com  Tue Jul 31 10:33:20 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 20:33:20 +0300
Subject: [ofa-general] [PATCH] opensm: report new ports before handover
	mastership
In-Reply-To: <ac71172a0707270747y77ae14eflf7268b2581d113bd@mail.gmail.com>
References: <ac71172a0707250957u6148b638s826a560ec013d3e0@mail.gmail.com>
	<20070725220204.GI31582@sashak.voltaire.com>
	<ac71172a0707261237wb833b1bq66c64ca39fb3c321@mail.gmail.com>
	<20070727025952.GE6691@sashak.voltaire.com>
	<ac71172a0707270747y77ae14eflf7268b2581d113bd@mail.gmail.com>
Message-ID: <20070731173320.GI29844@sashak.voltaire.com>


This adds reporting of new ports (with trap 64) just before mastership
handover, since the new master does not report new ports in its first sweep.

Pointed out by: lbt (Lan) <transter at gmail.com>

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_state_mgr.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index a6d0e24..1cf6257 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -2168,6 +2168,9 @@ Idle:
                p_remote_sm = __osm_state_mgr_get_highest_sm( p_mgr );
                if( p_remote_sm != NULL )
                {
+                  /* report new ports (trap 64) before leaving MASTER */
+                  __osm_state_mgr_report_new_ports( p_mgr );
+
                   /* need to handover the mastership
                    * to the remote sm, and move to standby */
                   __osm_state_mgr_send_handover( p_mgr, p_remote_sm );
-- 
1.5.3.rc2.29.gc4640f


From rdreier at cisco.com  Tue Jul 31 10:41:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 10:41:03 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <20070731160223.GF29844@sashak.voltaire.com> (Sasha Khapyorsky's
	message of "Tue, 31 Jul 2007 19:02:23 +0300")
References: <46A283B6.1070105@dev.mellanox.co.il>
	<20070723002010.GU27878@sashak.voltaire.com>
	<46A89608.9010709@dev.mellanox.co.il>
	<20070731160223.GF29844@sashak.voltaire.com>
Message-ID: <adavec0o4i8.fsf@cisco.com>

I think that defining a new file format is really going in the wrong
direction.  XML would make a lot of sense (and you could use something
like RELAX NG to define the schema very readably and precisely).  XML
has the advantage that many parsers, GUI editors, and other tools are
already widely available.

If you don't like XML for whatever reason, please at least consider
something like YAML before you invent something completely new.

 - R.


From Kapil.Dukle at med.ge.com  Tue Jul 31 10:49:27 2007
From: Kapil.Dukle at med.ge.com (Dukle, Kapil (GE Healthcare))
Date: Tue, 31 Jul 2007 13:49:27 -0400
Subject: [ofa-general] UDAPL code examples
Message-ID: <DE4D96C8DFF3B94BACC3B6FE3B7D140104D925D6@CINMLVEM11.e2k.ad.ge.com>


Hi all,

Does the OFED distribution have examples/code samples on how UDAPL can
be used? The examples I looked at in perftest directly use Verbs API calls.

Thanks,


From swise at opengridcomputing.com  Tue Jul 31 10:57:08 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 31 Jul 2007 12:57:08 -0500
Subject: [ofa-general] Re: [PATCH 2.6.23 1/2] Make the iw_cxgb3 module
	parameters writable.
In-Reply-To: <adad4y9skxm.fsf@cisco.com>
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
	<adad4y9skxm.fsf@cisco.com>
Message-ID: <46AF77F4.2000003@opengridcomputing.com>

Roland Dreier wrote:
> ugh, missed these before my last merge...
> 
> anyway:
> 
> why do we want to make the parameters writable?  a good changelog tells me
> what, why and how, and this changelog just covered the "what".  Also,
> I assume you've checked that it's OK for these variables to change at
> any time?

I want to be able to change these parameters at run time.  Eventually, 
we might want these as rdma connection setup parameters. 
For now, it's useful to be able to set them without reloading.

Also, it is safe to change them at any time.  All of these are read once 
and utilized at connection setup.  So changing them is safe in that 
existing connections aren't affected, and only subsequent connections 
will utilize the new values.
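
The "read once at connection setup" behaviour described above can be modelled with a short sketch. This is not the iw_cxgb3 code; `params`, `example_param`, and `Connection` are hypothetical stand-ins used only to show why a run-time write affects later connections and leaves existing ones alone.

```python
# Illustrative model only: a writable parameter that each connection
# snapshots once at setup time. Names are hypothetical, not driver code.

params = {"example_param": 0}  # stand-in for a writable module parameter


class Connection:
    def __init__(self) -> None:
        # The value is copied here, at connection setup, and never re-read.
        self.example_param = params["example_param"]


if __name__ == "__main__":
    first = Connection()            # sees the old value
    params["example_param"] = 1     # administrator changes it at run time
    second = Connection()           # sees the new value
    print(first.example_param, second.example_param)
```

The safety argument rests on the copy in the constructor: if a connection re-read the parameter after setup, a run-time write could change its behaviour mid-stream.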

Sorry for the terse changelog...


Steve.


From ardavis at ichips.intel.com  Tue Jul 31 11:08:14 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 31 Jul 2007 11:08:14 -0700
Subject: [ofa-general] UDAPL code examples
In-Reply-To: <DE4D96C8DFF3B94BACC3B6FE3B7D140104D925D6@CINMLVEM11.e2k.ad.ge.com>
References: <DE4D96C8DFF3B94BACC3B6FE3B7D140104D925D6@CINMLVEM11.e2k.ad.ge.com>
Message-ID: <46AF7A8E.8010902@ichips.intel.com>

Dukle, Kapil (GE Healthcare) wrote:

>
> Hi all,
>
> Does the OFED distribution have examples/code samples on how UDAPL can 
> be used? The examples I looked at in
> perftest directly use Verbs API calls.
>
Take a look at dtest (dapl/test/dtest/dtest.c) for a simple 
server/client example that does message sends, rdma writes, and rdma reads.

-arlin


From sean.hefty at intel.com  Tue Jul 31 11:14:03 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 11:14:03 -0700
Subject: [ofa-general] Re: I think that there is a resource leak in the
	corefile mad_rmpp.c
In-Reply-To: <46AC6B5C.6020702@dev.mellanox.co.il>
Message-ID: <000301c7d39e$9487cdd0$12c8180a@amr.corp.intel.com>

>It seems that the AHs which are being created in alloc_response_msg()
>(which is being called from
>ack_ds_ack()) are not being destroyed because the rmpp_type of this
>packet is
>IB_MGMT_RMPP_TYPE_ACK, so the destroy AH is not being executed.

Thanks for the clarification.

This is a dual-sided RMPP issue involving the direction switch ACK.
ib_rmpp_send_handler() needs to distinguish this ACK from normal ACKs.  I will
see if I can come up with a (simple) fix for this.

- Sean


From sashak at voltaire.com  Tue Jul 31 11:41:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 21:41:38 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <adavec0o4i8.fsf@cisco.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<20070723002010.GU27878@sashak.voltaire.com>
	<46A89608.9010709@dev.mellanox.co.il>
	<20070731160223.GF29844@sashak.voltaire.com>
	<adavec0o4i8.fsf@cisco.com>
Message-ID: <20070731184138.GJ29844@sashak.voltaire.com>

On 10:41 Tue 31 Jul     , Roland Dreier wrote:
> I think that defining a new file format is really going in the wrong
> direction.  XML would make a lot of sense (and you could use something
> like RELAX NG to define the schema very readably and precisely).  XML
> has the advantage that many parsers, GUI editors, and other tools are
> already widely available.
> 
> If you don't like XML for whatever reason, please at least consider
> something like YAML before you invent something completely new.

We don't have any XML or YAML config files yet. Personally, I would prefer
a human-readable/writable file format over a machine-oriented one, simply
because hand editing is still the main option for now and we don't have any
useful GUI management infrastructure.

Sasha


From swise at opengridcomputing.com  Tue Jul 31 12:12:32 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 31 Jul 2007 14:12:32 -0500
Subject: [ofa-general] patches for 1.2.c
Message-ID: <46AF89A0.9070805@opengridcomputing.com>

Guys,

I have 2 more patches to go in ofed_1_2/ofed_1_2_c.

Is there some grand scheme to the naming of kernel_patches/fixes/* for 
1.2.c?  I noticed a slew of new files for the post-2.6.22 fixes, and 
wondered if there is a naming scheme?

Or should I just post a patch for the ofed_1_2 branch and let you all 
create the ofed_1_2_c kernel_patches/fixes/ patch file ??

Thanks,

Steve.


From tziporet at dev.mellanox.co.il  Tue Jul 31 12:19:05 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 31 Jul 2007 22:19:05 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
References: <AcfElXnvIIwJJbMaQjCvFz5bwcoaYw==>	<6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
Message-ID: <46AF8B29.7090906@mellanox.co.il>

Hal Rosenstock wrote:
>> - ibstat does not present all entries. Use ibv_devinfo instead.
>>     
>
> What is missing  from ibstat for ConnectX ? What entries are missing ?
>
>   
See the report below.
If you can fix it, that would be great.

Tziporet

#> ibstat
CA 'mlx4_0'
        CA type:                              <=== missing
        Number of ports: 2
        Firmware version:                <=== missing
        Hardware version:                <=== missing
        Node GUID: 0x0002c903000004bc
        System image GUID: 0x0002c903000004bf
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02500868
                Port GUID: 0x0002c903000004bd
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02500868
                Port GUID: 0x0002c903000004be


From rdreier at cisco.com  Tue Jul 31 12:24:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 12:24:42 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <46AF8B29.7090906@mellanox.co.il> (Tziporet Koren's message of
	"Tue, 31 Jul 2007 22:19:05 +0300")
References: <AcfElXnvIIwJJbMaQjCvFz5bwcoaYw==>
	<6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
	<46AF8B29.7090906@mellanox.co.il>
Message-ID: <ada7iog8jgl.fsf@cisco.com>

 >        CA type:                              <=== missing
 >        Firmware version:                <=== missing
 >        Hardware version:                <=== missing

These need sysfs entries from the mlx4_ib driver, I guess.
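
As a rough sketch of why the fields come up empty: a tool such as ibstat reads per-device attribute files from sysfs, and a field stays blank when the driver does not create the file. The example below assumes the directory /sys/class/infiniband/<ca_name>/ and the attribute file names hca_type, fw_ver, and hw_rev as exposed by other drivers; treat both the path and the names as assumptions here, and `read_ca_attrs` as a hypothetical helper rather than libibumad code.

```python
# Illustrative only: read the three attributes that ibstat reports as
# missing. The sysfs path and file names are assumptions (see lead-in).
from pathlib import Path

ATTRS = {
    "CA type": "hca_type",
    "Firmware version": "fw_ver",
    "Hardware version": "hw_rev",
}


def read_ca_attrs(ca_name: str, sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Return label -> value; value is '' when the driver lacks the file."""
    ca_dir = Path(sysfs_root) / ca_name
    result = {}
    for label, filename in ATTRS.items():
        try:
            result[label] = (ca_dir / filename).read_text().strip()
        except OSError:
            result[label] = ""  # attribute not provided by the driver
    return result


if __name__ == "__main__":
    for label, value in read_ca_attrs("mlx4_0").items():
        print(f"{label}: {value or '<missing>'}")
```

Once the driver exports the files, a reader of this shape picks them up with no special casing per driver.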


From tziporet at dev.mellanox.co.il  Tue Jul 31 12:43:13 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 31 Jul 2007 22:43:13 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <ada7iog8jgl.fsf@cisco.com>
References: <AcfElXnvIIwJJbMaQjCvFz5bwcoaYw==>	<6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>	<46AF8B29.7090906@mellanox.co.il>
	<ada7iog8jgl.fsf@cisco.com>
Message-ID: <46AF90D1.8050000@mellanox.co.il>

Roland Dreier wrote:
>  >        CA type:                              <=== missing
>  >        Firmware version:                <=== missing
>  >        Hardware version:                <=== missing
>
> These need sysfs entries from the mlx4_ib driver, I guess.
>
>   
I think we have them but under drivers/net and not drivers/infiniband

Tziporet


From becker at nas.nasa.gov  Tue Jul 31 12:43:42 2007
From: becker at nas.nasa.gov (Jeff Becker)
Date: Tue, 31 Jul 2007 12:43:42 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46AE5901.7010307@ichips.intel.com>
References: <B0095134066CC94FBC80973103FFA1FE046A5CB3@orsmsx416.amr.corp.intel.com>
	<adamyy27cxk.fsf@cisco.com> <46956FF9.50102@ichips.intel.com>
	<46968448.2000401@ichips.intel.com>
	<46A536EC.4060201@ichips.intel.com> <adazm1dr653.fsf@cisco.com>
	<46AE5901.7010307@ichips.intel.com>
Message-ID: <795c49870707311243l4615b464v3b1b0f1479870684@mail.gmail.com>

Hi. Jeff Scott asked me to help with this. I've started thinking about
how to implement it, and I may have a first cut by the end of this
week.

-jeff

On 7/30/07, Arlin Davis <ardavis at ichips.intel.com> wrote:
> Roland Dreier wrote:
>
> > > Maintainers: please review the following proposal regarding new public
> > > download locations/website links and respond. This request originated
> > > from xwg.
> > >
> > > http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html
> >
> >I guess it's OK, but what's the difference between a README and a
> >WEB_README?
> >
> >Would it make sense to have just one file (maybe in a format that is
> >easily transformed to HTML, eg reStructuredText) for all purposes?
> >
> >
>
> That works for me. I was waiting to hear back from Jeff regarding a
> filename and content.
>
> Jeff, can you comment? What format will work best for you?
>
> -arlin
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From ardavis at ichips.intel.com  Tue Jul 31 13:28:44 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 31 Jul 2007 13:28:44 -0700
Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout
	module	parameter
In-Reply-To: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com>
References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com>
Message-ID: <46AF9B7C.7020906@ichips.intel.com>

Sean Hefty wrote:

>>OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We
>>are running into some cases on larger clusters that require longer timeouts
>>than the default. Can you consider this rdma_cm patch for OFED 1.2.1 that adds
>>a module parameter for the response timeout? Thanks.
>>    
>>
>
>What's in it for me?  :)
>
>  
>
>>Signed-off by: Arlin Davis <ardavis at ichips.intel.com>
>>    
>>
>
>Acked-by: Sean Hefty <sean.hefty at intel.com>
>
>Vlad, can you add this for OFED 1.2.1?
>
>- Sean
>  
>

Did this get added to 1.2.1?


From hal.rosenstock at gmail.com  Tue Jul 31 13:30:30 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 31 Jul 2007 16:30:30 -0400
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <46AF90D1.8050000@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
	<46AF8B29.7090906@mellanox.co.il> <ada7iog8jgl.fsf@cisco.com>
	<46AF90D1.8050000@mellanox.co.il>
Message-ID: <f0e08f230707311330q7104df21l3ead50003354810b@mail.gmail.com>

On 7/31/07, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
> Roland Dreier wrote:
> >  >        CA type:                              <=== missing
> >  >        Firmware version:                <=== missing
> >  >        Hardware version:                <=== missing
> >
> > These need sysfs entries from the mlx4_ib driver, I guess.
> >
> >
> I think we have them but under drivers/net and not drivers/infiniband

Why under drivers/net rather than drivers/infiniband like all the
other drivers ? Does this really need special casing (in libibumad) ?

-- Hal


> Tziporet
>


From rdreier at cisco.com  Tue Jul 31 13:33:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 13:33:54 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <f0e08f230707311330q7104df21l3ead50003354810b@mail.gmail.com>
	(Hal Rosenstock's message of "Tue, 31 Jul 2007 16:30:30 -0400")
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
	<46AF8B29.7090906@mellanox.co.il> <ada7iog8jgl.fsf@cisco.com>
	<46AF90D1.8050000@mellanox.co.il>
	<f0e08f230707311330q7104df21l3ead50003354810b@mail.gmail.com>
Message-ID: <adatzrk71ot.fsf@cisco.com>

 > Why under drivers/net rather than drivers/infiniband like all the
 > other drivers ? Does this really need special casing (in libibumad) ?

Tziporet is incorrect.  There's nothing from the mlx4_core driver
either, and when it is implemented, it should work exactly the same as
all other drivers.


From davem at systemfabricworks.com  Tue Jul 31 13:39:27 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Tue, 31 Jul 2007 15:39:27 -0500
Subject: [ofa-general] [PATCH] infiniband-diags: Add common flags -P, -C,
	and -t
Message-ID: <46AF9DFF.mailEUD1EC05T@systemfabricworks.com>


   Add common flags -P, -C, and -t to infiniband-diags programs and scripts to
   allow specifying the HCA port number, HCA device name, and query timeout.
   These diagnostic programs can now be directed to either different fabrics
   attached to the system, or forced to use different ports should the fabric
   fail and comparisons are needed.

   Two of these had conflicting prior use of the flags.  For the ibcheckerrs
   script, -T is now used to specify the threshold file.  In the saquery
   program, -p and -c are now used to specify getting the PathRecord info
   and getting the SA's class port info.

   Other than the resolution of the three conflicts, all commands behave exactly
   the same as they did before the change if these common flags are not used.
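
   The three common flags described above can be pictured with a small parsing sketch. This is not how the patch implements them (the real tools are shell scripts and C); `build_parser` and the destination names are hypothetical, and only the flag letters and their meanings come from the description.

```python
# Illustrative only: the common -C / -P / -t flags expressed with argparse.
# The actual infiniband-diags tools are shell scripts and C programs.
import argparse


def build_parser(prog: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("-C", dest="ca_name", help="use the specified ca_name")
    parser.add_argument("-P", dest="ca_port", type=int,
                        help="use the specified ca_port")
    parser.add_argument("-t", dest="timeout_ms", type=int,
                        help="override the default timeout for solicited MADs")
    return parser


if __name__ == "__main__":
    args = build_parser("ibexample").parse_args(
        ["-C", "mlx4_0", "-P", "2", "-t", "500"])
    print(args.ca_name, args.ca_port, args.timeout_ms)
```

   When none of the flags is given, every value stays `None`, which mirrors the statement that the commands behave as before unless the common flags are used.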

Signed-off-by: David A. McMillen <davem at systemfabricworks.com>
---
 infiniband-diags/man/dump_lfts.8             |   12 +++++-
 infiniband-diags/man/dump_mfts.8             |   13 ++++++-
 infiniband-diags/man/ibcheckerrors.8         |    9 ++++-
 infiniband-diags/man/ibcheckerrs.8           |   15 ++++++-
 infiniband-diags/man/ibchecknet.8            |    9 ++++-
 infiniband-diags/man/ibchecknode.8           |    9 ++++-
 infiniband-diags/man/ibcheckport.8           |    9 ++++-
 infiniband-diags/man/ibcheckportstate.8      |    9 ++++-
 infiniband-diags/man/ibcheckportwidth.8      |    9 ++++-
 infiniband-diags/man/ibcheckstate.8          |    9 ++++-
 infiniband-diags/man/ibcheckwidth.8          |   12 +++++-
 infiniband-diags/man/ibclearcounters.8       |   11 ++++-
 infiniband-diags/man/ibclearerrors.8         |   11 ++++-
 infiniband-diags/man/ibdatacounters.8        |    8 +++-
 infiniband-diags/man/ibdatacounts.8          |    9 ++++-
 infiniband-diags/man/ibhosts.8               |   12 +++++-
 infiniband-diags/man/ibnodes.8               |   12 +++++-
 infiniband-diags/man/ibrouters.8             |   12 +++++-
 infiniband-diags/man/ibswitches.8            |   12 +++++-
 infiniband-diags/man/saquery.8               |   15 ++++++-
 infiniband-diags/scripts/dump_lfts.sh        |   49 ++++++++++++++++++----
 infiniband-diags/scripts/dump_mfts.sh        |   49 ++++++++++++++++++----
 infiniband-diags/scripts/ibcheckerrors.in    |   34 ++++++++++++----
 infiniband-diags/scripts/ibcheckerrs.in      |   27 ++++++++++---
 infiniband-diags/scripts/ibchecknet.in       |   36 ++++++++++++----
 infiniband-diags/scripts/ibchecknode.in      |   22 ++++++++--
 infiniband-diags/scripts/ibcheckport.in      |   22 ++++++++--
 infiniband-diags/scripts/ibcheckportstate.in |   22 ++++++++--
 infiniband-diags/scripts/ibcheckportwidth.in |   22 ++++++++--
 infiniband-diags/scripts/ibcheckstate.in     |   32 +++++++++++---
 infiniband-diags/scripts/ibcheckwidth.in     |   32 +++++++++++---
 infiniband-diags/scripts/ibclearcounters.in  |   32 +++++++++++---
 infiniband-diags/scripts/ibclearerrors.in    |   30 +++++++++++---
 infiniband-diags/scripts/ibdatacounters.in   |   34 ++++++++++++----
 infiniband-diags/scripts/ibdatacounts.in     |   25 +++++++++--
 infiniband-diags/scripts/ibhosts.in          |   30 +++++++++++--
 infiniband-diags/scripts/ibnodes.in          |    2 +-
 infiniband-diags/scripts/ibrouters.in        |   30 +++++++++++--
 infiniband-diags/scripts/ibswitches.in       |   30 +++++++++++--
 infiniband-diags/src/saquery.c               |   56 ++++++++++++++++++++------
 40 files changed, 680 insertions(+), 153 deletions(-)

diff --git a/infiniband-diags/man/dump_lfts.8 b/infiniband-diags/man/dump_lfts.8
index c1458b3..091de41 100644
--- a/infiniband-diags/man/dump_lfts.8
+++ b/infiniband-diags/man/dump_lfts.8
@@ -5,7 +5,8 @@ dump_lfts.sh \- dump InfiniBand unicast forwarding tables
 
 .SH SYNOPSIS
 .B dump_lfts.sh
-[\-h] [\-D] [>/path/to/dump-file]
+[\-h] [\-D] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [>/path/to/dump-file]
+
 
 .SH DESCRIPTION
 .PP
@@ -24,6 +25,15 @@ dump forwarding tables using direct routed rather than LID routed SMPs
 .TP
 \fB\-h\fR
 show help
+.TP
+\fB\-C\fR <ca_name>
+use the specified ca_name.
+.TP
+\fB\-P\fR <ca_port>
+use the specified ca_port.
+.TP
+\fB\-t\fR <timeout_ms>
+override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR dump_mfts(8),
diff --git a/infiniband-diags/man/dump_mfts.8 b/infiniband-diags/man/dump_mfts.8
index fc8bc2e..90dd2ac 100644
--- a/infiniband-diags/man/dump_mfts.8
+++ b/infiniband-diags/man/dump_mfts.8
@@ -5,7 +5,8 @@ dump_lfts.sh \- dump InfiniBand multicast forwarding tables
 
 .SH SYNOPSIS
 .B dump_mfts.sh
-[\-h] [\-D] [>/path/to/file]
+[\-h] [\-D] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms]
+[>/path/to/file]
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +22,16 @@ dump forwarding tables using direct routed rather than LID routed SMPs
 .TP
 \fB\-h\fR
 show help
+.TP
+\fB\-C\fR <ca_name>
+use the specified ca_name.
+.TP
+\fB\-P\fR <ca_port>
+use the specified ca_port.
+.TP
+\fB\-t\fR <timeout_ms>
+override the default timeout for the solicited mads.
+
 
 .SH SEE ALSO
 .BR dump_lfts(8),
diff --git a/infiniband-diags/man/ibcheckerrors.8 b/infiniband-diags/man/ibcheckerrors.8
index 489d531..15b646f 100644
--- a/infiniband-diags/man/ibcheckerrors.8
+++ b/infiniband-diags/man/ibcheckerrors.8
@@ -5,7 +5,8 @@ ibcheckerrors \- validate IB subnet and report errors
 
 .SH SYNOPSIS
 .B ibcheckerrors
-[\-h] [\-b] [\-v] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-b] [\-v] [\-N | \-nocolor] [<topology-file> | \-C ca_name
+\-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +22,12 @@ errors (from port counters).
      not what they are.
 .PP
 \-N | \-nocolor	use mono rather than color mode
+.PP
+\-C <ca_name>   use the specified ca_name.
+.PP
+\-P <ca_port>   use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibcheckerrs.8 b/infiniband-diags/man/ibcheckerrs.8
index 7b22163..f901889 100644
--- a/infiniband-diags/man/ibcheckerrs.8
+++ b/infiniband-diags/man/ibcheckerrs.8
@@ -5,7 +5,10 @@ ibcheckerrs \- validate IB port (or node) and report errors in counters above th
 
 .SH SYNOPSIS
 .B ibcheckerrs
-[\-h] [\-b] [\-v] [\-G] [\-t <threshold_file>] [\-s(how_thresholds)] [\-N | \-nocolor] <lid|guid> <port>
+[\-h] [\-b] [\-v] [\-G] [\-T <threshold_file>] [\-s(how_thresholds)]
+[\-N | \-nocolor] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms]
+<lid|guid> <port>
+
 
 .SH DESCRIPTION
 .PP
@@ -23,7 +26,7 @@ specified using the -t <file> option.
 .PP
 \-s      show predefined thresholds
 .PP
-\-t      use specified threshold file
+\-T      use specified threshold file
 .PP
 \-v      increase the verbosity level
 .PP
@@ -31,6 +34,12 @@ specified using the -t <file> option.
         present, not what they are.
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
@@ -38,7 +47,7 @@ ibcheckerrs 2           # check aggregated node counter for lid 2
 .PP
 ibcheckerrs 2   4       # check port counters for lid 2 port 4
 .PP
-ibcheckerrs -t xxx 2    # check node using xxx threshold file
+ibcheckerrs -T xxx 2    # check node using xxx threshold file
 
 .SH SEE ALSO
 .BR perfquery(8),
diff --git a/infiniband-diags/man/ibchecknet.8 b/infiniband-diags/man/ibchecknet.8
index ddeccc8..375427b 100644
--- a/infiniband-diags/man/ibchecknet.8
+++ b/infiniband-diags/man/ibchecknet.8
@@ -5,7 +5,8 @@ ibchecknet \- validate IB subnet and report errors
 
 .SH SYNOPSIS
 .B ibchecknet
-[\-h] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-N | \-nocolor] [<topology-file> | \-C ca_name \-P ca_port
+\-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -16,6 +17,12 @@ reports errors (from port counters).
 .SH OPTIONS
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibchecknode.8 b/infiniband-diags/man/ibchecknode.8
index ad1e88b..ecd8bf9 100644
--- a/infiniband-diags/man/ibchecknode.8
+++ b/infiniband-diags/man/ibchecknode.8
@@ -5,7 +5,8 @@ ibchecknode \- validate IB node and report errors
 
 .SH SYNOPSIS
 .B ibchecknode
-[\-h] [\-v] [\-N | \-nocolor] [\-G] <lid|guid>
+[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port]
+[\-t(imeout) timeout_ms] <lid|guid>
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address.
 \-v      increase the verbosity level
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
diff --git a/infiniband-diags/man/ibcheckport.8 b/infiniband-diags/man/ibcheckport.8
index 3a18f21..08166c3 100644
--- a/infiniband-diags/man/ibcheckport.8
+++ b/infiniband-diags/man/ibcheckport.8
@@ -5,7 +5,8 @@ ibcheckport \- validate IB port and report errors
 
 .SH SYNOPSIS
 .B ibcheckport
-[\-h] [\-v] [\-N | \-nocolor] [\-G] <lid|guid> <port>
+[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port]
+[\-t(imeout) timeout_ms]  <lid|guid> <port>
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address.
 \-v      increase the verbosity level
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
diff --git a/infiniband-diags/man/ibcheckportstate.8 b/infiniband-diags/man/ibcheckportstate.8
index 139da57..4c70f16 100644
--- a/infiniband-diags/man/ibcheckportstate.8
+++ b/infiniband-diags/man/ibcheckportstate.8
@@ -5,7 +5,8 @@ ibcheckportstate \- validate IB port for LinkUp and not Active state
 
 .SH SYNOPSIS
 .B ibcheckportstate
-[\-h] [\-v] [\-N | \-nocolor] [\-G] <lid|guid> <port>
+[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port]
+[\-t(imeout) timeout_ms] <lid|guid> <port>
 
 .SH DESCRIPTION
 .PP
@@ -22,6 +23,12 @@ Port address is a lid unless -G option is used to specify a GUID address.
 \-v      increase the verbosity level
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
diff --git a/infiniband-diags/man/ibcheckportwidth.8 b/infiniband-diags/man/ibcheckportwidth.8
index 304e345..541be8a 100644
--- a/infiniband-diags/man/ibcheckportwidth.8
+++ b/infiniband-diags/man/ibcheckportwidth.8
@@ -5,7 +5,8 @@ ibcheckportwidth \- validate IB port for 1x link width
 
 .SH SYNOPSIS
 .B ibcheckport
-[\-h] [\-v] [\-N | \-nocolor] [\-G] <lid|guid> <port>
+[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port]
+[\-t(imeout) timeout_ms]  <lid|guid> <port>
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address.
 \-v      increase the verbosity level
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
diff --git a/infiniband-diags/man/ibcheckstate.8 b/infiniband-diags/man/ibcheckstate.8
index 5cb41c9..e718979 100644
--- a/infiniband-diags/man/ibcheckstate.8
+++ b/infiniband-diags/man/ibcheckstate.8
@@ -5,7 +5,8 @@ ibcheckstate \- find ports in IB subnet which are link up but not active
 
 .SH SYNOPSIS
 .B ibcheckstate
-[\-h] [\-v] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-v] [\-N | \-nocolor] [<topology-file> | \-C ca_name \-P ca_port
+\-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -17,6 +18,12 @@ a port physical state other than LinkUp.
 .SH OPTIONS
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibcheckwidth.8 b/infiniband-diags/man/ibcheckwidth.8
index 5a3b1df..da9a70b 100644
--- a/infiniband-diags/man/ibcheckwidth.8
+++ b/infiniband-diags/man/ibcheckwidth.8
@@ -5,7 +5,9 @@ ibcheckwidth \- find 1x links in IB subnet
 
 .SH SYNOPSIS
 .B ibcheckwidth
-[\-h] [\-v] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-v] [\-N | \-nocolor] [<topology-file> | \-C ca_name
+\-P ca_port \-t(imeout) timeout_ms]
+
 
 .SH DESCRIPTION
 .PP
@@ -15,7 +17,13 @@ reports any 1x links.
 
 .SH OPTIONS
 .PP
-\-N | \-nocolor use mono rather than color mode
+\-N | \-nocolor  use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibclearcounters.8 b/infiniband-diags/man/ibclearcounters.8
index 96ed8fa..d14e038 100644
--- a/infiniband-diags/man/ibclearcounters.8
+++ b/infiniband-diags/man/ibclearcounters.8
@@ -5,7 +5,8 @@ ibclearcounters \- clear port counters in IB subnet
 
 .SH SYNOPSIS
 .B ibclearcounters
-[\-h] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-N | \-nocolor] [<topology-file> | \-C ca_name
+\-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -14,7 +15,13 @@ the IB subnet topology or using an already saved topology file.
 
 .SH OPTIONS
 .PP
-\-N | \-nocolor use mono rather than color mode
+\-N | \-nocolor  use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibclearerrors.8 b/infiniband-diags/man/ibclearerrors.8
index 6479f9c..58f73d9 100644
--- a/infiniband-diags/man/ibclearerrors.8
+++ b/infiniband-diags/man/ibclearerrors.8
@@ -5,7 +5,8 @@ ibclearerrors \- clear error counters in IB subnet
 
 .SH SYNOPSIS
 .B ibclearerrors
-[\-h] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-N | \-nocolor] [<topology-file> | \-C ca_name \-P ca_port
+\-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -15,7 +16,13 @@ file.
 
 .SH OPTIONS
 .PP
-\-N | \-nocolor use mono rather than color mode
+\-N | \-nocolor  use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibdatacounters.8 b/infiniband-diags/man/ibdatacounters.8
index 7d562a0..309a8f2 100644
--- a/infiniband-diags/man/ibdatacounters.8
+++ b/infiniband-diags/man/ibdatacounters.8
@@ -5,7 +5,7 @@ ibdatacounters \- query IB subnet for data counters
 
 .SH SYNOPSIS
 .B ibdatacounters
-[\-h] [\-b] [\-v] [\-N | \-nocolor] [<topology-file>]
+[\-h] [\-b] [\-v] [\-N | \-nocolor] [<topology-file> | \-C ca_name \-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
@@ -21,6 +21,12 @@ the data counters (from port counters).
      not what they are.
 .PP
 \-N | \-nocolor	use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH SEE ALSO
 .BR ibnetdiscover(8),
diff --git a/infiniband-diags/man/ibdatacounts.8 b/infiniband-diags/man/ibdatacounts.8
index 8a731a6..8b995f8 100644
--- a/infiniband-diags/man/ibdatacounts.8
+++ b/infiniband-diags/man/ibdatacounts.8
@@ -5,7 +5,8 @@ ibdatacounts \- get IB port data counters
 
 .SH SYNOPSIS
 .B ibdatacounts
-[\-h] [\-b] [\-v] [\-G] [\-N | \-nocolor] <lid|guid> [<port>]
+[\-h] [\-b] [\-v] [\-G] [\-N | \-nocolor] [\-C ca_name] [\-P ca_port]
+[\-t(imeout) timeout_ms] <lid|guid> [<port>]
 
 .SH DESCRIPTION
 .PP
@@ -24,6 +25,12 @@ address.
 \-b      brief mode
 .PP
 \-N | \-nocolor use mono rather than color mode
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
 
 .SH EXAMPLE
 .PP
diff --git a/infiniband-diags/man/ibhosts.8 b/infiniband-diags/man/ibhosts.8
index 31788fc..9d7fe9a 100644
--- a/infiniband-diags/man/ibhosts.8
+++ b/infiniband-diags/man/ibhosts.8
@@ -5,13 +5,23 @@ ibhosts \- show InfiniBand host nodes in topology
 
 .SH SYNOPSIS
 .B ibhosts
-[\-h] [<topology-file>]
+[\-h] [<topology-file> | \-C ca_name \-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
 ibhosts is a script which either walks the IB subnet topology or uses an 
 already saved topology file and extracts the CA nodes.
 
+.SH OPTIONS
+.PP
+\-h      show the usage message
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
+
 .SH SEE ALSO
 .BR ibnetdiscover(8)
 
diff --git a/infiniband-diags/man/ibnodes.8 b/infiniband-diags/man/ibnodes.8
index fdd394c..dc59ca2 100644
--- a/infiniband-diags/man/ibnodes.8
+++ b/infiniband-diags/man/ibnodes.8
@@ -5,14 +5,24 @@ ibnodes \- show InfiniBand nodes in topology
 
 .SH SYNOPSIS
 .B ibnodes
-[<topology-file>]
+[\-h] [<topology-file> | \-C ca_name \-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
 ibnodes is a script which either walks the IB subnet topology or uses an 
 already saved topology file and extracts the IB nodes (CAs and switches).
 
+.SH OPTIONS
+.PP
+\-h      show the usage message
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
+
 .SH SEE ALSO
 .BR ibnetdiscover(8)
 
 .SH AUTHOR
diff --git a/infiniband-diags/man/ibrouters.8 b/infiniband-diags/man/ibrouters.8
index 068a2d9..698e0ee 100644
--- a/infiniband-diags/man/ibrouters.8
+++ b/infiniband-diags/man/ibrouters.8
@@ -5,13 +5,23 @@ ibrouters \- show InfiniBand router nodes in topology
 
 .SH SYNOPSIS
 .B ibrouters
-[\-h] [<topology-file>]
+[\-h] [<topology-file> | \-C ca_name \-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
 ibrouters is a script which either walks the IB subnet topology or uses an 
 already saved topology file and extracts the Rt nodes.
 
+.SH OPTIONS
+.PP
+\-h      show the usage message
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
+
 .SH SEE ALSO
 .BR ibnetdiscover(8)
 
diff --git a/infiniband-diags/man/ibswitches.8 b/infiniband-diags/man/ibswitches.8
index c9d3650..0929240 100644
--- a/infiniband-diags/man/ibswitches.8
+++ b/infiniband-diags/man/ibswitches.8
@@ -5,13 +5,23 @@ ibswitches\- show InfiniBand switch nodes in topology
 
 .SH SYNOPSIS
 .B ibswitches
-[\-h] [<topology-file>]
+[\-h] [<topology-file> | \-C ca_name \-P ca_port \-t(imeout) timeout_ms]
 
 .SH DESCRIPTION
 .PP
 ibswitches is a script which either walks the IB subnet topology or uses an 
 already saved topology file and extracts the switch nodes.
 
+.SH OPTIONS
+.PP
+\-h      show the usage message
+.PP
+\-C <ca_name>    use the specified ca_name.
+.PP
+\-P <ca_port>    use the specified ca_port.
+.PP
+\-t <timeout_ms> override the default timeout for the solicited mads.
+
 .SH SEE ALSO
 .BR ibnetdiscover(8)
 
diff --git a/infiniband-diags/man/saquery.8 b/infiniband-diags/man/saquery.8
index 535851f..5558cc9 100644
--- a/infiniband-diags/man/saquery.8
+++ b/infiniband-diags/man/saquery.8
@@ -5,7 +5,10 @@ saquery \- query InfiniBand subnet administration attributes
 
 .SH SYNOPSIS
 .B saquery 
-[\-h] [\-d] [\-P] [\-N] [\-\-list | \-D] [\-S] [\-I] [\-L] [\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst <src:dst>] [\-t(imeout) <msec>] [\-\-switch\-map <switch-map>] [<name> | <lid> | <guid>]
+[\-h] [\-d] [\-p] [\-N] [\-\-list | \-D] [\-S] [\-I] [\-L] [\-l] [\-G] [\-O]
+[\-U] [\-c] [\-s] [\-g] [\-m] [--src-to-dst <src:dst>] [\-C ca_name]
+[\-P ca_port] [\-t(imeout) <msec>] [\-\-switch\-map <switch-map>]
+[<name> | <lid> | <guid>]
 
 .SH DESCRIPTION
 .PP
@@ -15,7 +18,7 @@ saquery issues the selected SA query. Node records are queried by default.
 
 .PP
 .TP
-\fB\-P\fR
+\fB\-p\fR
 get PathRecord info
 .TP
 \fB\-N\fR
@@ -45,7 +48,7 @@ return the name for the Lid specified
 \fB\-U\fR
 return the name for the Guid specified
 .TP
-\fB\-C\fR
+\fB\-c\fR
 get the SA's class port info
 .TP
 \fB\-s\fR
@@ -63,6 +66,12 @@ description for each entry. Example: saquery -m 0xc000
 get a PathRecord for <src:dst>
 where src and dst are either node names or LIDs
 .TP
+\fB\-C\fR <ca_name>
+use the specified ca_name.
+.TP
+\fB\-P\fR <ca_port>
+use the specified ca_port.
+.TP
 \fB\-t\fR, \fB\-timeout\fR <msec>
 Specify SA query response timeout in milliseconds.
 Default is 100 milliseconds. You may want to use
diff --git a/infiniband-diags/scripts/dump_lfts.sh b/infiniband-diags/scripts/dump_lfts.sh
index 49e86da..67a307c 100755
--- a/infiniband-diags/scripts/dump_lfts.sh
+++ b/infiniband-diags/scripts/dump_lfts.sh
@@ -7,35 +7,66 @@
 
 usage ()
 {
-	echo "usage: $0 [-D]"
+	echo Usage: `basename $0` "[-h] [-D] [-C ca_name]" \
+	    "[-P ca_port] [-t(imeout) timeout_ms]"
 	exit 2
 }
 
 dump_by_lid ()
 {
-for sw_lid in `ibswitches \
+for sw_lid in `ibswitches $ca_info \
 		| sed -ne 's/^.* lid \([0-9a-f]*\) .*$/\1/p'` ; do
-	ibroute $sw_lid
+	ibroute $ca_info $sw_lid
 done
 }
 
 dump_by_dr_path ()
 {
-for sw_dr in `ibnetdiscover -v \
+for sw_dr in `ibnetdiscover $ca_info -v \
 		| sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
 		| sed -e 's/\]\[/,/g' \
 		| sort -u` ; do
-	ibroute -D ${sw_dr}
+	ibroute $ca_info -D ${sw_dr}
 done
 }
 
+use_d=""
+ca_info=""
 
-if [ "$1" = "-D" ] ; then
+while [ "$1" ]; do
+	case $1 in
+	-D)
+		use_d="-D"
+		;;
+	-h)
+		usage
+		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
+	-*)
+		usage
+		;;
+	*)
+		usage
+		;;
+	esac
+	shift
+done
+
+if [ "$use_d" = "-D" ] ; then
 	dump_by_dr_path
-elif [ -z "$1" ] ; then
-	dump_by_lid
 else
-	usage
+	dump_by_lid
 fi
 
 exit
diff --git a/infiniband-diags/scripts/dump_mfts.sh b/infiniband-diags/scripts/dump_mfts.sh
index 20281e8..39fc5fb 100755
--- a/infiniband-diags/scripts/dump_mfts.sh
+++ b/infiniband-diags/scripts/dump_mfts.sh
@@ -7,35 +7,66 @@
 
 usage ()
 {
-	echo "usage: $0 [-D]"
+	echo Usage: `basename $0` "[-h] [-D] [-C ca_name]" \
+	    "[-P ca_port] [-t(imeout) timeout_ms]"
 	exit 2
 }
 
 dump_by_lid ()
 {
-for sw_lid in `ibswitches \
+for sw_lid in `ibswitches $ca_info \
 		| sed -ne 's/^.* lid \([0-9a-f]*\) .*$/\1/p'` ; do
-	ibroute -M $sw_lid
+	ibroute $ca_info -M $sw_lid
 done
 }
 
 dump_by_dr_path ()
 {
-for sw_dr in `ibnetdiscover -v \
+for sw_dr in `ibnetdiscover $ca_info -v \
 		| sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
 		| sed -e 's/\]\[/,/g' \
 		| sort -u` ; do
-	ibroute -D ${sw_dr}
+	ibroute $ca_info -M -D ${sw_dr}
 done
 }
 
+use_d=""
+ca_info=""
 
-if [ "$1" = "-D" ] ; then
+while [ "$1" ]; do
+	case $1 in
+	-D)
+		use_d="-D"
+		;;
+	-h)
+		usage
+		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
+	-*)
+		usage
+		;;
+	*)
+		usage
+		;;
+	esac
+	shift
+done
+
+if [ "$use_d" = "-D" ] ; then
 	dump_by_dr_path
-elif [ -z "$1" ] ; then
-	dump_by_lid
 else
-	usage
+	dump_by_lid
 fi
 
 exit
diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in
index e08eba3..01c7a99 100644
--- a/infiniband-diags/scripts/ibcheckerrors.in
+++ b/infiniband-diags/scripts/ibcheckerrors.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-b] [-v] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-b] [-v] [-N | -nocolor]"\
+	    "[<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -21,6 +22,8 @@ v=0
 ntype=""
 nodeguid=""
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -39,20 +42,35 @@ while [ "$1" ]; do
 		brief=-b
 		verbose=""
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -62,12 +80,12 @@ BEGIN {
 function check_node(lid)
 {
 	nodechecked=1
-	if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) {
+	if (system("'$IBPATH'/ibchecknode '"$ca_info"' '$gflags' '$verbose' " lid)) {
 		ne++
 		badnode=1
 		return
 	}
-	if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " 255"))
+	if (system("'$IBPATH'/ibcheckerrs '"$ca_info"' '$gflags' '$verbose' '$brief' " lid " 255"))
 		nodeerr=1;
 }
 
@@ -102,7 +120,7 @@ function check_node(lid)
 		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (nodeerr)
-			if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " " port)) {
+			if (system("'$IBPATH'/ibcheckerrs '"$ca_info"' '$gflags' '$verbose' '$brief' " lid " " port)) {
 				if (!'$v' && oldlid != lid) {
 					print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 					oldlid = lid
diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in
index ff3256b..99d45cd 100644
--- a/infiniband-diags/scripts/ibcheckerrs.in
+++ b/infiniband-diags/scripts/ibcheckerrs.in
@@ -3,7 +3,9 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-t <threshold_file>] [-s(how_thresholds)] [-N \| -nocolor] <lid|guid> [<port>]"
+	echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-T <threshold_file>]" \
+	    "[-s(how_thresholds)] [-N | -nocolor] [-C ca_name] [-P ca_port]" \
+	    "[-t(imeout) timeout_ms] <lid|guid> [<port>]"
 	exit -1
 }
 
@@ -64,6 +66,7 @@ guid_addr=""
 bw=""
 verbose=""
 brief=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -81,7 +84,7 @@ while [ "$1" ]; do
 		brief=yes
 		verbose=""
 		;;
-	-t)
+	-T)
 		if ! [ -r $2 ]; then
 			echo "Can't use threshold file '$2'"
 			usage
@@ -93,6 +96,18 @@ while [ "$1" ]; do
 		show_thresholds
 		exit 0
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -121,7 +136,7 @@ else
 fi
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -129,16 +144,16 @@ if [ "$guid_addr" ]; then
 	guid=$1
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
 	fi
 fi
 
-nodename=`smpquery nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
+nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
 
-if $IBPATH/perfquery $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
+if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
 function blue(s)
 {
 	if (brief == "yes") {
diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in
index 9f36742..e2f7fb8 100644
--- a/infiniband-diags/scripts/ibchecknet.in
+++ b/infiniband-diags/scripts/ibchecknet.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \
+	    "[<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -18,6 +19,8 @@ gflags=""
 verbose=""
 v=0
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -31,20 +34,35 @@ while [ "$1" ]; do
 		verbose=-v
 		v=0
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -55,12 +73,12 @@ BEGIN {
 function check_node(lid)
 {
 	nodechecked=1
-	if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) {
+	if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) {
 		ne++
 		badnode=1
 		return
 	}
-	if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' " lid " 255"))
+	if (system("'$IBPATH'/ibcheckerrs'"$ca_info"' '$gflags' '$verbose' " lid " 255"))
 		nodeerr=1;
 }
 
@@ -94,7 +112,7 @@ function check_node(lid)
 		}
 		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
-		if (system("'$IBPATH'/ibcheckport '$gflags' '$verbose' " lid " " port)) {
+		if (system("'$IBPATH'/ibcheckport'"$ca_info"' '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
 				print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 				oldlid = lid
@@ -103,7 +121,7 @@ function check_node(lid)
 		}
 
 		if (nodeerr)
-			if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' " lid " " port)) {
+			if (system("'$IBPATH'/ibcheckerrs'"$ca_info"' '$gflags' '$verbose' " lid " " port)) {
 				if (!'$v' && oldlid != lid) {
 					print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 					oldlid = lid
diff --git a/infiniband-diags/scripts/ibchecknode.in b/infiniband-diags/scripts/ibchecknode.in
index 9d3aaba..5eea7b5 100644
--- a/infiniband-diags/scripts/ibchecknode.in
+++ b/infiniband-diags/scripts/ibchecknode.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] <lid|guid>"
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \
+	    "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] <lid|guid>"
 	exit -1
 }
 
@@ -30,6 +31,7 @@ function red() {
 guid_addr=""
 bw=""
 verbose=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -42,6 +44,18 @@ while [ "$1" ]; do
 	-v)
 		verbose=yes
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -57,14 +71,14 @@ if [ -z "$1" ]; then
 fi
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
 	fi
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -73,7 +87,7 @@ fi
 
 ## For now, check node only checks if node info is replied
 
-if $IBPATH/smpquery nodeinfo $lid > /dev/null 2>&1 ; then
+if $IBPATH/smpquery $ca_info nodeinfo $lid > /dev/null 2>&1 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Node check lid $lid: "
 		green OK
diff --git a/infiniband-diags/scripts/ibcheckport.in b/infiniband-diags/scripts/ibcheckport.in
index f910fdc..3c7c396 100644
--- a/infiniband-diags/scripts/ibcheckport.in
+++ b/infiniband-diags/scripts/ibcheckport.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] <lid|guid> <port>"
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \
+	   "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] <lid|guid> <port>"
 	exit -1
 }
 
@@ -30,6 +31,7 @@ function red() {
 guid_addr=""
 bw=""
 verbose=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -42,6 +44,18 @@ while [ "$1" ]; do
 	-v)
 		verbose=yes
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -59,7 +73,7 @@ fi
 portnum=$2
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then
 	guid=$1
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -75,7 +89,7 @@ else
 fi
 
 
-if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
diff --git a/infiniband-diags/scripts/ibcheckportstate.in b/infiniband-diags/scripts/ibcheckportstate.in
index 3c36601..f3a5f05 100644
--- a/infiniband-diags/scripts/ibcheckportstate.in
+++ b/infiniband-diags/scripts/ibcheckportstate.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] <lid|guid> <port>"
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \
+	   "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] <lid|guid> <port>"
 	exit -1
 }
 
@@ -30,6 +31,7 @@ function red() {
 guid_addr=""
 bw=""
 verbose=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -42,6 +44,18 @@ while [ "$1" ]; do
 	-v)
 		verbose=yes
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -59,7 +73,7 @@ fi
 portnum=$2
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then
 	guid=$1
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -75,7 +89,7 @@ else
 fi
 
 
-if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
diff --git a/infiniband-diags/scripts/ibcheckportwidth.in b/infiniband-diags/scripts/ibcheckportwidth.in
index 5f6762e..fdc75d1 100644
--- a/infiniband-diags/scripts/ibcheckportwidth.in
+++ b/infiniband-diags/scripts/ibcheckportwidth.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] <lid|guid> <port>"
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \
+	   "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] <lid|guid> <port>"
 	exit -1
 }
 
@@ -30,6 +31,7 @@ function red() {
 guid_addr=""
 bw=""
 verbose=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -42,6 +44,18 @@ while [ "$1" ]; do
 	-v)
 		verbose=yes
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -59,7 +73,7 @@ fi
 portnum=$2
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then
 	guid=$1
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -75,7 +89,7 @@ else
 fi
 
 
-if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in
index 30b5513..944e139 100644
--- a/infiniband-diags/scripts/ibcheckstate.in
+++ b/infiniband-diags/scripts/ibcheckstate.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \
+	    "[<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -20,6 +21,8 @@ v=0
 ntype=""
 nodeguid=""
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -33,20 +36,35 @@ while [ "$1" ]; do
 		verbose=-v
 		v=1
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -57,7 +75,7 @@ BEGIN {
 function check_node(lid)
 {
 	nodechecked=1
-	if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) {
+	if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) {
 		ne++
 		badnode=1
 		return
@@ -93,7 +111,7 @@ function check_node(lid)
 		}
 		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
-		if (system("'$IBPATH'/ibcheckportstate '$gflags' '$verbose' " lid " " port)) {
+		if (system("'$IBPATH'/ibcheckportstate'"$ca_info"' '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
 				print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 				oldlid = lid
diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in
index 072d433..8ad0f7f 100644
--- a/infiniband-diags/scripts/ibcheckwidth.in
+++ b/infiniband-diags/scripts/ibcheckwidth.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \
+	    "[<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -20,6 +21,8 @@ v=0
 ntype=""
 nodeguid=""
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -33,20 +36,35 @@ while [ "$1" ]; do
 		verbose="-v"
 		v=1
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -57,7 +75,7 @@ BEGIN {
 function check_node(lid)
 {
 	nodechecked=1
-	if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) {
+	if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) {
 		ne++
 		badnode=1
 		return
@@ -93,7 +111,7 @@ function check_node(lid)
 		}
 		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
-		if (system("'$IBPATH'/ibcheckportwidth '$gflags' '$verbose' " lid " " port)) {
+		if (system("'$IBPATH'/ibcheckportwidth'"$ca_info"' '$gflags' '$verbose' " lid " " port)) {
 			if (!'$v' && oldlid != lid) {
 				print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 				oldlid = lid
diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in
index 54551b3..b3c009e 100644
--- a/infiniband-diags/scripts/ibclearcounters.in
+++ b/infiniband-diags/scripts/ibclearcounters.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-N | -nocolor] [<topology-file>" \
+	    "| -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -18,6 +19,8 @@ gflags=""
 verbose=""
 v=0
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -27,20 +30,35 @@ while [ "$1" ]; do
 	-N|-nocolor)
 		gflags=-N
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -48,14 +66,14 @@ eval $netcmd | awk '
 function clear_counters(lid)
 {
 	nodecleared=1
-	if (system("'$IBPATH'/perfquery '$gflags' -R -a " lid))
+	if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R -a " lid))
 		nodeerr++
 }
 
 function clear_port_counters(lid, port)
 {
 	nodecleared=1
-	if (system("'$IBPATH'/perfquery '$gflags' -R " lid " " port))
+	if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R " lid " " port))
 		nodeerr++
 }
 
diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in
index 4a086ae..097c3fe 100644
--- a/infiniband-diags/scripts/ibclearerrors.in
+++ b/infiniband-diags/scripts/ibclearerrors.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-N | -nocolor] [<topology-file>" \
+	    "| -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -18,6 +19,8 @@ gflags=""
 verbose=""
 v=0
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -27,20 +30,35 @@ while [ "$1" ]; do
 	-N|-nocolor)
 		gflags=-N
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -48,7 +66,7 @@ eval $netcmd | awk '
 function clear_errors(lid, port)
 {
 	nodecleared=1
-	if (system("'$IBPATH'/perfquery '$gflags' -R " lid " " port " 0x0fff"))
+	if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R " lid " " port " 0x0fff"))
 		nodeerr++
 }
 
diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in
index d27149e..bee9bd8 100644
--- a/infiniband-diags/scripts/ibdatacounters.in
+++ b/infiniband-diags/scripts/ibdatacounters.in
@@ -3,7 +3,8 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [-b] [-v] [-N \| -nocolor] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [-b] [-v] [-N | -nocolor]" \
+	    "[<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
@@ -21,6 +22,8 @@ v=0
 ntype=""
 nodeguid=""
 oldlid=""
+topofile=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -39,20 +42,35 @@ while [ "$1" ]; do
 		brief=-b
 		verbose=""
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
 	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
@@ -62,12 +80,12 @@ BEGIN {
 function check_node(lid)
 {
 	nodechecked=1
-	if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) {
+	if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) {
 		ne++
 		badnode=1
 		return
 	}
-	if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " 255"))
+	if (system("'$IBPATH'/ibdatacounts'"$ca_info"' '$gflags' '$verbose' '$brief' " lid " 255"))
 		nodeerr=1;
 }
 
@@ -102,7 +120,7 @@ function check_node(lid)
 		sub("\\(.*\\)", "", port)
 		gsub("[\\[\\]]", "", port)
 		if (nodeerr)
-			if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " " port)) {
+			if (system("'$IBPATH'/ibdatacounts'"$ca_info"' '$gflags' '$verbose' '$brief' " lid " " port)) {
 				if (!'$v' && oldlid != lid) {
 					print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure"
 					oldlid = lid
diff --git a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in
index 668558f..927a978 100644
--- a/infiniband-diags/scripts/ibdatacounts.in
+++ b/infiniband-diags/scripts/ibdatacounts.in
@@ -3,7 +3,9 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-N \| -nocolor] <lid|guid> [<port>]"
+	echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-N | -nocolor]" \
+	    "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] <lid|guid>" \
+	    "[<port>]"
 	exit -1
 }
 
@@ -31,6 +33,7 @@ guid_addr=""
 bw=""
 verbose=""
 brief=""
+ca_info=""
 
 while [ "$1" ]; do
 	case $1 in
@@ -48,6 +51,18 @@ while [ "$1" ]; do
 		brief=yes
 		verbose=""
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
@@ -76,7 +91,7 @@ else
 fi
 
 if [ "$guid_addr" ]; then
-	if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
+	if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then
 		echo -n "guid $1 address resolution: "
 		red "FAILED"
 		exit -1
@@ -84,16 +99,16 @@ if [ "$guid_addr" ]; then
 	guid=$1
 else
 	lid=$1
-	if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then
+	if ! temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then
 		echo -n "lid $1 address resolution: "
 		red "FAILED"
 		exit -1
 	fi
 fi
 
-nodename=`smpquery nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
+nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
 
-if $IBPATH/perfquery $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
+if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
 function blue(s)
 {
 	if (brief == "yes") {
diff --git a/infiniband-diags/scripts/ibhosts.in b/infiniband-diags/scripts/ibhosts.in
index b9aadc1..0d6b1bc 100644
--- a/infiniband-diags/scripts/ibhosts.in
+++ b/infiniband-diags/scripts/ibhosts.in
@@ -3,28 +3,48 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
+	    "-P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
+topofile=""
+ca_info=""
+
 while [ "$1" ]; do
 	case $1 in
 	-h)
 		usage
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
+	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
diff --git a/infiniband-diags/scripts/ibnodes.in b/infiniband-diags/scripts/ibnodes.in
index 32acd9c..5871da8 100644
--- a/infiniband-diags/scripts/ibnodes.in
+++ b/infiniband-diags/scripts/ibnodes.in
@@ -2,4 +2,4 @@
 
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
-$IBPATH/ibhosts; $IBPATH/ibswitches
+$IBPATH/ibhosts "$@"; $IBPATH/ibswitches "$@"
diff --git a/infiniband-diags/scripts/ibrouters.in b/infiniband-diags/scripts/ibrouters.in
index 96ebfe0..fea72bb 100644
--- a/infiniband-diags/scripts/ibrouters.in
+++ b/infiniband-diags/scripts/ibrouters.in
@@ -3,28 +3,48 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
+	    "-P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
+topofile=""
+ca_info=""
+
 while [ "$1" ]; do
 	case $1 in
 	-h)
 		usage
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
+	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
diff --git a/infiniband-diags/scripts/ibswitches.in b/infiniband-diags/scripts/ibswitches.in
index 2a92360..859aacd 100644
--- a/infiniband-diags/scripts/ibswitches.in
+++ b/infiniband-diags/scripts/ibswitches.in
@@ -3,28 +3,48 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 function usage() {
-	echo Usage: `basename $0` [-h] [\<topology-file\>]
+	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
+	    "-P ca_port -t(imeout) timeout_ms]"
 	exit -1
 }
 
+topofile=""
+ca_info=""
+
 while [ "$1" ]; do
 	case $1 in
 	-h)
 		usage
 		;;
+	-P | -C | -t | -timeout)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		ca_info="$ca_info $1 $2"
+		shift
+		;;
 	-*)
 		usage
 		;;
 	*)
-		break
+		if [ "$topofile" ]; then
+			usage
+		fi
+		topofile="$1"
 		;;
 	esac
+	shift
 done
 
-if [ "$1" ]; then
-	netcmd="cat $1"
+if [ "$topofile" ]; then
+	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover"
+	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
 eval $netcmd | awk '
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index daff824..522399e 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -73,6 +73,8 @@ osm_mad_pool_t     mad_pool;
 osm_vendor_t      *vendor = NULL;
 int                osm_debug = 0;
 uint32_t           sa_timeout_ms = DEFAULT_SA_TIMEOUT_MS;
+char		  *sa_hca_name = NULL;
+uint32_t           sa_port_num = 0;
 
 enum {
 	ALL,
@@ -137,7 +139,7 @@ print_node_record(ib_node_record_t *node_record)
 		if (p_ni->node_type == IB_NODE_TYPE_SWITCH)
 			name = lookup_switch_name(switch_map_fp,
 						  cl_ntoh64(p_ni->node_guid),
-						  p_nd->description);
+						  (char *)p_nd->description);
 		else
 			name = clean_nodedesc((char *)p_nd->description);
 		printf("%s\n", name);
@@ -956,6 +958,7 @@ get_bind_handle(void)
 	ib_api_status_t    status;
 	ib_port_attr_t     attr_array[MAX_PORTS];
 	uint32_t           num_ports = MAX_PORTS;
+	uint32_t           ca_name_index = 0;
 
 	complib_init();
 
@@ -985,6 +988,16 @@ get_bind_handle(void)
 	}
 
 	for (i = 0; i < num_ports; i++) {
+		if (i > 1 && cl_ntoh64(attr_array[i].port_guid)
+				!= (cl_ntoh64(attr_array[i-1].port_guid) + 1))
+			ca_name_index++;
+		if (sa_port_num && sa_port_num != attr_array[i].port_num)
+			continue;
+		if (sa_hca_name && i == 0)
+			continue;
+		if (sa_hca_name
+		 && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0)
+			continue;
 		if (attr_array[i].link_state == IB_LINK_ACTIVE)
 			port_guid = attr_array[i].port_guid;
 	}
@@ -1029,10 +1042,13 @@ clean_up(void)
 static void
 usage(void)
 {
-	fprintf(stderr, "Usage: %s [-h -d -P -N] [--list | -D] [-S -I -L -l -G -O -U -C -s -g -m --src-to-dst <src:dst> -t(imeout) <msec>] [<name> | <lid> | <guid>]\n", argv0);
+	fprintf(stderr, "Usage: %s [-h -d -p -N] [--list | -D] [-S -I -L -l -G"
+		" -O -U -c -s -g -m --src-to-dst <src:dst> -C <ca_name> "
+		"-P <ca_port> -t(imeout) <msec>] [<name> | <lid> | <guid>]\n",
+		argv0);
 	fprintf(stderr, "   Queries node records by default\n");
 	fprintf(stderr, "   -d enable debugging\n");
-	fprintf(stderr, "   -P get PathRecord info\n");
+	fprintf(stderr, "   -p get PathRecord info\n");
 	fprintf(stderr, "   -N get NodeRecord info\n");
 	fprintf(stderr, "   --list | -D the node desc of the CA's\n");
 	fprintf(stderr, "   -S get ServiceRecord info\n");
@@ -1042,15 +1058,21 @@ usage(void)
 	fprintf(stderr, "   -G return the Guids of the name specified\n");
 	fprintf(stderr, "   -O return name for the Lid specified\n");
 	fprintf(stderr, "   -U return name for the Guid specified\n");
-	fprintf(stderr, "   -C get the SA's class port info\n");
-	fprintf(stderr, "   -s return the PortInfoRecords with isSM or isSMdisabled capability mask bit on\n");
+	fprintf(stderr, "   -c get the SA's class port info\n");
+	fprintf(stderr, "   -s return the PortInfoRecords with isSM or "
+				"isSMdisabled capability mask bit on\n");
 	fprintf(stderr, "   -g get multicast group info\n");
 	fprintf(stderr, "   -m get multicast member info\n");
-	fprintf(stderr, "      (if multicast group specified, list member GIDs only for group specified\n");
+	fprintf(stderr, "      (if multicast group specified, list member GIDs"
+				" only for group specified\n");
 	fprintf(stderr, "      specified, for example 'saquery -m 0xC000')\n");
 	fprintf(stderr, "   --src-to-dst get a PathRecord for <src:dst>\n"
-			"                where src amd dst are either node names or LIDs\n");
-	fprintf(stderr, "   -t | --timeout <msec> specify the SA query response timeout (default %u msec)\n",
+			"                where src and dst are either node "
+				"names or LIDs\n");
+	fprintf(stderr, "   -C <ca_name> specify the SA query HCA\n");
+	fprintf(stderr, "   -P <ca_port> specify the SA query port\n");
+	fprintf(stderr, "   -t | --timeout <msec> specify the SA query "
+				"response timeout (default %u msec)\n",
 			DEFAULT_SA_TIMEOUT_MS);
 	fprintf(stderr, "   --switch-map <switch-map> specify a switch map\n");
 	exit(-1);
@@ -1068,9 +1090,9 @@ main(int argc, char **argv)
 	ib_net16_t         dst_lid;
 	ib_api_status_t    status;
 
-	static char const str_opts[] = "PVNDLlGOUCSIsgmdht:";
+	static char const str_opts[] = "pVNDLlGOUcSIsgmdhP:C:t:";
 	static const struct option long_opts [] = {
-	   {"P", 0, 0, 'P'},
+	   {"p", 0, 0, 'p'},
 	   {"Version", 0, 0, 'V'},
 	   {"N", 0, 0, 'N'},
 	   {"L", 0, 0, 'L'},
@@ -1082,9 +1104,11 @@ main(int argc, char **argv)
 	   {"g", 0, 0, 'g'},
 	   {"m", 0, 0, 'm'},
 	   {"d", 0, 0, 'd'},
-	   {"C", 0, 0, 'C'},
+	   {"c", 0, 0, 'c'},
 	   {"S", 0, 0, 'S'},
 	   {"I", 0, 0, 'I'},
+	   {"P", 1, 0, 'P'},
+	   {"C", 1, 0, 'C'},
 	   {"help", 0, 0, 'h'},
 	   {"list", 0, 0, 'D'},
 	   {"src-to-dst", 1, 0, 1},
@@ -1118,7 +1142,7 @@ main(int argc, char **argv)
 		case 2:
 			switch_map = strdup(optarg);
 			break;
-		case 'P':
+		case 'p':
 			query_type = IB_MAD_ATTR_PATH_RECORD;
 			break;
 		case 'V':
@@ -1127,7 +1151,7 @@ main(int argc, char **argv)
 		case 'D':
 			node_print_desc = ALL_DESC;
 			break;
-		case 'C':
+		case 'c':
 			query_type = IB_MAD_ATTR_CLASS_PORT_INFO;
 			break;
 		case 'S':
@@ -1167,6 +1191,12 @@ main(int argc, char **argv)
 		case 'd':
 			osm_debug = 1;
 			break;
+		case 'C':
+			sa_hca_name = optarg;
+			break;
+		case 'P':
+			sa_port_num = strtoul(optarg, NULL, 0);
+			break;
 		case 't':
 			sa_timeout_ms = strtoul(optarg, NULL, 0);
 			break;


From mst at dev.mellanox.co.il  Tue Jul 31 14:08:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 00:08:39 +0300
Subject: [ofa-general] Re: patches for 1.2.c
In-Reply-To: <46AF89A0.9070805@opengridcomputing.com>
References: <46AF89A0.9070805@opengridcomputing.com>
Message-ID: <20070731210839.GE20859@mellanox.co.il>

> Quoting Steve Wise <swise at opengridcomputing.com>:
> Subject: patches for 1.2.c
> 
> Guys,
> 
> I have 2 more patches to go in ofed_1_2/ofed_1_2_c.
> 
> Is there some grand scheme to the naming of kernel_patches/fixes/* for 
> 1.2.c?  I noticed a slew of new files for the post-2.6.22 fixes, and 
> wondered if there is a naming scheme?

Not really, just stick the module name in there please so it's
easy to figure that cxgb3 is involved.

> Or should I just post a patch for the ofed_1_2 branch and let you all 
> create the ofed_1_2_c kernel_patches/fixes/ patch file ??

It's best if you post the patch that should go into kernel_patches/fixes/,
or clone the ofed_1_2_c branch and add the file there.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 31 14:18:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 00:18:39 +0300
Subject: [ofa-general] Re: QoS RFC
In-Reply-To: <adavec0o4i8.fsf@cisco.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
	<20070723002010.GU27878@sashak.voltaire.com>
	<46A89608.9010709@dev.mellanox.co.il>
	<20070731160223.GF29844@sashak.voltaire.com>
	<adavec0o4i8.fsf@cisco.com>
Message-ID: <20070731211839.GH20859@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: QoS RFC
> 
> I think that defining a new file format is really going in the wrong
> direction.  XML would make a lot of sense (and you could use something
> like RELAX NG to define the schema very readably and precisely).  XML
> has the advantage that many parsers, GUI editors, and other tools are
> already widely available.
> 
> If you don't like XML for whatever reason, please at least consider
> something like YAML before you invent something completely new.

I second that.

-- 
MST


From mst at dev.mellanox.co.il  Tue Jul 31 14:56:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 00:56:47 +0300
Subject: [ofa-general] Re: OFED 1.2.c-9 is available
In-Reply-To: <adatzrk71ot.fsf@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
	<46AF8B29.7090906@mellanox.co.il> <ada7iog8jgl.fsf@cisco.com>
	<46AF90D1.8050000@mellanox.co.il>
	<f0e08f230707311330q7104df21l3ead50003354810b@mail.gmail.com>
	<adatzrk71ot.fsf@cisco.com>
Message-ID: <20070731215647.GB5290@mellanox.co.il>

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [ofa-general] Re: OFED 1.2.c-9 is available
> 
>  > Why under drivers/net rather than drivers/infiniband like all the
>  > other drivers ? Does this really need special casing (in libibumad) ?
> 
> Tziporet is incorrect.  There's nothing from the mlx4_core driver
> either, and when it is implemented, it should work exactly the same as
> all other drivers.

At some point you suggested sticking this stuff under the pci device and
adding softlinks under drivers/infiniband, so that
if there's an ethernet device on top of the core these can be shared.

Not sure how to do this though, and I have no idea why just adding
the attributes in both places would be any worse, either.

Comments?

-- 
MST


From sean.hefty at intel.com  Tue Jul 31 15:10:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 15:10:54 -0700
Subject: [ofa-general] [PATCH] ib/mad: fix address handle leak in mad_rmpp
In-Reply-To: <46AC6B5C.6020702@dev.mellanox.co.il>
Message-ID: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com>

The address handle associated with dual-sided RMPP direction
switch ACKs is never destroyed.  Free the AH for ACKs which
fall into this category.  Problem was reported by Dotan
Barak (dotanb at dev.mellanox.co.il).

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Dotan, can you verify that this fixes the problem for you?  (I tested against
osmtest as you indicated as well.)

Roland, this fix would be for 2.6.23.

 drivers/infiniband/core/mad_rmpp.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
index 3663fd7..d43bc62 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -163,8 +163,10 @@ static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent,
 				 hdr_len, 0, GFP_KERNEL);
 	if (IS_ERR(msg))
 		ib_destroy_ah(ah);
-	else
+	else {
 		msg->ah = ah;
+		msg->context[0] = ah;
+	}
 
 	return msg;
 }
@@ -197,9 +199,7 @@ static void ack_ds_ack(struct ib_mad_agent_private *agent,
 
 void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc)
 {
-	struct ib_rmpp_mad *rmpp_mad = mad_send_wc->send_buf->mad;
-
-	if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_ACK)
+	if (mad_send_wc->send_buf->context[0] == mad_send_wc->send_buf->ah)
 		ib_destroy_ah(mad_send_wc->send_buf->ah);
 	ib_free_send_mad(mad_send_wc->send_buf);
 }


From rdreier at cisco.com  Tue Jul 31 15:19:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 15:19:52 -0700
Subject: [ofa-general] Re: OFED 1.2.c-9 is available
In-Reply-To: <20070731215647.GB5290@mellanox.co.il> (Michael S. Tsirkin's
	message of "Wed, 1 Aug 2007 00:56:47 +0300")
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
	<f0e08f230707310927o19e1583dr958abd29180bc8ac@mail.gmail.com>
	<46AF8B29.7090906@mellanox.co.il> <ada7iog8jgl.fsf@cisco.com>
	<46AF90D1.8050000@mellanox.co.il>
	<f0e08f230707311330q7104df21l3ead50003354810b@mail.gmail.com>
	<adatzrk71ot.fsf@cisco.com> <20070731215647.GB5290@mellanox.co.il>
Message-ID: <adaabtc6ws7.fsf@cisco.com>

 > At some point you suggested sticking this stuff under the pci device and
 > adding softlinks under drivers/infiniband, so that
 > if there's an ethernet device on top of the core these can be shared.

 > Not sure how to do this though, and I have no idea why just adding
 > the attributes in both places would be any worse, either.

I didn't look at whether it's easy to create symlinks in sysfs.  I
don't really see any problem with just having both mlx4_ib and
mlx4_eth export the same data, and in fact that may make sense if
there is a different way mlx4_eth might want to export it (ethtool?).

However I definitely don't think we should force all userspace tools
to look in two different places in sysfs for information.  And if
there are some attributes that all devices support, then I guess we
should move the implementation of those attrs into core/sysfs.c.

 - R.


From rdreier at cisco.com  Tue Jul 31 15:20:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 15:20:18 -0700
Subject: [ofa-general] Re: [PATCH] ib/mad: fix address handle leak in
	mad_rmpp
In-Reply-To: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com> (Sean Hefty's
	message of "Tue, 31 Jul 2007 15:10:54 -0700")
References: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com>
Message-ID: <ada64406wrh.fsf@cisco.com>

 > Roland, this fix would be for 2.6.23.

OK, I'll wait for Dotan's ACK.


From sean.hefty at intel.com  Tue Jul 31 17:04:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 17:04:54 -0700
Subject: [ofa-general] RE: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from
	the host TCP port space.
In-Reply-To: <46ACF9DD.1010509@opengridcomputing.com>
Message-ID: <000001c7d3cf$97a0b050$9c98070a@amr.corp.intel.com>

>The correct solution in my mind is to use the host stack's TCP port
>space for _all_ RDMA_PS_TCP port allocations.   The patch below is a
>minimal delta to unify the port spaces by using the kernel stack to
>bind ports.  This is done by allocating a kernel socket and binding to
>the appropriate local addr/port.  It also allows the kernel stack to
>pick ephemeral ports by virtue of just passing in port 0 on the kernel
>bind operation.

I'm not thrilled with the idea of overlapping port spaces, and I can't come up
with a solution that works for all situations.  I understand the overlapping
port space problem, but I consider the ability to use the same port number for
both RDMA and sockets a feature.

What if MPI used a mechanism similar to SDP's?  That is, if it gets a port number
from sockets, it reserves that same RDMA port number, or vice-versa.  The
rdma_cm advertises separate port spaces from TCP/UDP, so IMO any assumption
otherwise, at this point, is a bug in the user's code.

Before merging the port spaces, I'd like a way for an application to use a
single well-known port number that works over both RDMA and sockets.

>RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Is there any reason to limit this behavior to TCP only, or would we also include
UDP?

>diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>index 9e0ab04..e4d2d7f 100644
>--- a/drivers/infiniband/core/cma.c
>+++ b/drivers/infiniband/core/cma.c
>@@ -111,6 +111,7 @@ struct rdma_id_private {
>  	struct rdma_cm_id	id;
>
>  	struct rdma_bind_list	*bind_list;
>+	struct socket		*sock;

This points off to a rather largish structure...

>  	struct hlist_node	node;
>  	struct list_head	list;
>  	struct list_head	listen_list;
>@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
>  		kfree(bind_list);
>  	}
>  	mutex_unlock(&lock);
>+	if (id_priv->sock)
>+		sock_release(id_priv->sock);
>  }
>
>  void rdma_destroy_id(struct rdma_cm_id *id)
>@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
>  	return 0;
>  }
>
>+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>+{
>+	int ret;
>+	struct socket *sock;
>+
>+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>+	if (ret)
>+		return ret;
>+	ret = sock->ops->bind(sock,
>+			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
>+			  ip_addr_size(&id_priv->id.route.addr.src_addr));
>+	if (ret) {
>+		sock_release(sock);
>+		return ret;
>+	}
>+	id_priv->sock = sock;
>+	return 0;
>+}
>+
>  static int cma_get_port(struct rdma_id_private *id_priv)
>  {
>  	struct idr *ps;
>@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
>  		break;
>  	case RDMA_PS_TCP:
>  		ps = &tcp_ps;
>+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
>+		if (ret)
>+			goto out;

Would we need tcp_ps (and udp_ps) anymore?  Also, I think SDP maps into the TCP
port space already, so changes to SDP will be needed as well, which may
eliminate its port space.

- Sean


From davem at systemfabricworks.com  Tue Jul 31 17:09:54 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Tue, 31 Jul 2007 19:09:54 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/scripts: Fix Bug 239 Error
	Reporting
Message-ID: <46AFCF52.mailFVL1I50O7@systemfabricworks.com>


   Fix Bug 239 OpenIB diag scripts don't return error when lacking umad
   permissions.  Returning the error from the head of a shell pipeline is a
   problem, so this fix causes the awk scripts to pass error messages through.
   This will pass all standard error messages.

   This patch needs [ofa-general] [PATCH] infiniband-diags: Add common flags
   -P, -C, and -t (posted Tue Jul 31 13:39:27 PDT 2007) applied first.
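
   The pipeline problem described above can be shown with a small sketch.
   It is only an illustration, not code from the patch: fake_tool and its
   message are hypothetical stand-ins for a diag tool that fails, and the
   exit status 3 is arbitrary.

```shell
#!/bin/sh
# Hypothetical stand-in for a diag tool that fails at the head of a pipeline.
fake_tool() {
	echo "ibwarn: example failure message"
	return 3
}

# In a plain pipeline only the status of the last command (awk) is reported,
# so the failure of fake_tool is lost.
fake_tool | awk '{ print }' >/dev/null
pipe_status=$?

# Capture the output and the exit status first, then feed the text to awk.
# The status of an assignment is the status of its command substitution.
text="`fake_tool`"
rv=$?

# Let awk pass tool messages through, as the patterns added by the patch do.
passed=`echo "$text" | awk '/ibwarn:/ { print $0 }'`

echo "pipeline status: $pipe_status, captured status: $rv"
echo "$passed"
```

   Without an option such as bash's pipefail, the plain pipeline is expected
   to report success while the captured status keeps the tool's failure code,
   which is why the scripts save $? before piping the text into awk.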

Signed-off-by: David A. McMillen <davem at systemfabricworks.com>
---
 infiniband-diags/scripts/ibcheckerrors.in    |   11 +++++++++--
 infiniband-diags/scripts/ibcheckerrs.in      |   13 ++++++++++---
 infiniband-diags/scripts/ibchecknet.in       |   16 ++++++++++++++--
 infiniband-diags/scripts/ibcheckport.in      |   11 +++++++++--
 infiniband-diags/scripts/ibcheckportstate.in |   11 +++++++++--
 infiniband-diags/scripts/ibcheckportwidth.in |   11 +++++++++--
 infiniband-diags/scripts/ibcheckstate.in     |   10 +++++++++-
 infiniband-diags/scripts/ibcheckwidth.in     |   10 +++++++++-
 infiniband-diags/scripts/ibclearcounters.in  |   10 +++++++++-
 infiniband-diags/scripts/ibclearerrors.in    |   10 +++++++++-
 infiniband-diags/scripts/ibdatacounters.in   |   11 +++++++++--
 infiniband-diags/scripts/ibdatacounts.in     |   11 +++++++++--
 infiniband-diags/scripts/ibhosts.in          |    9 ++++++++-
 infiniband-diags/scripts/ibrouters.in        |    9 ++++++++-
 infiniband-diags/scripts/ibswitches.in       |    9 ++++++++-
 15 files changed, 138 insertions(+), 24 deletions(-)

diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in
index 01c7a99..ebf44ec 100644
--- a/infiniband-diags/scripts/ibcheckerrors.in
+++ b/infiniband-diags/scripts/ibcheckerrors.in
@@ -73,7 +73,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 }
@@ -129,10 +131,15 @@ function check_node(lid)
 			}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "##          %d ports checked, %d ports have errors beyond threshold\n", nports, pcnterr
 	exit (ne + pcnterr)
 }
 '
-exit $?
+exit $rv
diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in
index 99d45cd..aa29525 100644
--- a/infiniband-diags/scripts/ibcheckerrs.in
+++ b/infiniband-diags/scripts/ibcheckerrs.in
@@ -151,9 +151,11 @@ else
 	fi
 fi
 
-nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
+nodename=`$IBPATH/smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
 
-if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
+text="`eval $IBPATH/perfquery $ca_info $lid $portnum`"
+rv=$?
+if echo "$text" | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
 function blue(s)
 {
 	if (brief == "yes") {
@@ -184,6 +186,11 @@ BEGIN {
 
 /^CounterSelect/ {next}
 
+/^ib/  {print $0}
+/ibpanic:/     {print $0}
+/ibwarn:/      {print $0}
+/iberror:/     {print $0}
+
 /^PortSelect/	{ if ($2 != '$portnum') {err = err "error: lid '$lid' port " $2 " does not match query ('$portnum')\n"; exit -1}}
 
 $1 ~ "(Xmt|Rcv)(Pkts|Data)" { next }
@@ -201,7 +208,7 @@ END {
 		exit -1
 	}
 	exit 0
-}' 2>&1 ; then
+}' 2>&1 && test $rv -eq 0 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Error check on lid $lid ($nodename) port $portname: "
 		green OK
diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in
index e2f7fb8..a47ab8e 100644
--- a/infiniband-diags/scripts/ibchecknet.in
+++ b/infiniband-diags/scripts/ibchecknet.in
@@ -65,7 +65,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 	pe=0
@@ -130,6 +132,11 @@ function check_node(lid)
 			}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "##          %d ports checked, %d bad ports found\n", nports, pe
@@ -137,4 +144,9 @@ END {
 	exit (ne + pe + pcnterr)
 }
 '
-exit $?
+av=$?
+if [ $av -ne 0 ] ; then
+	exit $av
+else
+	exit $rv
+fi
diff --git a/infiniband-diags/scripts/ibcheckport.in b/infiniband-diags/scripts/ibcheckport.in
index 3c7c396..94cfc6c 100644
--- a/infiniband-diags/scripts/ibcheckport.in
+++ b/infiniband-diags/scripts/ibcheckport.in
@@ -89,7 +89,9 @@ else
 fi
 
 
-if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`"
+rv=$?
+if echo "$text" | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
@@ -114,6 +116,11 @@ function blue(s)
 
 #/^LocalPort/	{ if ($2 != '$portnum') {err = err "#error: port " $2 " does not match query ('$portnum')\n"; exit -1}}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	if (err != "") {
 		blue(err)
@@ -124,7 +131,7 @@ END {
 		exit -1
 	}
 	exit 0
-}' 2>&1 ; then
+}' 2>&1 && test $rv -eq 0 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Port check lid $lid port $portnum: "
 		green "OK"
diff --git a/infiniband-diags/scripts/ibcheckportstate.in b/infiniband-diags/scripts/ibcheckportstate.in
index f3a5f05..2931f06 100644
--- a/infiniband-diags/scripts/ibcheckportstate.in
+++ b/infiniband-diags/scripts/ibcheckportstate.in
@@ -89,7 +89,9 @@ else
 fi
 
 
-if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`"
+rv=$?
+if echo "$text" | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
@@ -106,6 +108,11 @@ function blue(s)
 
 /^LinkState/{ if ($2 != "Active") warn = warn "#warn: Logical link state is " $2 "  lid '$lid' port '$portnum'\n"}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	if (err != "") {
 		blue(err)
@@ -116,7 +123,7 @@ END {
 		exit -1
 	}
 	exit 0
-}' 2>&1 ; then
+}' 2>&1 && test $rv -eq 0 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Port check lid $lid port $portnum: "
 		green "OK"
diff --git a/infiniband-diags/scripts/ibcheckportwidth.in b/infiniband-diags/scripts/ibcheckportwidth.in
index fdc75d1..84f1ef7 100644
--- a/infiniband-diags/scripts/ibcheckportwidth.in
+++ b/infiniband-diags/scripts/ibcheckportwidth.in
@@ -89,7 +89,9 @@ else
 fi
 
 
-if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' '
+text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`"
+rv=$?
+if echo "$text" | awk -v mono=$bw -F '[.:]*' '
 function blue(s)
 {
 	if (mono)
@@ -104,6 +106,11 @@ function blue(s)
 /^LinkWidthSupported/{ if ($2 != "1X") { next } }
 /^LinkWidthActive/{ if ($2 == "1X") warn = warn "#warn: Link configured as 1X  lid '$lid' port '$portnum'\n"}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	if (err != "") {
 		blue(err)
@@ -114,7 +121,7 @@ END {
 		exit -1
 	}
 	exit 0
-}' 2>&1 ; then
+}' 2>&1 && test $rv -eq 0 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Port check lid $lid port $portnum: "
 		green "OK"
diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in
index 944e139..6ce0854 100644
--- a/infiniband-diags/scripts/ibcheckstate.in
+++ b/infiniband-diags/scripts/ibcheckstate.in
@@ -67,7 +67,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 	pe=0
@@ -120,8 +122,14 @@ function check_node(lid)
 		}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "##          %d ports checked, %d ports with bad state found\n", nports, pe
 }
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in
index 8ad0f7f..f8f6a8b 100644
--- a/infiniband-diags/scripts/ibcheckwidth.in
+++ b/infiniband-diags/scripts/ibcheckwidth.in
@@ -67,7 +67,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 	pe=0
@@ -120,8 +122,14 @@ function check_node(lid)
 		}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "##          %d ports checked, %d ports with 1x width in error found\n", nports, pe
 }
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in
index b3c009e..1818c42 100644
--- a/infiniband-diags/scripts/ibclearcounters.in
+++ b/infiniband-diags/scripts/ibclearcounters.in
@@ -61,7 +61,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 
 function clear_counters(lid)
 {
@@ -100,7 +102,13 @@ function clear_port_counters(lid, port)
 			}
 		}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes cleared %d errors\n", nnodes, nodeerr
 }
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in
index 097c3fe..c63283a 100644
--- a/infiniband-diags/scripts/ibclearerrors.in
+++ b/infiniband-diags/scripts/ibclearerrors.in
@@ -61,7 +61,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 
 function clear_errors(lid, port)
 {
@@ -93,7 +95,13 @@ function clear_errors(lid, port)
 			}
 		}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes cleared %d errors\n", nnodes, nodeerr
 }
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in
index bee9bd8..902a865 100644
--- a/infiniband-diags/scripts/ibdatacounters.in
+++ b/infiniband-diags/scripts/ibdatacounters.in
@@ -73,7 +73,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 }
@@ -128,10 +130,15 @@ function check_node(lid)
 			}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "##          %d ports checked\n", nports
 	exit (ne )
 }
 '
-exit $?
+exit $rv
diff --git a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in
index 927a978..bbdff71 100644
--- a/infiniband-diags/scripts/ibdatacounts.in
+++ b/infiniband-diags/scripts/ibdatacounts.in
@@ -108,7 +108,9 @@ fi
 
 nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"`
 
-if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
+text="`eval $IBPATH/perfquery $ca_info $lid $portnum`"
+rv=$?
+if echo "$text" | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
 function blue(s)
 {
 	if (brief == "yes") {
@@ -128,6 +130,11 @@ function blue(s)
 
 /^CounterSelect/ {next}
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 /^PortSelect/	{ if ($2 != '$portnum') {err = err "error: lid '$lid' port " $2 " does not match query ('$portnum')\n"; exit -1}}
 
 $1 ~ "(Xmt|Rcv)(Pkts|Data)" { print $1 ":........................." $2 }
@@ -142,7 +149,7 @@ END {
 		exit -1
 	}
 	exit 0
-}' 2>&1 ; then
+}' 2>&1 && test $rv -eq 0 ; then
 	if [ "$verbose" = "yes" ]; then
 		echo -n "Error on lid $lid ($nodename) port $portname: "
 		green OK
diff --git a/infiniband-diags/scripts/ibhosts.in b/infiniband-diags/scripts/ibhosts.in
index 0d6b1bc..a287edf 100644
--- a/infiniband-diags/scripts/ibhosts.in
+++ b/infiniband-diags/scripts/ibhosts.in
@@ -47,7 +47,14 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 /^Ca/	{print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\
 		substr($0, match($0, "#[ \t]*")+RLENGTH)} 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibrouters.in b/infiniband-diags/scripts/ibrouters.in
index fea72bb..e053794 100644
--- a/infiniband-diags/scripts/ibrouters.in
+++ b/infiniband-diags/scripts/ibrouters.in
@@ -47,7 +47,14 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 /^Rt/	{print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\
 		substr($0, match($0, "#[ \t]*")+RLENGTH)} 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
 '
+exit $rv
diff --git a/infiniband-diags/scripts/ibswitches.in b/infiniband-diags/scripts/ibswitches.in
index 859aacd..0476d0e 100644
--- a/infiniband-diags/scripts/ibswitches.in
+++ b/infiniband-diags/scripts/ibswitches.in
@@ -47,7 +47,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 /^Switch/	{
 			l=$0
 			desc=substr(l, match(l, "#[ \t]*")+RLENGTH)
@@ -69,4 +71,9 @@ eval $netcmd | awk '
 			else
 				print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\
 					desc " " type " " pinfo}
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
 '
+exit $rv
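
For readers skimming the diff above, the change applied to each script can be sketched in isolation: capture the discovery command's output and exit status first, then feed the text to awk, and finally prefer awk's status when it is non-zero. This is only a minimal illustration, not the scripts themselves. fake_discover is a hypothetical stand-in for ibnetdiscover, and its output lines and exit code 3 are made up for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for ibnetdiscover: prints two lines, then fails.
fake_discover() {
	echo 'Ca 1 "H-0002c90200000001" # host1'
	echo 'ibwarn: [1234] mad_rpc: timeout'
	return 3
}

check() {
	# The assignment's status is the status of the command substitution,
	# so $? here is the exit code of fake_discover, not of awk.
	text="`fake_discover`"
	rv=$?
	echo "$text" | awk '
	/^Ca/		{ n++ }
	/ibwarn:/	{ print $0 }
	END		{ printf "## Summary: %d CAs\n", n; exit (n == 0) }
	'
	av=$?
	# Prefer the awk status when it signals a problem, else the command status.
	if [ $av -ne 0 ]; then
		return $av
	fi
	return $rv
}

check
echo "exit status: $?"
```

With a plain pipeline (cmd | awk ...), a POSIX shell reports only the status of the last command, so a failing discovery run would otherwise go unnoticed by the caller.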


From tamura at osrg.net  Tue Jul 31 17:58:51 2007
From: tamura at osrg.net (Yoshiaki Tamura)
Date: Wed, 01 Aug 2007 09:58:51 +0900
Subject: [ofa-general] [PATCH] build_env.sh: remove package checking for
	debian
In-Reply-To: <20070730091020.GH9963@mellanox.co.il>
References: <20070730091020.GH9963@mellanox.co.il>
Message-ID: <46AFDACB.907@osrg.net>

Hi,

The following patch removes some package checks that stop OFED 1.2 from
building on Debian. Although it is not a complete solution for porting OFED
to Debian, it should help some Debian users build it.

Thanks,

Yoshi

Signed-off-by: Yoshi Tamura <tamura at osrg.net>

--- a/build_env.sh	2007-06-21 23:38:41.000000000 +0900
+++ b/build_env.sh	2007-07-31 14:21:50.000000000 +0900
@@ -1790,6 +1790,8 @@
          debian)
          libibverbs_BUILD_REQ=
          libibverbs_devel_BUILD_REQ=
+        libibverbs_RUN_REQ=
+        mstflint_BUILD_REQ=
          ;;
          redhat5)
          # start_udev is required to create /dev/infiniband


From mst at dev.mellanox.co.il  Tue Jul 31 22:41:52 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 08:41:52 +0300
Subject: [ofa-general] Re: RE: [PATCH] OFED 1.2.1 rdma_cm response timeout
	module	parameter
In-Reply-To: <46AF9B7C.7020906@ichips.intel.com>
References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com>
	<46AF9B7C.7020906@ichips.intel.com>
Message-ID: <20070801054152.GB17884@mellanox.co.il>

> Quoting Arlin Davis <ardavis at ichips.intel.com>:
> Subject: Re: RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter
> 
> Sean Hefty wrote:
> 
> >>OFED 1.2 removed the rdma_set_option call used to adjust response
> >>timeout. We are running into some cases on larger clusters that require
> >>longer timeouts then the default. Can you consider this rdma_cm patch for
> >>OFED 1.2.1 that adds a module parameter for the response timeout? Thanks.
> >>   
> >>
> >
> >What's in it for me?  :)
> >
> > 
> >
> >>Signed-off by: Arlin Davis <ardavis at ichips.intel.com>
> >>   
> >>
> >
> >Acked-by: Sean Hefty <sean.hefty at intel.com>
> >
> >Vlad, can you add this for OFED 1.2.1?
> >
> >- Sean
> > 
> >
> 
> Did this get added to 1.2.1?


http://www.openfabrics.org/git/?p=ofed_1_2/linux-2.6.git;a=blob;f=kernel_patches/fixes/cma_response_timeout.patch;hb=HEAD

-- 
MST


From vlad at mellanox.co.il  Tue Jul 31 22:54:29 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 1 Aug 2007 08:54:29 +0300
Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout
	module	parameter
In-Reply-To: <46AF9B7C.7020906@ichips.intel.com>
References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com>
	<46AF9B7C.7020906@ichips.intel.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF14DA@mtlexch01.mtl.com>

> Sean Hefty wrote:
> 
> >>OFED 1.2 removed the rdma_set_option call used to adjust response
> >>timeout. We are running into some cases on larger clusters that require
> >>longer timeouts then the default. Can you consider this rdma_cm patch for
> >>OFED 1.2.1 that adds a module parameter for the response timeout? Thanks.
> >>
> >>
> >
> >What's in it for me?  :)
> >
> >
> >
> >>Signed-off by: Arlin Davis <ardavis at ichips.intel.com>
> >>
> >>
> >
> >Acked-by: Sean Hefty <sean.hefty at intel.com>
> >
> >Vlad, can you add this for OFED 1.2.1?
> >
> >- Sean
> >
> >
> 
> Did this get added to 1.2.1?

Yes,
It is in ofed_1_2/linux-2.6.git (both ofed_1_2 and ofed_1_2_c branches)

commit 020bfb400c759ba89ffb0b13c41f2ca50181aebe
Author: Arlin Davis <ardavis at ichips.intel.com>
Date:   Thu Jul 12 12:01:39 2007 +0300

    OFED 1.2 removed the rdma_set_option call used to adjust response
    timeout. We are running into some cases on larger clusters that
    require longer timeouts then the default.

    Signed-off by: Arlin Davis <ardavis at ichips.intel.com>

Regards,
Vladimir