From vlad at lists.openfabrics.org  Fri May  1 03:22:30 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri,  1 May 2009 03:22:30 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090501-0200 daily build status
Message-ID: <20090501102230.BFDB5E61313@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From arkady.kanevsky at gmail.com  Fri May  1 04:24:36 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 07:24:36 -0400
Subject: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <49FA7C21.1050400@cs.anu.edu.au>
References: <49D2BD00.5010002@cs.anu.edu.au>
	<469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com>
	<517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com>
	<49D30C7F.1050201@cs.anu.edu.au>
	<469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com>
	<49FA7C21.1050400@cs.anu.edu.au>
Message-ID: <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com>

Jie, it sounds to me that either the variable is not declared volatile or a
compiler optimization is causing the problem. I would check for these first.
Arkady
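
A minimal sketch of the compiler-side issue raised above, with a local
thread standing in for whatever updates the memory. If the spin variable is
a plain int, the compiler may read it once and loop forever on the cached
value. The sketch swaps in a C11 atomic load instead of a bare volatile
(a volatile int is the smaller change named above, but atomics also avoid a
data race between the two threads). The names target, writer,
spin_until_set and run_spin_demo are hypothetical. This says nothing about
when an adapter's write becomes visible to the CPU, which is the separate
concern in the quoted reply further down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical flag standing in for the memory the remote side spins on. */
static atomic_int target;

/* Hypothetical writer thread; stands in for the update arriving from elsewhere. */
static void *writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&target, 1, memory_order_release);
    return NULL;
}

/* Spin until the flag becomes nonzero; returns the value observed. */
static int spin_until_set(void)
{
    int v;
    while ((v = atomic_load_explicit(&target, memory_order_acquire)) == 0)
        ;   /* every iteration performs a real load; it cannot be hoisted */
    return v;
}

/* Runs one writer/spinner round and returns the observed value (-1 on error). */
int run_spin_demo(void)
{
    pthread_t t;

    atomic_store(&target, 0);
    if (pthread_create(&t, NULL, writer, NULL) != 0)
        return -1;
    int seen = spin_until_set();
    pthread_join(t, NULL);
    assert(seen != 0);   /* the loop only exits on a nonzero value */
    return seen;
}
```

Build with something like `cc -std=c11 -pthread` and call run_spin_demo()
from a test program.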

On Fri, May 1, 2009 at 12:35 AM, Jie Cai <Jie.Cai at cs.anu.edu.au> wrote:

> Thanks for the help with my understanding of IB and uDAPL.
>
> Is it possible to spin on the remote memory targeted by an RDMA atomic
> compare and swap (dat_ib_post_cmp_and_swap())?
>
> I have written a program in which the initiator atomically compare-and-swaps
> a value into remote memory.
>
> Instead of the initiator sending a message to notify the remote side that
> the cmp_swap has completed, the remote side spins on memory to test for
> the update (e.g. while(target == 0);).
>
> However, on the remote side this while loop runs forever, even though the
> initiator has already received DAT_IB_DTO_EVENT.
>
> I don't really understand what's going on, or what the correct way would be
> to spin on memory to check for remote write updates.
>
> Any suggestions?
>
> Regards,
> Jie
>
> --
> Jie Cai
>
>
>
>
> Caitlin Bestler wrote:
>
>> On Tue, Mar 31, 2009 at 11:41 PM, Jie Cai <Jie.Cai at cs.anu.edu.au> wrote:
>>
>>
>>> Understood now. A further question follows.
>>>
>>> To implement a software-level acknowledgment that informs the initiator
>>> that the data is available at the remote side, is it possible to use a
>>> busy loop on the remote side to detect that the last element of the
>>> transfer has appeared in memory?
>>>
>>> Or does the remote side have to wait for the recv event matching the
>>> initiator's send, and then send a message back to the initiator as an
>>> acknowledgment?
>>>
>>>
>>>
>>
>> There are two issues when spinning on a remote memory update.
>>
>> The first is that packets may be received and processed out of order,
>> especially for iWARP. Therefore the fact that the last byte has been
>> received and placed does not guarantee that the prior packets have
>> been received and placed.
>>
>> More importantly, the order in which updates become visible to a
>> specific software thread can make the order of updates unpredictable
>> to the application.
>>
>> When delivering a completion the Provider is responsible for dealing
>> with both of these problems. So when you reap a completion from the
>> CQ, the operation it represents (and all prior operations) are complete.
>> There are no gaps in received packets, nothing is still sitting on an
>> Adapter buffer waiting to be placed in host memory.
>>
>> If your application does not want to block you can consider polling
>> the cq rather than enabling notifications. But polling memory locations
>> directly should only be done when you're willing to have bus/adapter
>> specific dependencies. Your working code might stop working when
>> your network changes, or you install a new Adapter that has a different
>> strategy for optimizing its writes over the PCIe bus.
>>
>>
>


-- 
Cheers,
Arkady Kanevsky

From tziporet at mellanox.co.il  Fri May  1 04:33:32 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Fri, 1 May 2009 14:33:32 +0300
Subject: [ofa-general] OFED 1.4.1-rc4  is available
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>


Hi,

OFED-1.4.1-rc4  release is available on

http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc4.tgz 
To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/  for
OFED 1.4.1

Vladimir & Tziporet

========================================================================


Release information:
------------------------------
Linux Operating Systems:
      - RedHat EL4 up4:    2.6.9-42.ELsmp         *
      - RedHat EL4 up5:    2.6.9-55.ELsmp
      - RedHat EL4 up6:    2.6.9-67.ELsmp
      - RedHat EL4 up7:    2.6.9-78.ELsmp
      - RedHat EL5:        2.6.18-8.el5
      - RedHat EL5 up1:    2.6.18-53.el5
      - RedHat EL5 up2:    2.6.18-92.el5
      - RedHat EL5 up3:    2.6.18-128.el5
      - OEL 4.5:           2.6.9-55.ELsmp
      - OEL 5.2:           2.6.18-92.el5
      - CentOS 5.2:        2.6.18-92.el5
      - Fedora C9:         2.6.25-14.fc9          *
      - SLES10:            2.6.16.21-0.8-smp
      - SLES10 SP1:        2.6.16.46-0.12-smp
      - SLES10 SP1 up1:    2.6.16.53-0.16-smp
      - SLES10 SP2:        2.6.16.60-0.21-smp
      - SLES11 GA:         2.6.27.13-1-default
      - OpenSuSE 10.3:     2.6.22.5-31            *
      - kernel.org:        2.6.26 and 2.6.27

    * Minimal QA for these versions

Systems:
      * x86_64
      * x86
      * ia64
      * ppc64


Main Changes from OFED-1.4.1-rc2
==========================
- 22 bugs fixed (see attachment)
- Attached kernel git tree changes for details


Tasks that should be completed for RC4 (Apr 20):
====================================
1. High priority bug fixes - see list below
2. Documentation update

Open bugs:
========
bug_id  bug_severity  op_sys  assigned_to                  short_short_desc
1607    blo           SLES    Jeffrey.C.Becker at nasa.gov  kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
1616    cri           RHEL    jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
1571    cri           RHEL    vu at mellanox.com             nfsrdma server crash @test5 connectathon basic test,

Note: I saw some mails that some of these bugs are fixed but since they
are still open in bugzilla I report them here.
Please update bugzilla with any fixed bug.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc3_rc4.log
Type: application/octet-stream
Size: 23602 bytes
Desc: ofed-1.4.1-rc3_rc4.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090501/7c545b92/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc4-fixed-bugs.csv
Type: application/octet-stream
Size: 2454 bytes
Desc: ofed-1.4.1-rc4-fixed-bugs.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090501/7c545b92/attachment-0001.obj>

From jsquyres at cisco.com  Fri May  1 04:56:48 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 07:56:48 -0400
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <20090430222230.GF32114@obsidianresearch.com>
References: <20090429215508.GW4431@obsidianresearch.com>
	<C61E2CCC.4A57%bwbarre@sandia.gov>
	<20090429222125.GX4431@obsidianresearch.com>
	<1241044080.3403.374.camel@chromite.mv.qlogic.com>
	<20090429224411.GC32114@obsidianresearch.com>
	<23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com>
	<20090430222230.GF32114@obsidianresearch.com>
Message-ID: <4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>

On Apr 30, 2009, at 6:22 PM, Jason Gunthorpe wrote:

> After reading all the postings, I think my idea to fix the verbs API
> to not, essentially, corrupt an existing registration when the virtual
> address space changes is the best bet. This slightly changes the
> semantics of the verbs MR to refer to virtual address space within the
> process, not the underlying object(s) that happen to be mapped there
> when the registration is made.
>

I'm not sure how this helps MPI -- our registration caches will still  
become invalid if the MPI app free()'s registered memory...?

MPI maintains a registration cache because registration is so  
expensive.  Even if the registration cache becomes "safely" invalid  
(e.g., you'll never get a scenario where one virtual address could  
have previously pointed to a different hardware address within the  
span of one process), it doesn't help.
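
A toy sketch of why such a cache goes stale, assuming nothing about any real
MPI or verbs implementation. The cache is keyed on the virtual address, so it
keeps returning the old handle until something tells it the range was freed;
that notification is exactly what the malloc/free hooks try to provide.
fake_register, cache_get, cache_invalidate and CACHE_SLOTS are hypothetical
names, and fake_register only stands in for the expensive registration call.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8   /* hypothetical size, chosen only for the sketch */

/* Hypothetical cache entry: a virtual range plus a made-up handle. */
struct reg_entry {
    const void *addr;
    size_t len;
    int handle;
    int valid;
};

static struct reg_entry cache[CACHE_SLOTS];
static int registrations;   /* counts how often the "expensive" path ran */

/* Stand-in for the expensive registration call (not a real verbs call). */
static int fake_register(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return ++registrations;
}

/* Return a handle for [addr, addr+len), registering only on a cache miss. */
int cache_get(const void *addr, size_t len)
{
    int free_slot = -1;

    assert(addr != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].handle;   /* hit: stale if the app freed the range */
        if (!cache[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        free_slot = 0;                /* simplistic eviction for the sketch */
    cache[free_slot] = (struct reg_entry){ addr, len, fake_register(addr, len), 1 };
    return cache[free_slot].handle;
}

/* Must run whenever the range is freed or unmapped. */
void cache_invalidate(const void *addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr)
            cache[i].valid = 0;
}

/* Number of times the registration path has run so far. */
int cache_registrations(void)
{
    return registrations;
}
```

If the application frees a buffer and cache_invalidate() is never called, a
later allocation that happens to land on the same address gets the old
handle back from cache_get().
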
>
> MPIs can choose to continue to hook malloc/free/etc or not, it doesn't
>

No no no!  We don't want to continue hooking this stuff.  The hooks  
are horrible, horrible, horrible -- there's real-world apps that break  
them.

> > While MPI is currently the biggest victim, this broken memory management
> > model is also an enormous roadblock for any other application or ULP to
> > write to verbs.
>
> I'm not sure this is true.. The purpose built verbs apps I've worked
> on don't have a problem like MPI, and managing the memory registration
> was not hard.


Ok, I'll back off slightly: if you want verbs to go mainstream, there  
will be many other ULPs / middleware libraries that have memory models  
like MPI's (that the upper layer is responsible for allocating/freeing  
message buffers).  Put differently: the TCP/sockets stack doesn't have  
this restriction; it will be extremely difficult to convert legions of  
sockets programmers to verbs if you effectively restrict large  
messages to only be allocated/freed by the network layer (kinda  
defeats the point of RDMA if you have to copy large messages, right?).

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Fri May  1 05:11:42 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:11:42 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
References: <C61F4144.4A92%bwbarre@sandia.gov>
	<659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
Message-ID: <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>

On Apr 30, 2009, at 5:25 PM, arkady kanevsky wrote:

> Are the MPI applications that are broken the ones which use malloc/free
> instead of MPI_ALLOC calls?

Yes.

-- 
Jeff Squyres
Cisco Systems


From arkady.kanevsky at gmail.com  Fri May  1 05:25:29 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 08:25:29 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
References: <C61F4144.4A92%bwbarre@sandia.gov>
	<659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
	<0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
Message-ID: <517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>

Jeff, what if we provide a script which converts all malloc/free calls into
MPI ones and moves MPI_INIT before any memory allocation?
Will these application users be willing to do the conversion?
Will it fix all the problems or are there some loose ends?
Arkady

On Fri, May 1, 2009 at 8:11 AM, Jeff Squyres <jsquyres at cisco.com> wrote:

> On Apr 30, 2009, at 5:25 PM, arkady kanevsky wrote:
>
>> Are the MPI applications that are broken the ones which use malloc/free
>> instead of MPI_ALLOC calls?
>>
>
> Yes.
>
> --
> Jeff Squyres
> Cisco Systems
>
>


-- 
Cheers,
Arkady Kanevsky

From jsquyres at cisco.com  Fri May  1 05:48:39 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:48:39 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<C61F6F4D.4AB3%bwbarre@sandia.gov>
	<382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
Message-ID: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>

On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:

> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.

Absolutely not.  As Brian stated, we have cited some real-world  
problems that we cannot fix (and we have tried many, many different  
workarounds over the past few years to fix them).

It sounds like your main objection to fixing them is "it's too much  
work."  :-(

> There's an application at Sandia and one at Los Alamos, both of which
> cause problems for our linker tricks.  This leads to such things as
> (proven) silent data corruption.
>

There are other apps that have also been reported over the years.  C++  
apps with their own allocators are especially problematic.  Abaqus had  
to change their memory allocation model several years ago to be able  
to work around these issues.  These memory models also break valgrind,  
purify, and other memory-checking debuggers.

> Have you tried these applications with any MPI other than OpenMPI?
> i.e., does this corruption happen with Intel MPI and other MPIs as
> well?
>

We have been trying to say that this is a general problem that there  
currently is no guaranteed fix for.  There's always a way to break the  
MPI workarounds for verbs' broken memory management model because  
there's no way to guarantee the memory allocation hooks.

There are two main reasons to fix these issues:

1. Business: to attract network programmers to verbs (and therefore to  
attract applications and therefore increase market share), it has to  
be simpler and within reach of today's commodity sockets-level  
programmers.  Forcing them to have registration caches and to do  
memory allocation hooking significantly raises the bar.  To date, this  
has been shunned by all network programmers except HPC and a handful  
of storage protocols.

2. Technical: if OFED says "to get good performance with verbs, you  
have to do malloc/mmap/etc. hooks and have a registration cache," this  
unnecessarily *significantly* raises the education and code complexity  
barrier to entry for verbs programmers.  It's also un-scaleable -- if  
this is something you *have* to do for good performance, why doesn't  
the network stack do it?  It seems weird that you would effectively  
force all ULPs/MPIs/applications to implement the same functionality.   
The memory allocation hooking model also fails if more than one  
verbs-based middleware is used in the same application (because only one  
will be able to use the memory hooks per process).

Here's a story that encompasses both reasons:

We had Open MPI *not* use the registration cache by default for a long  
time because of the danger it posed to applications.  Users could  
activate the registration cache with a simple command line parameter.   
But nobody would do that -- they wanted to run with top performance  
right out of the box (which is not unreasonable).  It also led to  
OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look,  
Open MPI's performance is bad!  Our MPI's performance is GREAT!"  Open  
MPI therefore was forced to change its defaults in the 1.3 series to  
activate the [dangerous] memory registration cache by default.

You mentioned that doing this stuff is a choice; the choice that MPI's/ 
ULPs/applications therefore have is:

- don't use registration caches/memory allocation hooking, have  
terrible performance
- use registration caches/memory allocation hooking, have good  
performance

Which is no choice at all.  If customers pay top dollar for these  
networks, they want to see benchmarks run out of the box that show  
that they're getting every flop/byte-per-second that they can.  The  
fact that the programming model is needlessly complicated (and  
dangerous) to get that performance is something that the MPI's have  
tolerated because we had to for competition's sake.

This is not something that non-HPC customers will accept.

> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.
>

Note that Jason G. said in this thread: "Notifiers are going to be  
very troublesome, every time any sort of synchronous to user space  
notifier has been proposed or implemented in the kernel it has been a  
disaster."

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Fri May  1 05:56:58 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:56:58 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>
References: <C61F4144.4A92%bwbarre@sandia.gov>
	<659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
	<0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
	<517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>
Message-ID: <96A68779-ED89-49BB-9C29-B9BB33221FD6@cisco.com>

On May 1, 2009, at 8:25 AM, arkady kanevsky wrote:

> What if we provide a script which converts all malloc/free
> calls into MPI ones and moves MPI_INIT before any memory allocation?
> Will these application users be willing to do the conversion?

We've been trying to educate MPI application developers for 10  
years.  :-)

If you think a script will help, go for it.  :-)

Sorry; I'm not trying to be snide -- this thread is getting  
increasingly frustrating.  No, I don't think it will help for a few  
reasons:

- MPI's already support malloc/etc. buffers; changing that now would  
be a big change -- based on this one network stack.

- MPI's are competitive.  If one MPI forces the use of MPI_ALLOC_MEM,  
then others will say "you should use my MPI because then you don't  
have to change your code to use MPI_ALLOC_MEM."  Because we're  
ultimately competing for customers' dollars, MPI's actively try to  
make programming/using their product as easy as possible.

- Fortran is always problematic.  I haven't thought through the  
problems there, but I know of many apps that have huge arrays declared  
statically (which the fortran compiler gets from the heap, not the  
stack).  Forcing them to change to F90-style pointers would never  
happen.

- I cited earlier in the thread MPI-based middleware that could do  
MPI_ALLOC_MEM (potentially plus a copy) for short messages, but likely  
re-uses application buffers directly for large messages because the  
copy cost would be too much.  Specifically: if MPI is not the top  
level middleware in an application -- some other middleware is  
fronting the network stack, like a computational library or somesuch  
-- they might have to make exactly the same compromises (e.g.,  
application buffers are too large, so let's just use those instead of  
MPI_ALLOC_MEM+copy).
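
The compromise in the last point can be sketched roughly as follows, under
stated assumptions: COPY_THRESHOLD is a made-up cutoff, bounce stands in for
a buffer obtained once from MPI_ALLOC_MEM, and choose_send_path is a
hypothetical helper, not part of any MPI. Short messages are copied into the
pre-allocated buffer; large ones are handed over in place, which is the case
that drags in registration of application memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; real middleware would tune this per network. */
#define COPY_THRESHOLD 4096

/* Stand-in for a buffer obtained once from MPI_ALLOC_MEM. */
static unsigned char bounce[COPY_THRESHOLD];

enum send_path { SEND_COPIED, SEND_IN_PLACE };

/* Choose how to hand len bytes at app_buf to the network layer.
 * On return, *wire_buf points at the bytes that would be sent. */
enum send_path choose_send_path(const void *app_buf, size_t len,
                                const void **wire_buf)
{
    assert(app_buf != NULL && wire_buf != NULL);

    if (len <= COPY_THRESHOLD) {
        memcpy(bounce, app_buf, len);   /* copy is cheap for short messages */
        *wire_buf = bounce;
        return SEND_COPIED;
    }
    *wire_buf = app_buf;                /* too large to copy: send in place */
    return SEND_IN_PLACE;
}
```

The in-place branch is where the application's own malloc'd memory ends up
registered, so it is the branch that depends on the registration cache
staying valid.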

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Fri May  1 06:07:40 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:07:40 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <C61F7DF4.4ABE%bwbarre@sandia.gov>
References: <C61F7DF4.4ABE%bwbarre@sandia.gov>
Message-ID: <EDC68D6F-BDF9-4889-9CA4-E654BAFEEB2D@cisco.com>

On Apr 30, 2009, at 6:10 PM, Barrett, Brian W wrote:

> I'm done now.  You don't want to fix your crap, that's fine.  Just don't be
> surprised by the continued "why you shouldn't use IB" presentations from
> people who have to write applications to it.
>


Let's not forget that Brian is not only an MPI developer (i.e., a  
network programmer), he's also a customer.

If OpenFabrics only wants the HPC market, you can probably ignore this  
entire thread.  The OpenFabrics-based MPI's will hobble along like  
they have been.  If you want larger markets, it's probably pretty safe  
to assume that Brian's reactions are going to be quite similar to  
enterprise network programmers.

To be clear: it's not just verbs education (books, tutorials, FAQ's,  
etc.) that is required to win the hearts and minds of enterprise  
network programmers.  You also need a network API that is no more  
complex than common sockets usage.    Verbs -- and the additional  
baggage that it requires for performance, like registration caches and  
memory allocation hooking -- does not currently meet this requirement.

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Fri May  1 06:17:44 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:17:44 -0400
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
Message-ID: <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>

I'd also like to call the IWG's and MWG's attention to the other  
thread currently running on the general list: "New proposal for memory  
management."

There are many points in there about attracting non-HPC / enterprise  
network programmers to write verbs-based applications.  It's not just  
documentation / education that is missing -- having a series of FAQs  
and tutorials about verbs programming is not enough.  You need a  
network programming API that is no more complex than common sockets  
usage.

Specifically: let's not forget that HPC (OF's biggest market right  
now) tends to attract network programmers with PhD's, and/or who are  
among the top programming talent in the world (yes, that's being  
snobbish -- but it's still true).  To make OF within reach of the  
masses, you want to lower the bar so that legions of sockets-based  
network programmers can hope to learn/use this stuff without requiring  
them to get a PhD first.


On Apr 30, 2009, at 6:12 PM, Ryan, Jim wrote:

> At the risk of piling on, I think what Lloyd is suggesting is very  
> important. The objections I continue to hear about programming using  
> RDMA are along the lines of "it's too hard" or "no one knows how to  
> do it".
>
> It occurs to me if we could provide some concise instruction, that,  
> coupled with the undeniable benefits of RDMA, could provide a  
> compelling package for "RDMA for the masses"
>
> thanks, Jim
>
> From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org 
> ] On Behalf Of Lloyd Dickman
> Sent: Thursday, April 30, 2009 1:17 PM
> To: arkady kanevsky; bill.boas at openfabrics.org
> Cc: iwg at lists.openfabrics.org; Paul Grun; OFA at lists.openfabrics.org;  
> Paul Gray; Working Group; Wayne Augsburger; Andy Grover; Richard  
> Frank;Jeff at lists.openfabrics.org; Squyres; Mikkel Hagen; Scott at lists.openfabrics.org 
> ; general at lists.openfabrics.org; Friedman; bobs at voltaire.com;  
> Sumanta Chatterjee;asafs`@voltaire.com; Roland Dreier
> Subject: RE: [mwg] Re: RDMA tutorial and OFA
>
> I support the idea of the RDMA tutorial.  Beyond the “meat” as  
> described below, I would encourage the tutorial to include a “how to  
> program RDMA” section.  While OFA Verbs provides a rich set of  
> mechanisms, it is difficult for the average programmer to get a  
> solid handle on how to use the capabilities, register memory, …   
> Some cookbook examples, or perhaps development of several  
> programming “patterns” can go a long way to having RDMA become a  
> much more mainstream application programming paradigm.
>
> Lloyd
>
> From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org 
> ] On Behalf Of arkady kanevsky
> Sent: Thursday, April 30, 2009 11:27 AM
> To: bill.boas at openfabrics.org
> Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing  
> Working Group; Wayne Augsburger; Andy Grover; Richard Frank; asafs`@voltaire.com 
> ; Jeff Squyres; Mikkel Hagen;general at lists.openfabrics.org; Scott  
> Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier
> Subject: [mwg] Re: RDMA tutorial and OFA
>
> Keep me in the loop.
> I am interested to do it also.
> Thanks,
> Arkady
> On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas  
> <Bill.Boas at openfabrics.org> wrote:
> Richard, Andy,
>
> Thanks for copying me, Richard. I had not seen Andy's email on the general
> list.
>
> Figuring out how to get tutorial and other documentation created and
> published is in the list of things to get done in 2009 for me in my
> part-time role as Exec. Dir.
>
> There is no funding set up for this at the moment but I believe there will
> be in about 30 days.
>
> That's because I'm thinking that we can get funding for this by making it
> part of the funding for a new marketing plan for OFA that, with Wayne
> Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at
> the next con-call meeting which is on May 20 at 9.00AM PDT.
>
> Would you be willing to work with me and create a small team from others
> within OFA who have the same interest to prepare a description by May 20 of
> what the tutorial would look like, who would contribute to it, how to get it
> "polished up" for web and/or book style publication, what the overall costs
> would be, etc.?
>
> My thoughts, that could be a starting point for the team's work, are that we
> would make the creation a collective effort.
>
> The tutorial would have several sections, for example general intro, benefits
> of RDMA, applicability in HPC and Enterprise, networking background etc.
> Members of the Marketing Working Group would be responsible for this.
>
> The "meat" would be sections for kernel level things (verbs etc.), then user
> space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section
> overseen by the technical leaders/maintainers of the code within OFA for
> that section (for example Tom Talpey for NFSoRDMA, or you Richard for RDS).
>
> Finally the tutorial would have sections about Interoperability Testing that
> OFA/IOL does but also what customers can do on their own systems - Arkady
> and Rupert and IOL have put in an SC09 tutorial proposal that we could
> leverage in this section.
>
> To all readers of this email:-
> If you have read this far, please give us all some feedback. If you have
> material you'd like to contribute please say so. If there's a better way,
> tell us what you think it is!
>
> Thanks,
>
> Bill.
>
> Bill Boas
> Executive Director and Vice Chair
> OpenFabrics Alliance
> 510-375-8840
> Bill.Boas at openfabrics.org
> www.openfabrics.org
>
> -----Original Message-----
> From: Richard Frank [mailto:richard.frank at oracle.com]
> Sent: Wednesday, April 29, 2009 12:58 PM
> To: Andy Grover
> Cc: Bill Boas; Sumanta Chatterjee
> Subject: Re: RDMA tutorial and OFA
>
> Andy, I saw your postings to ofa-general on this and I agree it would be
> great to have this documentation.
>
> As OpenFabrics is really about RDMA... we need to make it simpler
> for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB
> specs and code examples, etc.
>
> Let's see what Bill Boas thinks...perhaps OFA has a writer on board that
> can help us do this..?
>
> I can also help provide input for a new OFA RDMA tutorial doc..
>
> Rick
>
> Andy Grover wrote:
> > Hi Rick,
> >
> > Are you around for a brief chat this afternoon? I have a crazy idea that
> > involves OFA doing something (or putting up $$) and I wanted to see what
> > you thought, since you're Oracle's OFA rep, right?
> >
> > -- Andy
> >
> >
>
>
>
> -- 
> Cheers,
> Arkady Kanevsky


-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Fri May  1 06:26:18 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:26:18 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
References: <C6E9B5EC-C922-4675-8469-3D7C5AB4C9BE@cisco.com>
	<loom.20090430T052134-692@post.gmane.org>
	<48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com>
	<3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
Message-ID: <A60A8DC7-EEF9-414C-AB49-57B96ED6C15E@cisco.com>

On Apr 30, 2009, at 7:20 PM, Aaron Fabbri (aafabbri) wrote:

>> Yes, MPI_ALLOC_MEM / MPI_FREE_MEM calls have been around for
>> a long time (~10 years?).  Using them does avoid many of the
>> problems that have been discussed.  Most (all?) MPI's either
>> support ALLOC_MEM / FREE_MEM by registering at allocation
>> time and unregistering at free time, or some variation of that.
>
> Ah.  Are there any problems that are not addressed by having MPI own  
> allocation of network bufs?

Sure, there's lots of them.  :-)  But this thread is just about the  
memory allocation management issues.

> (BTW registering for each allocation could be improved, I think.)

Probably so.  Since so few MPI applications use these calls, OMPI  
hasn't really bothered to tune them.

>> But unfortunately, very few MPI apps use these calls; they use
>> malloc() and friends instead.  Or they're written in Fortran,
>> where such concepts are not easily mapped (don't
>> underestimate how much Fortran MPI code runs on verbs!).
>> Indeed, in some layered scenarios, it's not easy to use these
>> calls (e.g., if an MPI-enabled computational library may
>> re-use user-provided buffers because they're so large, etc.).
>
> I understand the difficulty.  A couple possible counterpoints:
>
> 1. Make the next version of MPI spec *require* using the mpi_alloc
> stuff.

The MPI Forum (the standards body) has been very resistant to this,  
especially based on the requirements of one not-pervasive network  
stack.  It would effectively break all legacy MPI applications, too.   
I seriously doubt that the Forum would go for that.

FWIW: the way the MPI spec is worded, it says that you *may* get a
performance benefit from using MPI_ALLOC_MEM.  E.g., an MPI can always
support using malloc buffers -- just copy into network-special  
buffers.  The performance would be terrible :-), but it would be  
correct.
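
As a rough illustration of that "copy into network-special buffers" fallback, here is a minimal sketch in plain C.  Everything in it is an assumption made for illustration: `bounce` stands in for a buffer the library registered once at startup, and `post_send()` is a hypothetical stub for whatever the transport would really do.  It is not how any particular MPI implements this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096  /* hypothetical size of the pre-registered buffer */

/* Stand-in for a buffer the MPI library registered once at startup. */
static unsigned char bounce[BOUNCE_SIZE];

/* Hypothetical transport stub: a real library would post a send here.
 * This version appends to an in-memory "wire" so the sketch can run. */
static unsigned char wire[1 << 16];
static size_t wire_len;

static void post_send(const unsigned char *buf, size_t len)
{
    assert(wire_len + len <= sizeof wire);
    memcpy(wire + wire_len, buf, len);
    wire_len += len;
}

/* Send an arbitrary (e.g. malloc'ed) buffer by staging it through the
 * bounce buffer.  Returns the number of chunks, i.e. extra copies. */
static size_t send_via_bounce(const void *user_buf, size_t len)
{
    const unsigned char *src = user_buf;
    size_t chunks = 0;

    while (len > 0) {
        size_t n = len < BOUNCE_SIZE ? len : BOUNCE_SIZE;
        memcpy(bounce, src, n);  /* the copy that costs the performance */
        post_send(bounce, n);
        src += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

The user buffer never needs to be registered, which is why this is always correct; every byte is copied once more than necessary, which is why it is slow.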

> 2. MPI already requires recompilation of apps, right?  I don't know
> fortran, or what it uses for allocation, but worst case, maybe you
> could
> change the standard libraries or compilers.

We tried that -- interposing our own copies of malloc, free, mmap, ...
etc. (e.g., inside libmpi).  Ick.  Horrible, horrible ick.  And it
definitely breaks some real-world apps and memory-checking
debuggers/tools.

> 3. Rip out your registration cache.  Make malloc'd buffers go really
> slow (register in fast path) and mpi_alloc_mem() buffers go really  
> fast.
> People will migrate.  The hard part of this would be getting all  
> MPIs to
> agree on this, I'm guessing.


See http://lists.openfabrics.org/pipermail/general/2009-May/059376.html
-- Open MPI effectively tried this and got beat up by a) competing
MPI's, and b) the marketing supporting Open MPI.  :-\

People won't migrate, nor will main-line MPI benchmarks.  Customers  
want top performance out-of-the-box with their MPI (which is not  
unreasonable).  Users have used malloc() for 10+ years, and other  
networks don't require the use of MPI_ALLOC_MEM.

-- 
Jeff Squyres
Cisco Systems


From tmtalpey at gmail.com  Fri May  1 06:25:33 2009
From: tmtalpey at gmail.com (Tom Talpey)
Date: Fri, 01 May 2009 09:25:33 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <EDC68D6F-BDF9-4889-9CA4-E654BAFEEB2D@cisco.com>
References: <C61F7DF4.4ABE%bwbarre@sandia.gov>
	<EDC68D6F-BDF9-4889-9CA4-E654BAFEEB2D@cisco.com>
Message-ID: <49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com>

At 09:07 AM 5/1/2009, Jeff Squyres wrote:
>On Apr 30, 2009, at 6:10 PM, Barrett, Brian W wrote:
>
>> I'm done now.  You don't want to fix your crap, that's fine.  Just  
>> don't be
>> surprised by the continued "why you shouldn't use IB" presentations  
>> from
>> people who have to write applications to it.
>>
>
>
>Let's not forget that Brian is not only an MPI developer (i.e., a  
>network programmer), he's also a customer.
>
>If OpenFabrics only wants the HPC market, you can probably ignore this  
>entire thread.  The OpenFabrics-based MPI's will hobble along like  
>they have been.  If you want larger markets, it's probably pretty safe  
>to assume that Brian's reactions are going to be quite similar to  
>enterprise network programmers.

Completely agree. I will add that enterprise network programmers are
going to reject registration caching as well, because it introduces
vulnerabilities into the data path - silent data corruption. For example,
storage won't tolerate it, databases won't, etc.
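
To make the failure mode concrete, here is a toy model in plain C of how a registration cache can go stale.  It is only a sketch under loud assumptions: the one-page "page table", `register_memory()` and `app_remap()` are all invented for illustration and do not correspond to any real verbs call.

```c
#include <assert.h>
#include <stdint.h>

/* Toy one-page address space: which physical page backs a virtual address. */
struct mapping { uintptr_t vaddr; int phys_page; };
static struct mapping page_table = { 0x1000, 7 };

/* A cached registration remembers the physical page seen at registration. */
struct reg { uintptr_t vaddr; int phys_page; int valid; };
static struct reg cached;

/* Hypothetical slow path: look up and pin the current physical page. */
static struct reg register_memory(uintptr_t vaddr)
{
    struct reg r = { vaddr, page_table.phys_page, 1 };
    return r;
}

/* Cache lookup keyed only by virtual address, as a userspace cache must be.
 * Returns the physical page the adapter would DMA to. */
static int phys_for_send(uintptr_t vaddr)
{
    if (!(cached.valid && cached.vaddr == vaddr))
        cached = register_memory(vaddr);
    return cached.phys_page;
}

/* The application unmaps and remaps the same virtual address and the OS
 * backs it with a different physical page.  Nothing tells the cache. */
static void app_remap(int new_phys)
{
    page_table.phys_page = new_phys;
}
```

After `app_remap(9)`, `phys_for_send(0x1000)` still returns 7: the cache hit looks perfectly valid, but the transfer would touch a page the application no longer owns.  No error is reported anywhere, which is what makes the corruption silent.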

The problem is that userspace memory registration is slow. Let's address
that, not address how to make a hack (registration caching) go faster.

We've solved this in the kernel with FRMR; why not take a similar solution
up to user verbs? Wouldn't that address it, by allowing the library to safely
and efficiently manage registration on a per-I/O basis?

Tom.


From todd.rimmer at qlogic.com  Fri May  1 07:27:50 2009
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Fri, 1 May 2009 09:27:50 -0500
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
	<264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
Message-ID: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>

It goes beyond just a tutorial.

In talking to customers, the consensus is that many application programmers struggle with sockets, and RDMA is an order of magnitude beyond that.  It's not a cut on programmers; there are some very strong ones in the enterprise, but a fair percentage only have associate degrees or technical school training.  Even the extremely smart ones have 100 things to juggle (and often must write code such that entry level programmers can support it), so the risk/reward or ROI of learning RDMA has to be there.  The higher the learning cost, the more difficult it is to justify the effort.

To summarize what is really needed:
- simplified APIs and easy migration of applications
	- SDP with zCopy was supposed to be a start, unfortunately the implementation required relinking applications.  Sounds simple to developers, but very tricky in the field, especially with complex apps, 3rd party scripts to start them, etc.  A kernel based "socket switch" approach is needed to make this 100% transparent.

- good simple examples of how to do it, sample programs etc
	- write the samples then analyze and improve the API to further simplify them
	- connection establishment is still difficult in OFED.  Also many apps are shortcutting the process by avoiding SA queries (hence impacting the ability of the applications to work properly with QOS, LMC, complex fabrics (torus, etc), Partitioning, etc).
	- either the Base API needs to improve or "helper libraries" are needed on top of it.

- effective tools to debug applications.  Right now there are very limited debug facilities in the ofa kernel (and most require a debug build), strace is not applicable to user verbs (due to kernel bypass), etc.  You need ways to analyze resources (QPs, MRs, etc) while the application is running or after it has dumped.  You need ways to trace the sequence of Verbs calls to analyze program behavior and bugs.  Ways to analyze the "on wire" behavior (aka tcpdump) of an application while it's running are also needed.  Right now it's impossible in OFED to identify how many QPs are open, let alone which applications are using them, etc.  Tools like madeye are inefficient and lack the proper filtering to be effective for all but very simple problems.

- accessibility in scripting languages and other languages (java, C#, etc).  Many languages have powerful capabilities to manage sockets and TCP layers above it (http, smtp, etc).  However there is no effective way to use RDMA and IB in languages other than C.  A start for scripting languages could be the transparent SDP approach.  For java, C++, C# and other languages there needs to be effective APIs and libraries that map well into the style of the language.

Todd Rimmer
Chief Architect 
QLogic Network Systems Group
Voice: 610-233-4852     Fax: 610-233-4777
Todd.Rimmer at QLogic.com  www.QLogic.com
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org [mailto:general-
> bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
> Sent: Friday, May 01, 2009 9:18 AM
> To: Ryan, Jim
> Cc: iwg at lists.openfabrics.org; Paul Grun; asafs`@voltaire.com; Paul
> Gray; Working Group; Wayne Augsburger; Lloyd Dickman; Sumanta
> Chatterjee; Mikkel Hagen; Roland Dreier (rdreier); bobs at voltaire.com;
> Jeff at lists.openfabrics.org; general at lists.openfabrics.org; Friedman;
> bill.boas at openfabrics.org; OFA at lists.openfabrics.org;
> Scott at lists.openfabrics.org
> Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
> 
> I'd also like to call the IWG's and MWG's attention to the other
> thread currently running on the general list: "New proposal for memory
> management."
> 
> There are many points in there about attracting non-HPC / enterprise
> network programmers to write verbs-based applications.  It's not just
> documentation / education that is missing -- having a series of FAQs
> and tutorials about verbs programming is not enough.  You need a
> network programming API that is no more complex than common sockets
> usage.
> 
> Specifically: let's not forget that HPC (OF's biggest market right
> now) tends to attract network programmers with PhD's, and/or who are
> among the top programming talent in the world (yes, that's being
> snobbish -- but it's still true).  To make OF within reach of the
> masses, you want to lower the bar so that legions of sockets-based
> network programmers can hope to learn/use this stuff without requiring
> them to get a PhD first.
> 
> 
> 
> On Apr 30, 2009, at 6:12 PM, Ryan, Jim wrote:
> 
> > At the risk of piling on, I think what Lloyd is suggesting is very
> > important. The objections I continue to hear about programming using
> > RDMA are along the lines of "it's too hard" or "no one knows how to
> > do it".
> >
> > It occurs to me if we could provide some concise instruction, that,
> > coupled with the undeniable benefits of RDMA, could provide a
> > compelling package for "RDMA for the masses"
> >
> > thanks, Jim
> >
> > From: mwg-bounces at lists.openfabrics.org [mailto:mwg-
> bounces at lists.openfabrics.org
> > ] On Behalf Of Lloyd Dickman
> > Sent: Thursday, April 30, 2009 1:17 PM
> > To: arkady kanevsky; bill.boas at openfabrics.org
> > Cc: iwg at lists.openfabrics.org; Paul Grun; OFA at lists.openfabrics.org;
> > Paul Gray; Working Group; Wayne Augsburger; Andy Grover; Richard
> > Frank;Jeff at lists.openfabrics.org; Squyres; Mikkel Hagen;
> Scott at lists.openfabrics.org
> > ; general at lists.openfabrics.org; Friedman; bobs at voltaire.com;
> > Sumanta Chatterjee;asafs`@voltaire.com; Roland Dreier
> > Subject: RE: [mwg] Re: RDMA tutorial and OFA
> >
> > I support the idea of the RDMA tutorial.  Beyond the "meat" as
> > described below, I would encourage the tutorial to include a "how to
> > program RDMA" section.  While OFA Verbs provides a rich set of
> > mechanisms, it is difficult for the average programmer to get a
> > solid handle on how to use the capabilities, register memory, ...
> > Some cookbook examples, or perhaps development of several
> > programming "patterns" can go a long way to having RDMA become a
> > much more mainstream application programming paradigm.
> >
> > Lloyd
> >
> > From: mwg-bounces at lists.openfabrics.org [mailto:mwg-
> bounces at lists.openfabrics.org
> > ] On Behalf Of arkady kanevsky
> > Sent: Thursday, April 30, 2009 11:27 AM
> > To: bill.boas at openfabrics.org
> > Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing
> > Working Group; Wayne Augsburger; Andy Grover; Richard Frank;
> asafs`@voltaire.com
> > ; Jeff Squyres; Mikkel Hagen;general at lists.openfabrics.org; Scott
> > Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier
> > Subject: [mwg] Re: RDMA tutorial and OFA
> >
> > Keep me in the loop.
> > I am interested to do it also.
> > Thanks,
> > Arkady
> > On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas
> > <Bill.Boas at openfabrics.org> wrote:
> > Richard, Andy,
> >
> > Thanks for copying me Richard. I had not seen Andy's email on the
> > general
> > list.
> >
> > Figuring out how to get tutorial and other documentation created and
> > published is on the list of things to get done in 2009 for me in my
> > part-time
> > role as Exec. Dir.
> >
> > There is no funding set up for this at the moment but I believe
> > there will
> > be in about 30 days.
> >
> > That's because I'm thinking that we can get funding for this by
> > making it
> > part of the funding for a new marketing plan for OFA that, with Wayne
> > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote
> > on at
> > the next con-call meeting which is on May 20 at 9.00AM PDT.
> >
> > Would you be willing to work with me and create a small team from
> > others
> > within OFA who have the same interest to prepare a description by
> > May 20 of
> > what the tutorial would look like, who would contribute to it, how
> > to get it
> > "polished up" for web and/or book style publication, what the
> > overall costs
> > would be, etc.
> >
> > My thoughts, that could be a starting point for the team's work, are
> > that we
> > would make the creation a collective effort.
> >
> > The tutorial would have several sections for example general intro,
> > benefits
> > of RDMA, applicability in HPC and Enterprise, networking background
> > etc.
> > Members of the Marketing Working Group would be responsible for this.
> >
> > The "meat" would be sections for kernel level things (verbs etc.),
> > then user
> > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each
> > section
> > overseen by the technical leaders/maintainers of the code within OFA
> > for
> > that section (for Example Tom Talpey for NFSoRDMA, or you Richard
> > for RDS)
> >
> > Finally the tutorial would have sections about Interoperability
> > Testing that
> > OFA/IOL does but also what customers can do on their own systems -
> > Arkady
> > and Rupert and IOL have put in an SC09 tutorial proposal that we
> could
> > leverage in this section.
> >
> > To all readers of this email:-
> > If you have read this far, please give us all some feedback. If you
> > have
> > material you'd like to contribute please say so. If there's a better
> > way,
> > tell us what you think it is!
> >
> > Thanks,
> >
> > Bill.
> >
> > Bill Boas
> > Executive Director and Vice Chair
> > OpenFabrics Alliance
> > 510-375-8840
> > Bill.Boas at openfabrics.org
> > www.openfabrics.org
> >
> > -----Original Message-----
> > From: Richard Frank [mailto:richard.frank at oracle.com]
> > Sent: Wednesday, April 29, 2009 12:58 PM
> > To: Andy Grover
> > Cc: Bill Boas; Sumanta Chatterjee
> > Subject: Re: RDMA tutorial and OFA
> >
> > Andy, I saw your postings to ofa-general on this and I agree it
> > would be
> > great to have this documentation.
> >
> > As OpenFabrics is really about RDMA... we need to make it simpler
> > for folks to pick up and run with RDMA concepts ...vs.. digging thru
> > the IB
> > specs and code examples, etc.
> >
> > Let's see what Bill Boas thinks...perhaps OFA has a writer on board
> > that
> > can help us do this..?
> >
> > I can also help provide input for a new OFA RDMA tutorial doc..
> >
> > Rick
> >
> > Andy Grover wrote:
> > > Hi Rick,
> > >
> > > Are you around for a brief chat this afternoon? I have a crazy
> > idea that
> > > involves OFA doing something (or putting up $$) and I wanted to
> > see what
> > > you thought, since you're Oracle's OFA rep, right?
> > >
> > > -- Andy
> > >
> > >
> >
> >
> >
> > --
> > Cheers,
> > Arkady Kanevsky
> 
> 
> --
> Jeff Squyres
> Cisco Systems
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-
> general


From jsquyres at cisco.com  Fri May  1 08:25:40 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 11:25:40 -0400
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
	<264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
	<5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>
Message-ID: <47C854CD-173C-45CC-9DBC-482EACC18921@cisco.com>

Hear, hear!

FWIW, I think the attached slide shows it pictorially pretty well.

A good one-line summary: MPI is so popular [in HPC] because the simple  
things are simple; with verbs, even the simple things are hard.


On May 1, 2009, at 10:27 AM, Todd Rimmer wrote:

> It goes beyond just a tutorial.
>
> In talking to customers, the consensus is that many application
> programmers struggle with sockets, and RDMA is an order of magnitude
> beyond that.  It's not a cut on programmers; there are some very
> strong ones in the enterprise, but a fair percentage only have
> associate degrees or technical school training.  Even the extremely  
> smart ones have 100 things to juggle (and often must write code such  
> that entry level programmers can support it), so the risk/reward or  
> ROI of learning RDMA has to be there.  The higher the learning cost  
> the more difficult to justify the effort.
>
> To summarize what is really needed:
> - simplified APIs and easy migration of applications
>         - SDP with zCopy was supposed to be a start, unfortunately  
> the implementation required relinking applications.  Sounds simple  
> to developers, but very tricky in the field, especially with complex  
> apps, 3rd party scripts to start them, etc.  A kernel based "socket  
> switch" approach is needed to make this 100% transparent.
>
> - good simple examples of how to do it, sample programs etc
>         - write the samples then analyze and improve the API to  
> further simplify them
>         - connection establishment is still difficult in OFED.  Also  
> many apps are shortcutting the process by avoiding SA queries (hence  
> impacting the ability of the applications to work properly with QOS,  
> LMC, complex fabrics (torus, etc), Partitioning, etc).
>         - either the Base API needs to improve or "helper libraries"  
> are needed on top of it.
>
> - effective tools to debug applications.  Right now there are very  
> limited debug facilities in the ofa kernel (and most require a debug  
> build), strace is not applicable to user verbs (due to kernel  
> bypass), etc.  You need ways to analyze resources (QPs, MRs, etc)  
> while the application is running or after it has dumped.  You need  
> ways to trace the sequence of Verbs calls to analyze program  
> behavior and bugs.  Ways to analyze the "on wire" behavior (aka
> tcpdump) of an application while it's running are also needed.  Right now
> it's impossible in OFED to identify how many QPs are open, let alone  
> which applications are using them, etc.  Tools like madeye are  
> inefficient and lack the proper filtering to be effective for all  
> but very simple problems.
>
> - accessibility in scripting languages and other languages (java,  
> C#, etc).  Many languages have powerful capabilities to manage  
> sockets and TCP layers above it (http, smtp, etc).  However there is  
> no effective way to use RDMA and IB in languages other than C.  A  
> start for scripting languages could be the transparent SDP  
> approach.  For java, C++, C# and other languages there needs to be  
> effective APIs and libraries that map well into the style of the  
> language.
>
> Todd Rimmer
> Chief Architect
> QLogic Network Systems Group
> Voice: 610-233-4852     Fax: 610-233-4777
> Todd.Rimmer at QLogic.com  www.QLogic.com
>
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org [mailto:general-
> > bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
> > Sent: Friday, May 01, 2009 9:18 AM
> > To: Ryan, Jim
> > Cc: iwg at lists.openfabrics.org; Paul Grun; asafs`@voltaire.com; Paul
> > Gray; Working Group; Wayne Augsburger; Lloyd Dickman; Sumanta
> > Chatterjee; Mikkel Hagen; Roland Dreier (rdreier);  
> bobs at voltaire.com;
> > Jeff at lists.openfabrics.org; general at lists.openfabrics.org; Friedman;
> > bill.boas at openfabrics.org; OFA at lists.openfabrics.org;
> > Scott at lists.openfabrics.org
> > Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
> >
> > I'd also like to call the IWG's and MWG's attention to the other
> > thread currently running on the general list: "New proposal for  
> memory
> > management."
> >
> > There are many points in there about attracting non-HPC / enterprise
> > network programmers to write verbs-based applications.  It's not  
> just
> > documentation / education that is missing -- having a series of FAQs
> > and tutorials about verbs programming is not enough.  You need a
> > network programming API that is no more complex than common sockets
> > usage.
> >
> > Specifically: let's not forget that HPC (OF's biggest market right
> > now) tends to attract network programmers with PhD's, and/or who are
> > among the top programming talent in the world (yes, that's being
> > snobbish -- but it's still true).  To make OF within reach of the
> > masses, you want to lower the bar so that legions of sockets-based
> > network programmers can hope to learn/use this stuff without  
> requiring
> > them to get a PhD first.
>


-- 
Jeff Squyres
Cisco Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jsquyres-panel-barriers-to-ofed-adoption-slide-5.pdf
Type: application/pdf
Size: 345219 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090501/8cd64f57/attachment.pdf>

From caitlin.bestler at gmail.com  Fri May  1 09:08:13 2009
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Fri, 1 May 2009 09:08:13 -0700
Subject: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com>
References: <49D2BD00.5010002@cs.anu.edu.au>
	<469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com>
	<517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com>
	<49D30C7F.1050201@cs.anu.edu.au>
	<469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com>
	<49FA7C21.1050400@cs.anu.edu.au>
	<517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com>
Message-ID: <469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com>

On Fri, May 1, 2009 at 4:24 AM, arkady kanevsky
<arkady.kanevsky at gmail.com> wrote:
> Jie,
> it sounds to me that either the variable is not volatile or compiler
> optimization
> causes some problem. I would check for these first.
> Arkady
>

Agreed, it is definitely a caching issue.

Atomics are InfiniBand specific, and there are some fairly complex rules
that govern how much caching the HCA can do. The gotcha is that they
basically provide some cache coherency guarantees within the context of a
connection, but not much between connections or versus local applications.

That said, it would be rare for HCA caching to be the cause of anything
worse than some unexpected ordering. Adapters cache when they have to, but
would really rather not allocate or track a lot of resources. Updating
real physical memory ASAP is much simpler.

Compilers, on the other hand, *love* optimizing. The key thing to
understand is that the HCA is another processor, one that is at least as
distant as any other CPU core. Any and all techniques used when sharing
memory with another processor apply.

Completions hide all that from the application, just promising that
specific things are coherent when the user invokes the verbs to reap a
completion. So whenever you do without completions you are dealing with an
arbitrary multi-processor memory coherence problem.
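
As an analogy only, this is roughly what the ordinary multi-processor technique looks like in C11 when one agent writes data and then a flag, and another polls the flag.  The `mailbox` layout and both function names are made up for this sketch; a real HCA writes memory by DMA rather than through C atomics, so the exact ordering rules you get depend on the hardware and the verbs in use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct mailbox {
    uint64_t   payload;  /* data written by the other agent */
    atomic_int ready;    /* flag the other agent writes last */
};

/* Stand-in for the other "processor" (an HCA or another core):
 * write the data first, then publish it with a release store. */
static void producer_write(struct mailbox *m, uint64_t value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Poll once.  Returns 1 and fills *out if the flag is visible, else 0.
 * The acquire load keeps the compiler and CPU from reading the payload
 * before the flag, and from caching the flag in a register. */
static int consumer_poll(struct mailbox *m, uint64_t *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    *out = m->payload;
    return 1;
}
```

With a plain non-volatile `int` flag, a compiler is free to hoist the load out of a polling loop, which would match the symptom described earlier in the thread.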


From sashak at voltaire.com  Fri May  1 09:07:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 19:07:17 +0300
Subject: [ofa-general] Re: [PATCH v3 3/3] Convert ibnetdiscover to use new
	ibnetdisc library.
In-Reply-To: <20090427142533.85f00f4d.weiny2@llnl.gov>
References: <20090403154301.f656e7a4.weiny2@llnl.gov>
	<20090423082535.GD8281@sk>
	<20090423100206.c2621310.weiny2@llnl.gov>
	<20090425103216.GB28604@sk>
	<20090427142533.85f00f4d.weiny2@llnl.gov>
Message-ID: <20090501160717.GD14714@sk.iol.unh.edu>

On 14:25 Mon 27 Apr     , Ira Weiny wrote:
> > 
> >    `PROG_LDADD' is inappropriate for passing program-specific linker
> > flags (except for `-l', `-L', `-dlopen' and `-dlpreopen').  So, use
> > the `PROG_LDFLAGS' variable for this purpose.  
> > 
> > So '-L' is exception suitable for LDADD.
> 
> Ah ok, I did not know about the exception.  We can change if you prefer.

I asked for "general knowledge". I don't see any particular reason to
change it.

Sasha


From pashash at gmail.com  Fri May  1 09:27:53 2009
From: pashash at gmail.com (Pavel Shamis (Pasha))
Date: Fri, 01 May 2009 19:27:53 +0300
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
References: <C6E9B5EC-C922-4675-8469-3D7C5AB4C9BE@cisco.com>	<loom.20090430T052134-692@post.gmane.org>	<48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com>
	<3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
Message-ID: <49FB2309.5090702@dev.mellanox.co.il>

Aaron Fabbri (aafabbri) wrote:
> 3. Rip out your registration cache.  Make malloc'd buffers go really
> slow (register in fast path) and mpi_alloc_mem() buffers go really fast.
> People will migrate.
People will migrate to what? (A) A new malloc? Or (B) another interconnect
platform that does not require users to change their applications in order
to get reasonable performance?
I'm not sure that people will choose (A) ;-)

Pasha.


From robert.j.woodruff at intel.com  Fri May  1 09:37:41 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 1 May 2009 09:37:41 -0700
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<C61F6F4D.4AB3%bwbarre@sandia.gov>
	<382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
	<8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>

Jeff wrote,

>It sounds like your main objection to fixing them is "it's too much  
>work."  :-(

Not really. In general I think the kernel folks like to keep stuff
out of the kernel if it is not really needed, i.e., if it can be implemented
in user-space, especially really complicated
things like this. It is probably somewhat tricky code to implement, prone
to bugs that could cause instability in the kernel. Remember, if there is a bug
in user-space and an application dies, it only affects that one application;
if the bug is in the kernel and it crashes the system, it affects everyone.

That said, if you can find someone to implement it, then do it and send in
the patches. Assuming the code is not too ugly, maybe it would get accepted.


From koop at cse.ohio-state.edu  Fri May  1 09:57:13 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri, 1 May 2009 12:57:13 -0400 (EDT)
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>
Message-ID: <Pine.GSO.4.40.0905011244230.14559-100000@kappa.cse.ohio-state.edu>


I'm not sure everyone has gotten this point yet -- It's not a matter of
the MPIs having this complicated memory registration code in userspace
they want to push into the kernel to simplify their lives.

The problem is that this registration cache in userspace is a hack that
can't even be guaranteed. Besides that, this hack ruins many memory
debugging tools.

This alone is a major hassle for many users since the memory registration
caching changes the timings. A user can't be told to just run the
application with the registration cache on for normal runs and then debug
with it off. Many errors end up being timing dependent.

Matt

On Fri, 1 May 2009, Woodruff, Robert J wrote:

> Jeff wrote,
>
> >It sounds like your main objection to fixing them is "it's too much
> >work."  :-(
>
> Not really, in general I think the kernel folks like to keep stuff
> out of the kernel if it is not really needed, i.e., if it can be implemented
> in user-space, especially really complicated
> things like this. It is probably somewhat tricky code to implement, prone
> to bugs that could cause instability in the kernel. Remember if there is a bug
> in user-space and an application dies, it only affects that one application,
> if the bug is in the kernel and it crashes the system it affects everyone.
>
> That said, if you can find someone to implement it, then do it and send in
> the patches. Assuming the code is not too ugly, maybe it would get accepted.
>


From rdreier at cisco.com  Fri May  1 10:09:47 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 May 2009 10:09:47 -0700
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com> (Jeff Squyres's
	message of "Fri, 1 May 2009 08:48:39 -0400")
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
	<C61F6F4D.4AB3%bwbarre@sandia.gov>
	<382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
	<8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
Message-ID: <adaljpgyckk.fsf@cisco.com>

 > You mentioned that doing this stuff is a choice; the choice that
 > MPI's/ ULPs/applications therefore have is:
 >
 > - don't use registration caches/memory allocation hooking, have
 > terrible performance
 > - use registration caches/memory allocation hooking, have good
 > performance

I think it's a bit of a stretch to suggest that all or even most
userspace RDMA applications have the same need for registration caching
as MPI.  In fact my feeling is that the fact that MPI must deal with
RDMA to arbitrary memory allocated by an application out of MPI's
control is the exception.  My most recent experience was with Cisco's
RAB library, and in that case we simply designed the library so that all
RDMA was done to memory allocated by the library -- so no need for a
registration cache, and in fact no need for registration in any fast
path.  I suspect that the majority of code written to use RDMA natively
will be designed with similar properties.

So this proposal is very much an MPI-specific interface.  Which leads to
my next point.  I have no doubt that the MPI community has a very good
idea of a memory registration interface that would make MPI
implementations simpler and more robust.  However I don't think there's
quite as much expertise about what the best way to implement such an
interface is.

My initial reaction is that I don't want to extend the kernel ABI with
a set of new MPI-specific verbs if there's a way around it.  We've been
told over and over that the registration cache is complex and fragile
code -- but moving complex and fragile code into the kernel doesn't
magically make it any simpler or more robust, it just means that bugs
now crash the whole system instead of just affecting one process.

Now, of course MMU notifiers allow the kernel to know reliably when a
process's page tables change, which means that all the complicated
malloc hooking etc is not needed.  So that complexity is avoided in the
kernel.  But suppose I give userspace the same MMU notifier capability
(eg I add a system call like "if any mappings in the virtual address
range X ... Y change, then write a 1 to virtual address Z") -- then what
do I gain from having the rest of the registration caching in the
kernel?  (And avoiding the duplication of caching code between multiple
MPI implementations is not an answer -- it's quite feasible to put the
caching code into libibverbs if that's the best place for it)
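
For the sake of discussion, here is a sketch in C of how a userspace cache might consume such a flag.  The system call itself does not exist; the sketch only models its effect as a byte (`mapping_changed`, standing in for "virtual address Z") that something else sets to 1.  `slow_register()` and the single-entry cache are likewise hypothetical placeholders, not libibverbs code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Byte at "virtual address Z": the hypothetical kernel facility would
 * write 1 here when any mapping in the watched range X ... Y changes. */
static atomic_uchar mapping_changed;

struct cache_entry { uintptr_t start; size_t len; int handle; int valid; };
static struct cache_entry entry;   /* one-entry cache, for brevity */
static int next_handle = 100;
static int registrations;          /* counts slow-path registrations */

/* Hypothetical slow path standing in for a real memory registration. */
static int slow_register(uintptr_t start, size_t len)
{
    (void)start;
    (void)len;
    registrations++;
    return next_handle++;
}

static int lookup_or_register(uintptr_t start, size_t len)
{
    /* Read and clear the flag in one step so a change that lands
     * between the check and the reset is not lost. */
    if (atomic_exchange(&mapping_changed, 0))
        entry.valid = 0;

    if (!(entry.valid && entry.start == start && entry.len == len)) {
        entry.start = start;
        entry.len = len;
        entry.handle = slow_register(start, len);
        entry.valid = 1;
    }
    return entry.handle;
}
```

The fast path costs one atomic exchange in userspace and no malloc hooking.  What the sketch leaves open is the window between a mapping changing and the next lookup, during which an already-posted transfer could still use the old registration.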

 - R.


From sashak at voltaire.com  Fri May  1 10:38:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 20:38:06 +0300
Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc
	including converting ibqueryerrors.pl
In-Reply-To: <20090427150409.9c10e479.weiny2@llnl.gov>
References: <20090422185441.6f8601dc.weiny2@llnl.gov>
	<20090425175710.GI28604@sk>
	<20090427150409.9c10e479.weiny2@llnl.gov>
Message-ID: <20090501173806.GF14714@sk.iol.unh.edu>

On 15:04 Mon 27 Apr     , Ira Weiny wrote:
> 
> The port output should be from low to high.

> What do you see?

Yes, the port order is good (I was wrong about it). But the switch order is
reversed - the first discovered switch is printed last. Right?

Sasha


From arkady.kanevsky at gmail.com  Fri May  1 10:57:57 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 13:57:57 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <Pine.GSO.4.40.0905011244230.14559-100000@kappa.cse.ohio-state.edu>
References: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>
	<Pine.GSO.4.40.0905011244230.14559-100000@kappa.cse.ohio-state.edu>
Message-ID: <517c62fb0905011057o5dfef3d0pc949fdd083360bf4@mail.gmail.com>

Matt, what is your feel for an FMR-like model in user space?
If an MPI implementation does FMR under the covers, do you
foresee any issues? It will not be as performant as preregistering
memory once, earlier in the MPI program.
Thanks,
Arkady

On Fri, May 1, 2009 at 12:57 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:

>
> I'm not sure everyone has gotten this point yet -- It's not a matter of
> the MPIs having this complicated memory registration code in userspace
> they want to push into the kernel to simplify their lives.
>
> The problem is that this registration cache in userspace is a hack that
> can't even be guaranteed. Besides that, this hack ruins many memory
> debugging tools.
>
> This alone is a major hassle for many users since the memory registration
> caching changes the timings. A user can't be told to just run the
> application with the registration cache on for normal runs and then debug
> with it off. Many errors end up being timing dependent.
>
> Matt
>
> On Fri, 1 May 2009, Woodruff, Robert J wrote:
>
> > Jeff wrote,
> >
> > >It sounds like your main objection to fixing them is "it's too much
> > >work."  :-(
> >
> > Not really, in general I think the kernel folks like to keep stuff
> > out of the kernel if it is not really needed, i.e., if it can be
> > implemented in user-space, especially really complicated things like
> > this. It is probably somewhat tricky code to implement, prone to bugs
> > that could cause instability in the kernel. Remember, if there is a
> > bug in user-space and an application dies, it only affects that one
> > application; if the bug is in the kernel and it crashes the system,
> > it affects everyone.
> >
> > That said, if you can find someone to implement it, then do it and
> > send in the patches. Assuming the code is not too ugly, maybe it
> > would get accepted.
> >
> >


-- 
Cheers,
Arkady Kanevsky

From jgunthorpe at obsidianresearch.com  Fri May  1 11:18:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 May 2009 12:18:31 -0600
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>
References: <20090429215508.GW4431@obsidianresearch.com>
	<C61E2CCC.4A57%bwbarre@sandia.gov>
	<20090429222125.GX4431@obsidianresearch.com>
	<1241044080.3403.374.camel@chromite.mv.qlogic.com>
	<20090429224411.GC32114@obsidianresearch.com>
	<23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com>
	<20090430222230.GF32114@obsidianresearch.com>
	<4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>
Message-ID: <20090501181830.GB3475@obsidianresearch.com>

On Fri, May 01, 2009 at 07:56:48AM -0400, Jeff Squyres wrote:
> On Apr 30, 2009, at 6:22 PM, Jason Gunthorpe wrote:
> 
> >After reading all the postings, I think my idea to fix the verbs API
> >to not, essentially, corrupt an existing registration when the virtual
> >address space changes is the best bet. This slightly changes the
> >semantics of the verbs MR to refer to virtual address space within the
> >process, not the underlying object(s) that happen to be mapped there
> >when the registration is made.
 
> I'm not sure how this helps MPI -- our registration caches will still  
> become invalid if the MPI app free()'s registered memory...?

No, they don't. The only reason you have a problem today is because
the memory registration is tied to the underlying *object* not the
virtual address. So when the app fiddles with things and changes the
virtual address to object mapping it wrecks your caching.

If instead the registration is tied to a virtual address, then it
doesn't matter what the app does, that virtual address range will
*always* point to the currently mapped objects.

If the app does free() and then malloc() without an intervening kernel
call then it doesn't matter, your cache of registered VM addresses
still says that it is available.

If the app does free() resulting in munmap and then malloc() resulting
in mmap() and re-uses the same address then, again, it doesn't matter
to you because the VM address is still registered by the kernel and is
switched to the new mmap().

The only problem is over time your cache will have registrations of VM
that are not in use by the app, or don't have backing objects any
longer. This is not a correctness problem, but it might be a
performance problem.
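
As a rough illustration of that trade-off, the sketch below models a cache
keyed purely by virtual address range. It assumes the proposed semantics (a
registration follows the VA, so a hit is always safe to reuse) and bounds
the build-up of unused entries with a simple least-recently-used eviction.
The names va_reg, va_cache, va_cache_lookup() and va_cache_insert() are made
up for this example, and no real verbs call such as ibv_reg_mr() is made.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4

struct va_reg {
	uintptr_t start;
	size_t len;
	unsigned long last_use;	/* logical clock value of the last hit */
	int used;
};

struct va_cache {
	struct va_reg slot[CACHE_SLOTS];
	unsigned long clock;
};

/* Find a registration covering [addr, addr + len); return its slot or -1. */
static int va_cache_lookup(struct va_cache *c, uintptr_t addr, size_t len)
{
	for (int i = 0; i < CACHE_SLOTS; i++) {
		struct va_reg *r = &c->slot[i];

		if (r->used && addr >= r->start && len <= r->len &&
		    addr - r->start <= r->len - len) {
			r->last_use = ++c->clock;
			return i;
		}
	}
	return -1;
}

/* Record a new registration, reusing a free slot or evicting the least
 * recently used one.  A real cache would deregister the victim here. */
static int va_cache_insert(struct va_cache *c, uintptr_t addr, size_t len)
{
	int victim = 0;

	for (int i = 0; i < CACHE_SLOTS; i++) {
		if (!c->slot[i].used) {
			victim = i;
			break;
		}
		if (c->slot[i].last_use < c->slot[victim].last_use)
			victim = i;
	}
	c->slot[victim] = (struct va_reg){ addr, len, ++c->clock, 1 };
	return victim;
}
```

A lookup for an address that was freed and handed out again by malloc()
still hits the old entry, which is exactly the case described above; the
eviction only addresses the performance concern, not correctness.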

> MPI maintains a registration cache because registration is so  
> expensive.  Even if the registration cache becomes "safely" invalid  
> (e.g., you'll never get a scenario where one virtual address could  
> have previously pointed to a different hardware address within the  
> span of one process), it doesn't help.

How so? That would seem to close the data corruption hole
entirely. Sure you still have to call registration functions but one
step at a time :)

> Ok, I'll back off slightly: if you want verbs to go mainstream, there  
> will be many other ULPs / middleware libraries that have memory models  
> like MPI's (that the upper layer is responsible for allocating/freeing  
> message buffers).  Put differently: the TCP/sockets stack doesn't have  
> this restriction; it will be extremely difficult to convert legions of  
> sockets programmers to verbs if you effectively restrict large  
> messages to only be allocated/freed by the network layer (kinda  
> defeats the point of RDMA if you have to copy large messages, right?).

Fair enough - but the registration model is pretty much an inevitable
consequence of kernel bypass. If you really want to get rid of it then
you need to have an operating mode where the WRs are generated by the
kernel through syscalls like all the other network stacks. I've not
seen any notion of how to separate the two ideas, at least.

Jason


From aafabbri at cisco.com  Fri May  1 11:40:30 2009
From: aafabbri at cisco.com (Aaron Fabbri)
Date: Fri, 1 May 2009 18:40:30 +0000 (UTC)
Subject: [ofa-general] New proposal for memory management
References: <C6E9B5EC-C922-4675-8469-3D7C5AB4C9BE@cisco.com>	<loom.20090430T052134-692@post.gmane.org>	<48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com>
	<3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
	<49FB2309.5090702@dev.mellanox.co.il>
Message-ID: <loom.20090501T181654-398@post.gmane.org>

Pavel Shamis (Pasha) <pashash <at> gmail.com> writes:
> 
> Aaron Fabbri (aafabbri) wrote:
> > 3. Rip out your registration cache.  Make malloc'd buffers go really
> > slow (register in fast path) and mpi_alloc_mem() buffers go really fast.
> > People will migrate.
> People will migrate to what? (A) A new malloc? Or (B) another interconnect
> platform that does not require users to change their applications in order
> to get reasonable performance?
> I'm not sure that people will choose (A).
> 

Agreed.  As I said, getting all MPIs to agree would be the hard part.  Given 
that, I think a new malloc() is easier than porting to non-MPI middleware.

My point is this:

- Verbs works well for a number of applications (Roland and I have each written 
multiple, for example).

- IMHO, there is a problem with your API that should be fixed (the messaging 
layer needs to manage network buf allocation).  If you required mpi_alloc_mem, 
you would get rid of a whole layer of complicated crap.  It may not be feasible 
for you, but it is the right thing to do from an engineering perspective, 
right?

- "MPI" doesn't want to fix the problem, but instead is asking other people to 
make kernel changes for them and saying things like "verbs is broken".

I totally see your guys' problem and feel for you.  Either way it comes down to 
politics; getting some MPI-specific code into the Linux Kernel (fun?), or 
getting MPI users to have to change crusty old scientific code (very fun?).

Could you use the silent corruption problems as leverage to get MPI to move to 
mpi_alloc_mem?

Final point I want to make is that this is open source, so you can always try 
submitting some elite patches and get the changes you need.

Aaron


From sashak at voltaire.com  Fri May  1 11:55:08 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 21:55:08 +0300
Subject: [ofa-general] Re: [PATCH 8/8] Convert ibqueryerrors.pl to C and use
	new ibnetdisc library.
In-Reply-To: <20090427145026.7e074ffc.weiny2@llnl.gov>
References: <20090423133120.acf0af63.weiny2@llnl.gov>
	<20090425155441.GE28604@sk>
	<20090427145026.7e074ffc.weiny2@llnl.gov>
Message-ID: <20090501185508.GH14714@sk.iol.unh.edu>

On 14:50 Mon 27 Apr     , Ira Weiny wrote:
> 
> The removal of this line causes the '-S' option to segfault.  Patch to pq/ibn4
> is below.

Thanks. I'm applying this to pq/ibn4

> I will work up a separate patch.  Right now you are correct if the SA is
> unresponsive the "-S" option will fail.  iblinkinfo does the full scan every
> time.  But that slows down the query for a single switch to the same O(n)
> query that a full system scan requires.  I would rather have that query be
> O(1).  So I implemented ibqueryerrors in this manner with the intent of going
> back and "fixing" iblinkinfo.  I think having a fall back on a full system
> scan is a good idea.  Patch for both tools will follow...  :-D

Thanks. I'm doing pq/ibn4 merge now. We can apply the rest after this.
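
The fallback idea quoted above (try the O(1) targeted query first, and do the
O(n) full scan only if that fails) could look roughly like the sketch below.
The callback type query_fn, the function query_switch() and the two stubs are
hypothetical stand-ins; they are not part of ibqueryerrors, iblinkinfo or
libibnetdisc.

```c
#include <stddef.h>

/* Hypothetical query callback: returns 0 on success, -1 on failure. */
typedef int (*query_fn)(void *ctx, unsigned long long guid);

/* Try the cheap targeted query first; fall back to a full fabric scan only
 * when it is unavailable or fails (e.g. the SA does not respond).
 * *used_fallback reports which path produced the result. */
static int query_switch(query_fn direct, query_fn full_scan, void *ctx,
			unsigned long long guid, int *used_fallback)
{
	*used_fallback = 0;
	if (direct && direct(ctx, guid) == 0)
		return 0;

	*used_fallback = 1;
	return full_scan(ctx, guid);
}

/* Stand-in callbacks for demonstration only. */
static int stub_ok(void *ctx, unsigned long long guid)
{
	(void)ctx;
	(void)guid;
	return 0;
}

static int stub_fail(void *ctx, unsigned long long guid)
{
	(void)ctx;
	(void)guid;
	return -1;
}
```

The common case stays O(1), and the tool degrades to the slower scan instead
of failing outright.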

Sasha


From jgunthorpe at obsidianresearch.com  Fri May  1 11:59:00 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 May 2009 12:59:00 -0600
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com>
References: <C61F7DF4.4ABE%bwbarre@sandia.gov>
	<EDC68D6F-BDF9-4889-9CA4-E654BAFEEB2D@cisco.com>
	<49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com>
Message-ID: <20090501185900.GJ32114@obsidianresearch.com>

On Fri, May 01, 2009 at 09:25:33AM -0400, Tom Talpey wrote:

> Completely agree. I will add that enterprise network programmers are
> going to reject registration caching as well, because it introduces
> vulnerabilities into the data path - silent data corruption. For example,
> storage won't tolerate it, databases won't, etc.

By the same token those apps that care about data security like you
cite *must* manually manage their registration to only expose the
memory that needs to be exposed at any time. That is a mandatory step
as soon as you have client initiated RDMA operations, no matter what
your protocol is.

> The problem is that userspace memory registration is slow. Let's address
> that, not address how to make a hack (registration caching) go faster.

Indeed, but how? You need to make a syscall to pin and map the pages,
which is fine, but how do you communicate the information to the HCA
in a manner that is utterly secure and doesn't let userspace 'fiddle'
it to point to arbitrary random memory? You get burned pretty fast by
the fact that the HCA is DMA'ing instructions out of user space directly :(

Jason


From sashak at voltaire.com  Fri May  1 12:47:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 22:47:26 +0300
Subject: [ofa-general] [PATCH] libibumad: keep port capmask as 32-bit
	variable
In-Reply-To: <112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com>
References: <49F16310.1080902@ext.bull.net>
	<132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com>
	<20090425210255.GL28604@sk>
	<112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com>
Message-ID: <20090501194726.GJ14714@sk.iol.unh.edu>


For an unknown reason the IB port capmask was defined as 64-bit unsigned,
which caused some portability problems. Fix this.

Pointed out by Nicolas Morey-Chaisemartin and Sean Hefty.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 libibumad/include/infiniband/umad.h |    2 +-
 libibumad/src/umad.c                |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h
index 91ccf1d..78862c8 100644
--- a/libibumad/include/infiniband/umad.h
+++ b/libibumad/include/infiniband/umad.h
@@ -129,7 +129,7 @@ typedef struct umad_port {
 	unsigned state;
 	unsigned phys_state;
 	unsigned rate;
-	uint64_t capmask;
+	uint32_t capmask;
 	uint64_t gid_prefix;
 	uint64_t port_guid;
 	unsigned pkeys_size;
diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c
index 72ef506..deb3b9d 100644
--- a/libibumad/src/umad.c
+++ b/libibumad/src/umad.c
@@ -160,7 +160,7 @@ get_port(char *ca_name, char *dir, int portnum, umad_port_t *port)
 		goto clean;
 	if (sys_read_uint(port_dir, SYS_PORT_RATE, &port->rate) < 0)
 		goto clean;
-	if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0)
+	if (sys_read_uint(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0)
 		goto clean;
 
 	port->capmask = htonl(port->capmask);
-- 
1.6.1.2.319.gbd9e
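
One way a width mismatch like this can bite is sketched below. It is an
illustration only, not necessarily the exact failure that was reported: a
value byte-swapped as 32 bits, stored in a 64-bit field and later swapped
back as 64 bits ends up in the upper half of the word, so truncating it to
32 bits yields zero. The helpers swap32(), swap64() and the two
*_roundtrip() functions are invented for this example; they always swap, as
htonl()/ntohl() do on a little-endian host, so the result does not depend on
the machine running it.

```c
#include <stdint.h>

/* Unconditional byte swaps, used instead of htonl()/ntohl() so that the
 * demonstration behaves the same on any host. */
static uint32_t swap32(uint32_t v)
{
	return ((v & 0x000000ffU) << 24) | ((v & 0x0000ff00U) << 8) |
	       ((v >> 8) & 0x0000ff00U) | (v >> 24);
}

static uint64_t swap64(uint64_t v)
{
	return ((uint64_t)swap32((uint32_t)v) << 32) |
	       swap32((uint32_t)(v >> 32));
}

/* Store with a 32-bit swap into a 64-bit field, read back with a 64-bit
 * swap and truncate: the value lands in the high half and is lost. */
static uint32_t mixed_width_roundtrip(uint32_t capmask)
{
	uint64_t field = swap32(capmask);

	return (uint32_t)swap64(field);
}

/* Same width on both sides: the value survives the round trip. */
static uint32_t matched_width_roundtrip(uint32_t capmask)
{
	uint32_t field = swap32(capmask);

	return swap32(field);
}
```

Keeping the field and both conversions at 32 bits removes this class of
problem.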


From sashak at voltaire.com  Fri May  1 12:50:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 22:50:23 +0300
Subject: [ofa-general] [PATCH] ibstat.c: use ntohl() for 32-bit capmask
	conversion
In-Reply-To: <20090501194726.GJ14714@sk.iol.unh.edu>
References: <49F16310.1080902@ext.bull.net>
	<132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com>
	<20090425210255.GL28604@sk>
	<112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com>
	<20090501194726.GJ14714@sk.iol.unh.edu>
Message-ID: <20090501195022.GK14714@sk.iol.unh.edu>


The capmask field was changed to be 32-bit, so use ntohl() instead of
ntohll(). The cast is then no longer needed.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/src/ibstat.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 7985be1..06f39ae 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone)
 	printf("%sBase lid: %d\n", pre, port->base_lid);
 	printf("%sLMC: %d\n", pre, port->lmc);
 	printf("%sSM lid: %d\n", pre, port->sm_lid);
-	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
+	printf("%sCapability mask: 0x%08x\n", pre, ntohl(port->capmask));
 	printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid));
 	return 0;
 }
-- 
1.6.1.2.319.gbd9e


From sashak at voltaire.com  Fri May  1 12:52:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 22:52:49 +0300
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Change redir_tbl_size to
	num_ports for better clarity
In-Reply-To: <20090426123009.GA25119@comcast.net>
References: <20090426123009.GA25119@comcast.net>
Message-ID: <20090501195249.GL14714@sk.iol.unh.edu>

On 08:30 Sun 26 Apr     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 12:56:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 22:56:03 +0300
Subject: [ofa-general] Re: [PATCH] ibsim: Fixed custom release in SPEC file
In-Reply-To: <49F58E31.3020005@ext.bull.net>
References: <49F58E31.3020005@ext.bull.net>
Message-ID: <20090501195603.GM14714@sk.iol.unh.edu>

On 12:51 Mon 27 Apr     , Nicolas Morey-Chaisemartin wrote:
> Removed a space which makes rpmbuild fail when _dist and CUSTOM_RELEASE are set:
> error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11
> 
> This is due to 
> Release: %rel%{?dist}
> and %rel having a trailing whitespace.
> 
> 
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 12:57:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 22:57:02 +0300
Subject: [ofa-general] Re: [PATCH] management: Fixed custom_release in SPEC
	files
In-Reply-To: <49F58FF9.8070608@ext.bull.net>
References: <49F58FF9.8070608@ext.bull.net>
Message-ID: <20090501195702.GN14714@sk.iol.unh.edu>

On 12:59 Mon 27 Apr     , Nicolas Morey-Chaisemartin wrote:
> Removed a space which makes rpmbuild fail when _dist and CUSTOM_RELEASE are set:
> error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11
> 
> This is due to 
> Release: %rel%{?dist}
> and %rel having a trailing whitespace.
> 
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 13:27:23 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 23:27:23 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Add SuperMicro to list of
	recognized vendors
In-Reply-To: <20090427135330.GA24559@comcast.net>
References: <20090427135330.GA24559@comcast.net>
Message-ID: <20090501202723.GO14714@sk.iol.unh.edu>

On 09:53 Mon 27 Apr     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 13:27:43 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 23:27:43 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM: include/vendor/osm_vendor.h -
	Replaced #elif with no condition by #else
In-Reply-To: <49F5A930.1030102@ext.bull.net>
References: <49F5A930.1030102@ext.bull.net>
Message-ID: <20090501202743.GP14714@sk.iol.unh.edu>

On 14:46 Mon 27 Apr     , Nicolas Morey-Chaisemartin wrote:
> 
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 13:42:37 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 23:42:37 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Display
	attribute ID in hex rather than decimal
In-Reply-To: <20090427181753.GA20430@comcast.net>
References: <20090427181753.GA20430@comcast.net>
Message-ID: <20090501204237.GQ14714@sk.iol.unh.edu>

On 14:17 Mon 27 Apr     , Hal Rosenstock wrote:
> 
> for easier correlation to IBA spec
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 13:44:37 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 23:44:37 +0300
Subject: [ofa-general] Re: [PATCH] opensm: Changes to spec and make files for
	updated release notes
In-Reply-To: <20090427110832.GA22098@comcast.net>
References: <20090427110832.GA22098@comcast.net>
Message-ID: <20090501204437.GR14714@sk.iol.unh.edu>

On 07:08 Mon 27 Apr     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri May  1 13:50:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 May 2009 23:50:28 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL
	field
In-Reply-To: <20090427110619.GA22089@comcast.net>
References: <20090427110619.GA22089@comcast.net>
Message-ID: <20090501205028.GS14714@sk.iol.unh.edu>

Hi Hal,

On 07:06 Mon 27 Apr     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

There are different things mixed in this patch (like self NodeInfo
resolution and redirection status printouts). Is it just a mistake?

Sasha

> ---
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 0e47ccf..c74cb1d 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -500,6 +500,7 @@ enum MAD_FIELDS {
>  	IB_SA_PR_DLID_F,
>  	IB_SA_PR_SLID_F,
>  	IB_SA_PR_NPATH_F,
> +	IB_SA_PR_SL_F,
>  
>  	/*
>  	 * MC Member rec
> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
> index c24bc12..81693a2 100644
> --- a/libibmad/src/fields.c
> +++ b/libibmad/src/fields.c
> @@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = {
>  	{BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint},
>  	{BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint},
>  	{BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint},
> +	{BITSOFFS(428, 4), "PathRecSL", mad_dump_uint},
>  
>  	/*
>  	 * MC Member rec
> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
> index 691bdc3..f17da11 100644
> --- a/libibmad/src/resolve.c
> +++ b/libibmad/src/resolve.c
> @@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
>  		return -1;
>  
>  	mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid);
> +	mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl);
>  
>  	return ib_portid_set(sm_id, lid, 0, 0);
>  }
> @@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>  {
>  	ib_portid_t sm_portid;
>  	char buf[IB_SA_DATA_SIZE] = { 0 };
> +	ib_portid_t self = { 0 };
> +	uint64_t selfguid;
> +	ibmad_gid_t selfgid;
> +	uint8_t nodeinfo[64];
>  
>  	if (!sm_id) {
>  		sm_id = &sm_portid;
>  		if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0)
>  			return -1;
>  	}
> +
> +	if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport))
> +		return -1;
> +	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &selfguid);
> +	mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX);
> +	mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid);
> +
>  	if (*(uint64_t *) & portid->gid == 0)
>  		mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F,
>  				IB_DEFAULT_SUBN_PREFIX);
> @@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
>  		mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid);
>  
>  	if ((portid->lid =
> -	     ib_path_query_via(srcport, portid->gid, portid->gid, sm_id,
> +	     ib_path_query_via(srcport, selfgid, portid->gid, sm_id,
>  			       buf)) < 0)
>  		return -1;
>  
> +	mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl);
>  	return 0;
>  }
>  
> @@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
>  		return -1;
>  
>  	mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid);
> +	mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl);
>  	mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix);
>  	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid);
>  
> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
> index 07b623d..21fcc9a 100644
> --- a/libibmad/src/rpc.c
> +++ b/libibmad/src/rpc.c
> @@ -187,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
>  	      ib_portid_t * dport, void *payload, void *rcvdata)
>  {
>  	int status, len;
> -	uint8_t sndbuf[1024], rcvbuf[1024], *mad;
> +	uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass;
>  	int timeout, retries;
>  
>  	len = 0;
> @@ -209,7 +209,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc,
>  
>  	mad = umad_get_mad(rcvbuf);
>  
> -	if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) {
> +	status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
> +	mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F);
> +	if (mgmtclass == IB_SMI_DIRECT_CLASS)
> +		status &= 0x7fff;
> +	else if (mgmtclass != IB_SMI_CLASS) {
> +		if (status & 2) {
> +			ERRS("MAD redirection not supported; dport (%s)",
> +			     portid2str(dport));
> +			return 0;
> +		}
> +	}
> +	if (status) {
>  		ERRS("MAD completed with error status 0x%x; dport (%s)",
>  		     status, portid2str(dport));
>  		return 0;
> @@ -254,8 +265,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc,
>  	mad = umad_get_mad(rcvbuf);
>  
>  	if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) {
> -		ERRS("MAD completed with error status 0x%x; dport (%s)",
> -		     status, portid2str(dport));
> +		if (status & 2)
> +			ERRS("MAD redirection not supported; dport (%s)",
> +			     portid2str(dport));
> +		else
> +			ERRS("MAD completed with error status 0x%x; dport (%s)",
> +			     status, portid2str(dport));
>  		return 0;
>  	}
>  
> 
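
The status handling that the rpc.c hunk above adds to mad_rpc() can be read
as a small decision function. The sketch below restates that logic on its
own; the enum, the macro names and classify_mad_status() are invented for
illustration and are not libibmad API. The numeric values are assumptions
taken from the InfiniBand management class and MAD status definitions (0x01
for LID-routed SubnMgmt, 0x81 for directed-route SubnMgmt, status bit 0x2
for "redirection required").

```c
#include <stdint.h>

#define MGMT_CLASS_SMI_LID	0x01	/* assumed: SubnMgmt, LID routed */
#define MGMT_CLASS_SMI_DIRECT	0x81	/* assumed: SubnMgmt, directed route */
#define MAD_STATUS_REDIRECT	0x0002	/* assumed: redirection required */
#define DR_STATUS_MASK		0x7fff	/* drops the top (direction) bit */

enum mad_result { MAD_OK, MAD_REDIRECT, MAD_ERROR };

/* Mirror of the patch logic: directed-route SMPs have the top status bit
 * masked off, non-SMI classes treat bit 0x2 as an unsupported redirect,
 * and any remaining non-zero status is an error. */
static enum mad_result classify_mad_status(uint8_t mgmtclass, uint16_t status)
{
	if (mgmtclass == MGMT_CLASS_SMI_DIRECT)
		status &= DR_STATUS_MASK;
	else if (mgmtclass != MGMT_CLASS_SMI_LID &&
		 (status & MAD_STATUS_REDIRECT))
		return MAD_REDIRECT;

	return status ? MAD_ERROR : MAD_OK;
}
```

Note that for the LID-routed SMI class the 0x2 bit falls through to the
generic error path, just as in the hunk.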


From hal.rosenstock at gmail.com  Fri May  1 13:59:04 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 1 May 2009 16:59:04 -0400
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord
	SL field
In-Reply-To: <20090501205028.GS14714@sk.iol.unh.edu>
References: <20090427110619.GA22089@comcast.net>
	<20090501205028.GS14714@sk.iol.unh.edu>
Message-ID: <f0e08f230905011359o53b622efsc4f1b3ad4601ccab@mail.gmail.com>

Hi Sasha,

On Fri, May 1, 2009 at 4:50 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 07:06 Mon 27 Apr     , Hal Rosenstock wrote:
>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>
> There are different things mixed in this patch (like self NodeInfo
> resolution and redirection status printouts). Is it just a mistake?

Yes, it was meant to just be the mad.h and fields.c part. The other
files were mistakenly included. Sorry. Let me know if you want me to
regenerate this.

-- Hal

> Sasha
>


From sashak at voltaire.com  Fri May  1 14:25:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 2 May 2009 00:25:32 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA
	PathRecord SL field
In-Reply-To: <f0e08f230905011359o53b622efsc4f1b3ad4601ccab@mail.gmail.com>
References: <20090427110619.GA22089@comcast.net>
	<20090501205028.GS14714@sk.iol.unh.edu>
	<f0e08f230905011359o53b622efsc4f1b3ad4601ccab@mail.gmail.com>
Message-ID: <20090501212532.GT14714@sk.iol.unh.edu>

On 16:59 Fri 01 May     , Hal Rosenstock wrote:
> 
> Yes, it was meant to just be the mad.h and fields.c part. The other
> files were mistakenly included. Sorry. Let me know if you want me to
> regenerate this.

And wasn't SL decoding to portid's sl field part of the patch?

I think it would be better to regenerate, then it will be clear what was
supposed to be there.

Sasha


From hal.rosenstock at gmail.com  Fri May  1 14:30:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 1 May 2009 17:30:38 -0400
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord
	SL field
In-Reply-To: <20090501212532.GT14714@sk.iol.unh.edu>
References: <20090427110619.GA22089@comcast.net>
	<20090501205028.GS14714@sk.iol.unh.edu>
	<f0e08f230905011359o53b622efsc4f1b3ad4601ccab@mail.gmail.com>
	<20090501212532.GT14714@sk.iol.unh.edu>
Message-ID: <f0e08f230905011430p1fa8a004i53157b2c64f89a00@mail.gmail.com>

On Fri, May 1, 2009 at 5:25 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 16:59 Fri 01 May     , Hal Rosenstock wrote:
>>
>> Yes, it was meant to just be the mad.h and fields.c part. The other
>> files were mistakenly included. Sorry. Let me know if you want me to
>> regenerate this.
>
> And wasn't SL decoding to portid's sl field part of the patch?

Not yet; It's under test.

> I think it would be better to regenerate, then it will be clear what was
> supposed to be there.

OK.

-- Hal

> Sasha
>


From hnrose at comcast.net  Fri May  1 14:33:12 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 1 May 2009 17:33:12 -0400
Subject: [ofa-general] [PATCH] libibmad: Add support for SA PathRecord SL
	field
Message-ID: <20090501213312.GA29913@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 2f5673f..432710a 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -500,6 +500,7 @@ enum MAD_FIELDS {
 	IB_SA_PR_DLID_F,
 	IB_SA_PR_SLID_F,
 	IB_SA_PR_NPATH_F,
+	IB_SA_PR_SL_F,
 
 	/*
 	 * MC Member rec
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index 60b310c..129f7e5 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint},
 	{BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint},
 	{BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint},
+	{BITSOFFS(428, 4), "PathRecSL", mad_dump_uint},
 
 	/*
 	 * MC Member rec
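
For readers unfamiliar with the field table, here is a small sketch of what
reading a 4-bit field at bit offset 428 amounts to. It assumes bits are
numbered from the most significant bit of byte 0 (network order), so offset
428 is the low nibble of byte 53 of the buffer being decoded. get_bits_be()
is a hypothetical helper written for this note, not part of libibmad, and
BITSOFFS()/mad_get_field() may represent offsets differently internally.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: read `width` bits (width <= 32) starting at bit
 * offset `bit_off`, where bit 0 is the most significant bit of buf[0].
 */
static unsigned get_bits_be(const uint8_t *buf, unsigned bit_off,
			    unsigned width)
{
	unsigned val = 0;
	unsigned i;

	for (i = 0; i < width; i++) {
		unsigned bit = bit_off + i;
		/* pick one bit, MSB first within each byte */
		unsigned b = (buf[bit / 8] >> (7 - bit % 8)) & 1u;

		val = (val << 1) | b;
	}
	return val;
}
```

With this numbering, get_bits_be(rec, 428, 4) returns the low nibble of
rec[53].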


From jgunthorpe at obsidianresearch.com  Fri May  1 14:36:52 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 May 2009 15:36:52 -0600
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to
	monitor rdma_cm connections
In-Reply-To: <49F9A729.3090904@voltaire.com>
References: <49F05AAE.4020606@Voltaire.COM>
	<90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com>
	<49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com>
	<49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com>
	<20090427162349.GI4431@obsidianresearch.com>
	<49F9A729.3090904@voltaire.com>
Message-ID: <20090501213652.GO32114@obsidianresearch.com>

On Thu, Apr 30, 2009 at 04:27:05PM +0300, Or Gerlitz wrote:
> Jason Gunthorpe wrote:
>> including a PID is not best, you should include enough information to 
>> figure out the pid(s) from proc/xx/fd, and vice versa.

> maybe its not the best solution but it seems to me good enough

Well, we have to live with these interfaces literally forever,
shortcuts ultimately just cause more problems down the road..

Really the thinking should be 'I want to make lsof work usefully' not
'I want some random and different hack to let me see something'. And
yes, that is harder. But the IB stack is now at the point where these
small hard things are the sort of work that is needed to get parity
with the other stuff in linux..

Jason


From sashak at voltaire.com  Fri May  1 14:36:27 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 2 May 2009 00:36:27 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL
	field
In-Reply-To: <20090501213312.GA29913@comcast.net>
References: <20090501213312.GA29913@comcast.net>
Message-ID: <20090501213627.GU14714@sk.iol.unh.edu>

On 17:33 Fri 01 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From hnrose at comcast.net  Fri May  1 14:47:24 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 1 May 2009 17:47:24 -0400
Subject: [ofa-general] [PATCH] opensm/PerfMgr: Remove some underbars from
	internal names
Message-ID: <20090501214724.GA30974@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h
index 16a59ef..e6a1cfe 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -97,15 +97,15 @@ typedef struct redir {
 } redir_t;
 
 /* Node to store information about which nodes we are monitoring */
-typedef struct _monitored_node {
+typedef struct monitored_node {
 	cl_map_item_t map_item;
-	struct _monitored_node *next;
+	struct monitored_node *next;
 	uint64_t guid;
 	boolean_t esp0;
 	char *name;
 	uint32_t num_ports;
 	redir_t redir_port[1];	/* redirection on a per port basis */
-} __monitored_node_t;
+} monitored_node_t;
 
 struct osm_opensm;
 /****s* OpenSM: PerfMgr/osm_perfmgr_t
@@ -133,7 +133,7 @@ typedef struct osm_perfmgr {
 	cl_event_t sig_query;	/* will throttle our querys */
 	uint32_t max_outstanding_queries;
 	cl_qmap_t monitored_map;	/* map the nodes we are tracking */
-	__monitored_node_t *remove_list;
+	monitored_node_t *remove_list;
 } osm_perfmgr_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 7c24819..93644a0 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -119,7 +119,7 @@ extern int wait_for_pending_transactions(osm_stats_t * stats);
 /**********************************************************************
  * Internal helper functions.
  **********************************************************************/
-static void __init_monitored_nodes(osm_perfmgr_t * pm)
+static void init_monitored_nodes(osm_perfmgr_t * pm)
 {
 	cl_qmap_init(&pm->monitored_map);
 	pm->remove_list = NULL;
@@ -127,7 +127,7 @@ static void __init_monitored_nodes(osm_perfmgr_t * pm)
 	cl_event_init(&pm->sig_query, FALSE);
 }
 
-static void __mark_for_removal(osm_perfmgr_t * pm, __monitored_node_t * node)
+static void mark_for_removal(osm_perfmgr_t * pm, monitored_node_t * node)
 {
 	if (pm->remove_list) {
 		node->next = pm->remove_list;
@@ -138,10 +138,10 @@ static void __mark_for_removal(osm_perfmgr_t * pm, __monitored_node_t * node)
 	}
 }
 
-static void __remove_marked_nodes(osm_perfmgr_t * pm)
+static void remove_marked_nodes(osm_perfmgr_t * pm)
 {
 	while (pm->remove_list) {
-		__monitored_node_t *next = pm->remove_list->next;
+		monitored_node_t *next = pm->remove_list->next;
 
 		cl_qmap_remove_item(&pm->monitored_map,
 				    (cl_map_item_t *) (pm->remove_list));
@@ -153,7 +153,7 @@ static void __remove_marked_nodes(osm_perfmgr_t * pm)
 	}
 }
 
-static inline void __decrement_outstanding_queries(osm_perfmgr_t * pm)
+static inline void decrement_outstanding_queries(osm_perfmgr_t * pm)
 {
 	cl_atomic_dec(&pm->outstanding_queries);
 	cl_event_signal(&pm->sig_query);
@@ -173,7 +173,7 @@ static void perfmgr_mad_recv_callback(osm_madw_t * p_madw, void *bind_context,
 	osm_madw_copy_context(p_madw, p_req_madw);
 	osm_mad_pool_put(pm->mad_pool, p_req_madw);
 
-	__decrement_outstanding_queries(pm);
+	decrement_outstanding_queries(pm);
 
 	/* post this message for later processing. */
 	if (cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS,
@@ -196,7 +196,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 	uint64_t node_guid = context->perfmgr_context.node_guid;
 	uint8_t port = context->perfmgr_context.port;
 	cl_map_item_t *p_node;
-	__monitored_node_t *p_mon_node;
+	monitored_node_t *p_mon_node;
 
 	OSM_LOG_ENTER(pm->log);
 
@@ -209,7 +209,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 			PRIx64 " not found in monitored map\n", node_guid);
 		goto Exit;
 	}
-	p_mon_node = (__monitored_node_t *) p_node;
+	p_mon_node = (monitored_node_t *) p_node;
 
 	OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C02: %s (0x%" PRIx64
 		") port %u\n", p_mon_node->name, p_mon_node->guid, port);
@@ -236,7 +236,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 Exit:
 	osm_mad_pool_put(pm->mad_pool, p_madw);
 
-	__decrement_outstanding_queries(pm);
+	decrement_outstanding_queries(pm);
 
 	OSM_LOG_EXIT(pm->log);
 }
@@ -305,7 +305,7 @@ Exit:
 /**********************************************************************
  * Given a monitored node and a port, return the qp
  **********************************************************************/
-static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port)
+static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port)
 {
 	ib_net32_t qp = cl_ntoh32(1);
 
@@ -322,7 +322,7 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port)
  * return the appropriate lid to query that port
  **********************************************************************/
 static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port,
-			  __monitored_node_t * mon_node)
+			  monitored_node_t * mon_node)
 {
 	if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
 	    mon_node->redir_port[port].redir_lid)
@@ -414,12 +414,12 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr,
 /**********************************************************************
  * sweep the node_guid_tbl and collect the node guids to be tracked
  **********************************************************************/
-static void __collect_guids(cl_map_item_t * p_map_item, void *context)
+static void collect_guids(cl_map_item_t * p_map_item, void *context)
 {
 	osm_node_t *node = (osm_node_t *) p_map_item;
 	uint64_t node_guid = cl_ntoh64(node->node_info.node_guid);
 	osm_perfmgr_t *pm = (osm_perfmgr_t *) context;
-	__monitored_node_t *mon_node = NULL;
+	monitored_node_t *mon_node = NULL;
 	uint32_t num_ports;
 
 	OSM_LOG_ENTER(pm->log);
@@ -462,7 +462,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context)
 	ib_api_status_t status = IB_SUCCESS;
 	osm_perfmgr_t *pm = context;
 	osm_node_t *node = NULL;
-	__monitored_node_t *mon_node = (__monitored_node_t *) p_map_item;
+	monitored_node_t *mon_node = (monitored_node_t *) p_map_item;
 	osm_madw_context_t mad_context;
 	uint64_t node_guid = 0;
 	ib_net32_t remote_qp;
@@ -477,7 +477,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context)
 			"ERR 4C07: Node \"%s\" (guid 0x%" PRIx64
 			") no longer exists so removing from PerfMgr monitoring\n",
 			mon_node->name, mon_node->guid);
-		__mark_for_removal(pm, mon_node);
+		mark_for_removal(pm, mon_node);
 		goto Exit;
 	}
 
@@ -779,7 +779,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 	 */
 	OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n");
 	cl_plock_acquire(pm->lock);
-	cl_qmap_apply_func(&pm->subn->node_guid_tbl, __collect_guids, pm);
+	cl_qmap_apply_func(&pm->subn->node_guid_tbl, collect_guids, pm);
 	cl_plock_release(pm->lock);
 
 	/* then for each node query their counters */
@@ -788,7 +788,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 	/* Clean out any nodes found to be removed during the
 	 * sweep
 	 */
-	__remove_marked_nodes(pm);
+	remove_marked_nodes(pm);
 
 #if ENABLE_OSM_PERF_MGR_PROFILE
 	/* spin on outstanding queries */
@@ -854,7 +854,7 @@ void osm_perfmgr_destroy(osm_perfmgr_t * pm)
  * will be missed.
  **********************************************************************/
 static void perfmgr_check_oob_clear(osm_perfmgr_t * pm,
-				    __monitored_node_t * mon_node, uint8_t port,
+				    monitored_node_t * mon_node, uint8_t port,
 				    perfmgr_db_err_reading_t * cr,
 				    perfmgr_db_data_cnt_reading_t * dc)
 {
@@ -938,7 +938,7 @@ static int counter_overflow_32(ib_net32_t val)
  * MAD to the port.
  **********************************************************************/
 static void perfmgr_check_overflow(osm_perfmgr_t * pm,
-				   __monitored_node_t * mon_node, uint8_t port,
+				   monitored_node_t * mon_node, uint8_t port,
 				   ib_port_counters_t * pc)
 {
 	osm_madw_context_t mad_context;
@@ -1009,7 +1009,7 @@ Exit:
  * Check values for logging of errors
  **********************************************************************/
 static void perfmgr_log_events(osm_perfmgr_t * pm,
-			       __monitored_node_t * mon_node, uint8_t port,
+			       monitored_node_t * mon_node, uint8_t port,
 			       perfmgr_db_err_reading_t * reading)
 {
 	perfmgr_db_err_reading_t prev_read;
@@ -1066,7 +1066,7 @@ static void pc_rcv_process(void *context, void *data)
 	perfmgr_db_err_reading_t err_reading;
 	perfmgr_db_data_cnt_reading_t data_reading;
 	cl_map_item_t *p_node;
-	__monitored_node_t *p_mon_node;
+	monitored_node_t *p_mon_node;
 
 	OSM_LOG_ENTER(pm->log);
 
@@ -1079,7 +1079,7 @@ static void pc_rcv_process(void *context, void *data)
 			PRIx64 " not found in monitored map\n", node_guid);
 		goto Exit;
 	}
-	p_mon_node = (__monitored_node_t *) p_node;
+	p_mon_node = (monitored_node_t *) p_node;
 
 	OSM_LOG(pm->log, OSM_LOG_VERBOSE,
 		"Processing received MAD status 0x%x context 0x%"
@@ -1233,7 +1233,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm,
 		goto Exit;
 	}
 
-	__init_monitored_nodes(pm);
+	init_monitored_nodes(pm);
 
 	cl_timer_start(&pm->sweep_timer, pm->sweep_time_s * 1000);
 

From weiny2 at llnl.gov  Fri May  1 16:53:34 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 1 May 2009 16:53:34 -0700
Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc
 including converting ibqueryerrors.pl
In-Reply-To: <20090501173806.GF14714@sk.iol.unh.edu>
References: <20090422185441.6f8601dc.weiny2@llnl.gov>
	<20090425175710.GI28604@sk>
	<20090427150409.9c10e479.weiny2@llnl.gov>
	<20090501173806.GF14714@sk.iol.unh.edu>
Message-ID: <20090501165334.59bf72a9.weiny2@llnl.gov>

On Fri, 1 May 2009 20:38:06 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 15:04 Mon 27 Apr     , Ira Weiny wrote:
> > 
> > The port output should be from low to high.
> 
> > What do you see?
> 
> Yes, the port order is good (I was wrong about it). But switch order is
> reversed - first discovered switch is printed last. Right?

Actually there is no specific order on the switch output at this point.  If
you choose the "-g" option ibnetdiscover will print differently based on
"chassis".

I did not attempt to preserve any switch or HCA order printing.  I don't know
of any utils which require this.  Am I wrong?

Ira


From andy.grover at oracle.com  Fri May  1 17:41:49 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Fri, 01 May 2009 17:41:49 -0700
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>	<264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
	<5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>
Message-ID: <49FB96CD.6090100@oracle.com>

Todd Rimmer wrote:
> It goes beyond just a tutorial.

> In talking to customers, the consensus is that many application
> programmers struggle with sockets, RDMA is an order of magnitude
> beyond that.  It's not a cut on programmers, there are some very
> strong ones in the enterprise, but a fair percentage only have
> associate degrees or technical school training.  Even the extremely
> smart ones have 100 things to juggle (and often must write code such
> that entry level programmers can support it), so the risk/reward or
> ROI of learning RDMA has to be there.  The higher the learning cost
> the more difficult to justify the effort.

Totally agree.

Someone emailed me off-list and mentioned he had proposed an RDMA/IB
book to a few publishers and been turned down. (!?!) Don't know if that
would still be the case but it means there's a lot of work to do
increasing the technology's mindshare and perceived relevance to a lot
of developers, and the OFA and its members need to get the ball rolling
before we can expect the "... for Dummies" people to want to write a
book about RDMA :-)

> simplified APIs and easy
> migration of applications

> accessibility in scripting languages and other languages

Both of these would be great, and I think they go together -- a C# RDMA API
is going to be more accessible to a C# programmer, first just because
it's in the right language, but also because it can handle many boilerplate
sections of code on behalf of the user, presenting a simpler API than the C API.

> - good simple examples of how to do it, sample programs etc

Yes I would think this could be in the Tutorial, or a Cookbook section?

> - connection establishment is still difficult in OFED.  Also many
> apps are shortcutting the process by avoiding SA queries (hence
> impacting the ability of the applications to work properly with QOS,
> LMC, complex fabrics (torus, etc), Partitioning, etc). - either the
> Base API needs to improve or "helper libraries" are needed on top of
> it.

Could go in the language wrapper libs. A helper lib for C API itself
also might be nice, yes.

> - effective tools to debug applications.

True!

Regards -- Andy


From Jie.Cai at cs.anu.edu.au  Fri May  1 23:36:29 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Sat, 02 May 2009 16:36:29 +1000
Subject: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com>
References: <49D2BD00.5010002@cs.anu.edu.au>	
	<469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com>	
	<517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com>	
	<49D30C7F.1050201@cs.anu.edu.au>	
	<469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com>	
	<49FA7C21.1050400@cs.anu.edu.au>	
	<517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com>
	<469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com>
Message-ID: <49FBE9ED.8090308@cs.anu.edu.au>

Yes, the variable "target" has been declared volatile. However, it is a
pointer that points to "char *rbuf" via a type cast, where rbuf has been
allocated with malloc. Could this cause trouble?

I tried gcc with no optimization, and with -O2 and -O3 as well, but the
program still loops forever.

I still haven't figured out where the problem is. Do you have any other
comments?

Regards,

-- 
Jie Cai


Caitlin Bestler wrote:
> On Fri, May 1, 2009 at 4:24 AM, arkady kanevsky
> <arkady.kanevsky at gmail.com> wrote:
>   
>> Jie,
>> it sounds to me that either the variable is not volatile or compiler
>> optimization
>> causes some problem. I would check for these first.
>> Arkady
>>
>>     
>
> Agreed, it is definitely a caching issue.
>
> Atomics are InfiniBand specific, and there are some fairly complex
> rules that govern
> how much the HCA can do caching. The gotcha is that they basically provide some
> cache coherency guarantees within the context of a connection, but not
> much between
> connections or versus local applications.
>
> That said, it would be rare for HCA caching to be the cause of
> anything worse than
> some unexpected ordering. Adapters cache when they have to, but would
> really rather
> not allocate or track a lot of resources. Updating to real physical
> memory ASAP is much
> simpler.
>
> Compilers, on the other hand, *love* optimizing. The key thing to
> understand is that the
> HCA is another processor, one that is at least as distant as any other
> CPU core. Any
> and all techniques used when sharing memory with another processor apply.
>
> Completions hide all that from the application, just promising that
> specific things are
> coherent when the user invokes the verbs to reap a completion. So
> whenever you do
> without completions you are dealing with an arbitrary multi-processor
> memory coherence
> problem.
>   
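
For reference, a minimal sketch of the polling pattern discussed in this
thread. It assumes the receive buffer comes from malloc and is read through
a pointer to volatile data, i.e. the qualifier sits on the pointed-to type
and not on the pointer variable itself. The name poll_flag() and the bounded
spin count are made up for illustration and are not part of uDAPL. Note that
volatile only stops the compiler from caching the read in a register; it
gives no ordering or coherence guarantee with respect to the HCA, which is
what reaping a completion provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: re-read buf[idx] until it equals `expected` or
 * `max_spins` reads have been made. Returns 1 if the value was seen,
 * 0 on timeout. Because buf points to volatile data, every iteration
 * performs a real load instead of reusing a cached value.
 */
static int poll_flag(volatile unsigned char *buf, size_t idx,
		     unsigned char expected, unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if (buf[idx] == expected)
			return 1;
	}
	return 0;
}
```

A caller would cast once, e.g. poll_flag((volatile unsigned char *)rbuf,
len - 1, 1, limit). Declaring "char * volatile target" instead would make
only the pointer volatile, leaving the compiler free to hoist the load of
the data out of the loop.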


From vlad at lists.openfabrics.org  Sat May  2 03:22:09 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  2 May 2009 03:22:09 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090502-0200 daily build status
Message-ID: <20090502102209.DFC8AE613C2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From bart.vanassche at gmail.com  Sat May  2 04:46:24 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sat, 2 May 2009 13:46:24 +0200
Subject: [ofa-general] OFED, the backported <linux/scatterlist.h> header and
	sg_init_table()
Message-ID: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>

Hello,

Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
looking at the backported kernel headers. I found the following in the
header file
/usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:

#define sg_init_table(a, b)

Or: sg_init_table() is defined to do nothing. I was expecting the following
however:

#define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);

The sg_init_table() function is implemented in e.g. 2.6.29 as follows:

void sg_init_table(struct scatterlist *sgl, unsigned int nents)
{
        memset(sgl, 0, sizeof(*sgl) * nents);
#ifdef CONFIG_DEBUG_SG
        {
                unsigned int i;
                for (i = 0; i < nents; i++)
                        sgl[i].sg_magic = SG_MAGIC;
        }
#endif
        sg_mark_end(&sgl[nents - 1]);
}

Does anyone know why sg_init_table() is defined such that it does nothing in
the backported OFED headers ?

Bart.
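
A user-space sketch of the zeroing variant, for illustration only.
struct fake_sg and fake_sg_init_table() are hypothetical stand-ins, not the
kernel's struct scatterlist or the OFED backport, and unlike the 2.6.29
function the macro does not call sg_mark_end(). Compared with the one-line
macro quoted above, the arguments are parenthesised and there is no trailing
semicolon, so the macro expands safely for expressions such as "n + 1" and
can be used as a single statement inside an if/else.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for struct scatterlist; the real kernel layout differs. */
struct fake_sg {
	unsigned long page_link;
	unsigned int offset;
	unsigned int length;
};

/* Zeroing variant: parenthesised arguments, no trailing semicolon. */
#define fake_sg_init_table(sgl, nents) \
	memset((sgl), 0, sizeof(*(sgl)) * (nents))

/* Return 1 if every byte of the first `nents` entries is zero. */
static int table_is_zeroed(const struct fake_sg *sgl, unsigned int nents)
{
	const unsigned char *p = (const unsigned char *)sgl;
	size_t i;

	for (i = 0; i < sizeof(*sgl) * nents; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```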

From vlad at lists.openfabrics.org  Sun May  3 03:21:42 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  3 May 2009 03:21:42 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090503-0200 daily build status
Message-ID: <20090503102142.B2566E611FE@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at dev.mellanox.co.il  Sun May  3 03:48:32 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 03 May 2009 13:48:32 +0300
Subject: [ofa-general] Build failures on current 1.4.1 dailies
In-Reply-To: <ab66a8180904282237k19366778j21359a61112e4399@mail.gmail.com>
References: <ab66a8180904282237k19366778j21359a61112e4399@mail.gmail.com>
Message-ID: <49FD7680.1060508@mellanox.co.il>

Jon/Steve
I see the issue is with nfs - please look at this

Thanks
Tziporet

Gennadiy Nerubayev wrote:
> Hi all,
>
> Running on 2.6.27.21 x64. ofa_kernel build error as follows:
>
> -I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \
>  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing 
> -fno-common -Werror-implicit-function-declaration -Os -m64 
> -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time
> -maccumulate-outgoing-args 
> -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe 
> -Wno-sign-compare -fno-asynchronous-unwind-tables
> -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default 
> -fno-stack-protector -fomit-frame-pointer -g 
> -Wdeclaration-after-statement -Wno-pointer-sign
>  -fwrapv -DMODULE -D"KBUILD_STR(s)=#s" 
> -D"KBUILD_BASENAME=KBUILD_STR(file)"  
> -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o 
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/.tmp_file.o 
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function 
> 'nfs_write_begin':
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error: 
> implicit declaration of function '__grab_cache_page'
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: 
> warning: assignment makes pointer from integer without a cast
> make[3]: *** 
> [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1
> make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2
> make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2
> make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21'
> make: *** [kernel] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build)
>
> Assuming we turn off nfs stuff to go further, error number two is from 
> infiniband-diags:
> <snip>
> checking whether to build shared libraries... yes
> checking whether to build static libraries... yes
> checking for sys_read_string in -libcommon... yes
> checking for umad_init in -libumad... yes
> checking for mad_dump_int in -libmad... no
> configure: error: mad_dump_int() not found. diags require libibmad.
> error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build)
>
> I confirmed that pulling management git and compiling libs and diags 
> from there does not have this issue, and that the libibmad.so.1 that 
> gets compiled in the daily OFED does not have mad_dump_int().
>
>


From jackm at dev.mellanox.co.il  Sun May  3 05:15:44 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 3 May 2009 15:15:44 +0300
Subject: [ofa-general] OFED,
	the backported <linux/scatterlist.h> header and sg_init_table()
In-Reply-To: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
Message-ID: <200905031515.45206.jackm@dev.mellanox.co.il>

On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Does anyone know why sg_init_table() is defined such that it does nothing in
> the backported OFED headers ?
> 
My mistake while doing backports.
Will be fixed in rc5.

- Jack


From jackm at dev.mellanox.co.il  Sun May  3 06:04:05 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 3 May 2009 16:04:05 +0300
Subject: [ofa-general] OFED,
	the backported <linux/scatterlist.h> header and sg_init_table()
In-Reply-To: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
Message-ID: <200905031604.05907.jackm@dev.mellanox.co.il>

On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Hello,
> 
> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
> looking at the backported kernel headers. I found the following in the
> header file
> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:
> 
> #define sg_init_table(a, b)
> 
> Or: sg_init_table() is defined to do nothing. I was expecting the following
> however:
> 
> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);
> 
> The sg_init_table() function is implemented in e.g. 2.6.29 as follows:
> 
> void sg_init_table(struct scatterlist *sgl, unsigned int nents)
> {
>         memset(sgl, 0, sizeof(*sgl) * nents);
> #ifdef CONFIG_DEBUG_SG
>         {
>                 unsigned int i;
>                 for (i = 0; i < nents; i++)
>                         sgl[i].sg_magic = SG_MAGIC;
>         }
> #endif
>         sg_mark_end(&sgl[nents - 1]);
> }
> 
> Does anyone know why sg_init_table() is defined such that it does nothing in
> the backported OFED headers ?
> 
> Bart.

I checked this more carefully.
Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit
45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c)

Before this, no initialization was done on the sg page_list, and we had no
problems.  When doing the backport, then, I simply made this a NOP.
I'm not convinced that sg_init_table needs to be implemented in kernels earlier
than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and
the page list was not previously zeroed out before usage.

What do you think?

- Jack


From bart.vanassche at gmail.com  Sun May  3 08:36:53 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sun, 3 May 2009 17:36:53 +0200
Subject: [ofa-general] OFED, the backported <linux/scatterlist.h> header 
	and sg_init_table()
In-Reply-To: <200905031604.05907.jackm@dev.mellanox.co.il>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<200905031604.05907.jackm@dev.mellanox.co.il>
Message-ID: <e2e108260905030836r108c8578rfb2ea5ffad6df50e@mail.gmail.com>

On Sun, May 3, 2009 at 3:04 PM, Jack Morgenstein
<jackm at dev.mellanox.co.il> wrote:
> On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
>> Hello,
>>
>> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
>> looking at the backported kernel headers. I found the following in the
>> header file
>> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:
>>
>> #define sg_init_table(a, b)
>>
>> Or: sg_init_table() is defined to do nothing. I was expecting the following
>> however:
>>
>> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);
>>
>> The sg_init_table() function is implemented in e.g. 2.6.29 as follows:
>>
>> void sg_init_table(struct scatterlist *sgl, unsigned int nents)
>> {
>>         memset(sgl, 0, sizeof(*sgl) * nents);
>> #ifdef CONFIG_DEBUG_SG
>>         {
>>                 unsigned int i;
>>                 for (i = 0; i < nents; i++)
>>                         sgl[i].sg_magic = SG_MAGIC;
>>         }
>> #endif
>>         sg_mark_end(&sgl[nents - 1]);
>> }
>>
>> Does anyone know why sg_init_table() is defined such that it does nothing in
>> the backported OFED headers ?
>
> I checked this more carefully.
> Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit
> 45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c)
>
> Before this, no initialization was done on the sg page_list, and we had no
> problems.  When doing the backport, then, I simply made this a NOP.
> I'm not convinced that sg_init_table needs to be implemented in kernels earlier
> than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and
> the page list was not previously zeroed out before usage.
>
> What do you think?

My opinion is that it is really dangerous and confusing to have one
version of the sg_init_table() macro that performs initialization and
another version that does not. As an example, the OFED source file
net/sunrpc/xdr.c invokes sg_init_table(). When this code is compiled
against e.g. a 2.6.27 kernel, invoking sg_init_table() will
initialize the sg-list properly because in this case the
sg_init_table() included with the 2.6.27 kernel is used. When this
code is compiled against e.g. an RHEL 5.3 kernel, invoking the
sg_init_table() macro will have no effect because the sg_init_table()
macro from OFED's backported header files is used. Is this effect
really desired ?
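
The difference can be reproduced in userspace. What follows is only a
minimal sketch: struct sketch_sg, both sketch_sg_init_table variants and the
0x02 end-marker flag are hypothetical stand-ins, not the kernel's or OFED's
definitions. It writes the initializer as a static inline function, which
also avoids the trailing semicolon and the unparenthesized nents argument in
the memset macro proposed above.

```c
#include <string.h>

/* Hypothetical stand-in for struct scatterlist; not the kernel's layout. */
struct sketch_sg {
    unsigned long page_link;
    unsigned int offset;
    unsigned int length;
};

/* The backported definition quoted in the thread: expands to nothing. */
#define sketch_sg_init_table_nop(sgl, nents)

/* Function form modelled on the 2.6.29 code quoted in the thread. */
static inline void sketch_sg_init_table(struct sketch_sg *sgl,
                                        unsigned int nents)
{
    memset(sgl, 0, sizeof(*sgl) * nents);
    /* Models sg_mark_end(); the 0x02 flag is an assumption of this sketch. */
    sgl[nents - 1].page_link |= 0x02UL;
}

/* Returns 1 if the first entry is zeroed after "initialisation", else 0. */
int sketch_first_entry_zeroed(int use_nop)
{
    struct sketch_sg sgl[4];

    memset(sgl, 0xAA, sizeof(sgl));        /* simulate stale memory */
    if (use_nop) {
        sketch_sg_init_table_nop(sgl, 4);  /* compiles to an empty statement */
    } else {
        sketch_sg_init_table(sgl, 4);
    }
    return sgl[0].offset == 0 && sgl[0].length == 0;
}
```

With the no-op variant the caller sees whatever was in memory before, so
the same source line behaves differently depending on which header it was
compiled against.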

Bart.


From pashash at gmail.com  Sun May  3 09:32:18 2009
From: pashash at gmail.com (Pavel Shamis (Pasha))
Date: Sun, 03 May 2009 19:32:18 +0300
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <loom.20090501T181654-398@post.gmane.org>
References: <C6E9B5EC-C922-4675-8469-3D7C5AB4C9BE@cisco.com>	<loom.20090430T052134-692@post.gmane.org>	<48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com>	<3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>	<49FB2309.5090702@dev.mellanox.co.il>
	<loom.20090501T181654-398@post.gmane.org>
Message-ID: <49FDC712.1010208@dev.mellanox.co.il>


> - Verbs works well for a number of applications (Roland and I have each written 
> multiple, for example).
>   
BTW, I don't know of many user-level native IB applications.
Verbs works perfectly for kernel-level ULPs, which hide all the
complexity from user level.

> - IMHO, there is a problem with your API that should be fixed (the messaging 
> layer needs to manage network buf allocation).  If you required mpi_alloc_mem, 
> you would get rid of a whole layer of complicated crap.  It may not be feasible 
> for you, but it is the right thing to do from an engineering perspective, 
> right?
>   
From an engineering point of view, maybe it is correct. From a business
point of view, we have applications that were written and defined before
OFA. Those applications worked well and continue to work well today with
other modern interconnects. MPI people want to use IB and are asking OFA
people to help resolve a problem that simply cannot be resolved at the
user/MPI level.

> - "MPI" doesn't want to fix the problem, but instead is asking other people to 
> make kernel changes for them and saying things like "verbs is broken".
>   
Maybe the best way is to change the spec and push people to use
mpi_alloc_mem, but that is a long-term solution.
We want to allow people to run their applications now.
> I totally see your guys' problem and feel for you.  Either way it comes down to 
> politics; getting some MPI-specific code into the Linux Kernel (fun?), or 
> getting MPI users to have to change crusty old scientific code (very fun?).
>   
BTW, the registration cache code may be useful not only for the MPI model. I
definitely see other HPC models where I would like to have a kernel-level
registration cache.

It will be very difficult to push users to change their code, especially
when other interconnects do not require any code changes from them.
>
> Final point I want to make is that this is open source, so you can always try 
> submitting some elite patches and get the changes you need.
>   
Before anybody puts any human resources into this project, it would be
good to know whether the concept of this solution will be accepted by the
OFA community; that is why we are discussing it here.

Thanks
Pasha


From nicolas.morey-chaisemartin at ext.bull.net  Mon May  4 02:02:12 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 04 May 2009 11:02:12 +0200
Subject: [ofa-general] [PATCH] ibutils - Fix cleanup phase
Message-ID: <49FEAF14.1080606@ext.bull.net>

Move deletion of RPM_BUILD_ROOT before RPM_BUILD_DIR to avoid
'rm: cannot get current directory: No such file or directory' errors during
cleanup phase (showed up on old IA64 RHEL).

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>
Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
 ibutils.spec.in |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ibutils.spec.in b/ibutils.spec.in
index 628230a..1770eb1 100644
--- a/ibutils.spec.in
+++ b/ibutils.spec.in
@@ -81,8 +81,8 @@ esac
 
 %clean
 #Remove installed driver after rpm build finished
-rm -rf $RPM_BUILD_DIR/%{name}-%{version}
 rm -rf $RPM_BUILD_ROOT
+rm -rf $RPM_BUILD_DIR/%{name}-%{version}
 
 %post
 /sbin/ldconfig
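
The error message quoted in the description is what a process sees when its
working directory has been removed underneath it. A minimal userspace
sketch follows, assuming (this is not stated in the patch) that the %clean
script is still running inside $RPM_BUILD_DIR/%{name}-%{version} when the
second rm starts, and that the rm on the affected system calls getcwd().
The function name and the /tmp template are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Creates a scratch directory, enters it, removes it, then asks for the
 * current directory. Returns 0 if getcwd() still succeeded, the errno it
 * reported if it failed, or -1 if the setup itself went wrong.
 */
int sketch_getcwd_errno_after_rmdir(void)
{
    char tmpl[] = "/tmp/cwd-sketch-XXXXXX";
    char buf[PATH_MAX];
    int saved_fd, result;

    saved_fd = open(".", O_RDONLY);      /* remember where we started */
    if (saved_fd < 0)
        return -1;
    if (mkdtemp(tmpl) == NULL) {
        close(saved_fd);
        return -1;
    }
    if (chdir(tmpl) != 0) {
        rmdir(tmpl);
        close(saved_fd);
        return -1;
    }
    if (rmdir(tmpl) != 0) {              /* delete the directory we are in */
        result = -1;
    } else {
        errno = 0;
        result = getcwd(buf, sizeof(buf)) ? 0 : errno;
    }
    if (fchdir(saved_fd) != 0)           /* go back to a directory that exists */
        result = -1;
    close(saved_fd);
    return result;
}
```

On Linux the call fails with ENOENT, which matches the "No such file or
directory" text in the error. Deleting $RPM_BUILD_ROOT first means no
command has to start from a directory that is already gone.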


From nicolas.morey-chaisemartin at ext.bull.net  Mon May  4 02:03:21 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 04 May 2009 11:03:21 +0200
Subject: [ofa-general] [PATCH] Fixed dependcies of ibdmsh on libibdmcom.la
Message-ID: <49FEAF59.6090502@ext.bull.net>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
Repost as I sent it to sasha and not yevgeny !
 ibdm/ibdm/Makefile.am |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am
index 1c57b3b..ba5789a 100644
--- a/ibdm/ibdm/Makefile.am
+++ b/ibdm/ibdm/Makefile.am
@@ -88,6 +88,7 @@ bin_PROGRAMS  = ibdmsh
 ibdmsh_SOURCES = ibdmsh_wrap.cpp
 ibdmsh_LDADD =  -libdmcom $(TCL_LIBS)
 ibdmsh_LDFLAGS = -static -Wl,-rpath -Wl,$(TCL_PREFIX)/lib
+ibdmsh_DEPENDENCIES=$(lib_LTLIBRARIES)
 
 $(srcdir)/Fabric.cpp: $(srcdir)/git_version.h
 
-- 
1.6.2-rc2.GIT


From kliteyn at dev.mellanox.co.il  Mon May  4 02:22:49 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 04 May 2009 12:22:49 +0300
Subject: [ofa-general] Re: [PATCH] ibutils - Fix cleanup phase
In-Reply-To: <49FEAF14.1080606@ext.bull.net>
References: <49FEAF14.1080606@ext.bull.net>
Message-ID: <49FEB3E9.10607@dev.mellanox.co.il>

Nicolas Morey-Chaisemartin wrote:
> Move deletion of RPM_BUILD_ROOT before RPM_BUILD_DIR to avoid
> 'rm: cannot get current directory: No such file or directory' errors during
> cleanup phase (showed up on old IA64 RHEL).
> 
> Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>

Thanks, applied.

-- Yevgeny


From kliteyn at dev.mellanox.co.il  Mon May  4 02:25:34 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 04 May 2009 12:25:34 +0300
Subject: [ofa-general] Re: [PATCH] Fixed dependcies of ibdmsh on
	libibdmcom.la
In-Reply-To: <49FEAF59.6090502@ext.bull.net>
References: <49FEAF59.6090502@ext.bull.net>
Message-ID: <49FEB48E.2090109@dev.mellanox.co.il>

Nicolas Morey-Chaisemartin wrote:
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
> ---
> Repost as I sent it to sasha and not yevgeny !
>  ibdm/ibdm/Makefile.am |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am
> index 1c57b3b..ba5789a 100644
> --- a/ibdm/ibdm/Makefile.am
> +++ b/ibdm/ibdm/Makefile.am
> @@ -88,6 +88,7 @@ bin_PROGRAMS  = ibdmsh
>  ibdmsh_SOURCES = ibdmsh_wrap.cpp
>  ibdmsh_LDADD =  -libdmcom $(TCL_LIBS)
>  ibdmsh_LDFLAGS = -static -Wl,-rpath -Wl,$(TCL_PREFIX)/lib
> +ibdmsh_DEPENDENCIES=$(lib_LTLIBRARIES)
>  
>  $(srcdir)/Fabric.cpp: $(srcdir)/git_version.h
>  

Thanks, applied.
Guess it should take care of bugzilla issue 1539
(https://bugs.openfabrics.org/show_bug.cgi?id=1539)

-- Yevgeny


From vlad at lists.openfabrics.org  Mon May  4 03:25:55 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon,  4 May 2009 03:25:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090504-0200 daily build status
Message-ID: <20090504102555.82A30E61401@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From nicolas.morey-chaisemartin at ext.bull.net  Mon May  4 03:58:09 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 04 May 2009 12:58:09 +0200
Subject: [ofa-general] [PATCH] infiniband-diags: Added libibnetdiscover to
	.spec file
Message-ID: <49FECA41.7060200@ext.bull.net>


Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
 infiniband-diags/infiniband-diags.spec.in |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in
index 4bbd907..07c46c9 100644
--- a/infiniband-diags/infiniband-diags.spec.in
+++ b/infiniband-diags/infiniband-diags.spec.in
@@ -51,9 +51,13 @@ rm -rf $RPM_BUILD_ROOT
 %{_sbindir}/check_lft_balance.pl
 %{_sbindir}/set_nodedesc.sh
 %{_sbindir}/sm*
+%{_libdir}/*.a
+%{_libdir}/*.so*
+%{_includedir}/infiniband/*.h
 %define _perldir %(perl -e 'use Config; $T=$Config{installsitearch}; $T=~/(.*)\\/site_perl.*/; print $1;')
 %{_perldir}/*
 %{_mandir}/man8/*
+%{_mandir}/man3/*
 %doc README COPYING ChangeLog
 
 %changelog
-- 
1.6.2-rc2.GIT


From amirv.mellanox at gmail.com  Mon May  4 05:52:32 2009
From: amirv.mellanox at gmail.com (Amir Mellanox)
Date: Mon, 4 May 2009 15:52:32 +0300
Subject: [ofa-general] [PATCHv2] sdp: Fixed SDP to work on 2.6.29+
In-Reply-To: <49F862C8.6030102@ext.bull.net>
References: <49F862C8.6030102@ext.bull.net>
Message-ID: <18e64aac0905040552x7d4ebc16oe69383186afd073e@mail.gmail.com>

Thanks,

I committed the fix to the ofed-1.5 tree

- Amir

On Wed, Apr 29, 2009 at 5:23 PM, Nicolas Morey-Chaisemartin <
nicolas.morey-chaisemartin at ext.bull.net> wrote:

> orphan_count and sockets_allocated have been changed from atomic_t to
> percpu_counter.
> As percpu_counters are huge, they cannot be allocated on the stack without
> causing the sdp module to crash.
> Both variables are now dynamically allocated at module init.
>
> Signed-off-by: Nicolas Morey-Chaisemartin <
> nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  drivers/infiniband/ulp/sdp/sdp_main.c |   29 +++++++++++++++++++----------
>  1 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c
> b/drivers/infiniband/ulp/sdp/sdp_main.c
> index 51801e0..7a38c47 100644
> --- a/drivers/infiniband/ulp/sdp/sdp_main.c
> +++ b/drivers/infiniband/ulp/sdp/sdp_main.c
> @@ -580,7 +580,7 @@ adjudge_to_death:
>                /* TODO: tcp_fin_time to get timeout */
>                sdp_dbg(sk, "%s: entering time wait refcnt %d\n", __func__,
>                        atomic_read(&sk->sk_refcnt));
> -               atomic_inc(sk->sk_prot->orphan_count);
> +               percpu_counter_inc(sk->sk_prot->orphan_count);
>        }
>
>        /* TODO: limit number of orphaned sockets.
> @@ -861,7 +861,7 @@ void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk)
>                sock_put(&ssk->isk.sk, SOCK_REF_DREQ_TO);
>        }
>
> -       atomic_dec(ssk->isk.sk.sk_prot->orphan_count);
> +       percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count);
>  }
>
>  void sdp_destroy_work(struct work_struct *work)
> @@ -902,7 +902,7 @@ void sdp_dreq_wait_timeout_work(struct work_struct
> *work)
>        sdp_sk(sk)->dreq_wait_timeout = 0;
>
>        if (sk->sk_state == TCP_FIN_WAIT1)
> -               atomic_dec(ssk->isk.sk.sk_prot->orphan_count);
> +               percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count);
>
>        sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT);
>
> @@ -2162,9 +2162,9 @@ void sdp_urg(struct sdp_sock *ssk, struct sk_buff
> *skb)
>                sk->sk_data_ready(sk, 0);
>  }
>
> -static atomic_t sockets_allocated;
> +static struct percpu_counter *sockets_allocated;
>  static atomic_t memory_allocated;
> -static atomic_t orphan_count;
> +static struct percpu_counter *orphan_count;
>  static int memory_pressure;
>  struct proto sdp_proto = {
>         .close       = sdp_close,
> @@ -2182,10 +2182,8 @@ struct proto sdp_proto = {
>         .get_port    = sdp_get_port,
>        /* Wish we had this: .listen   = sdp_listen */
>        .enter_memory_pressure = sdp_enter_memory_pressure,
> -       .sockets_allocated = &sockets_allocated,
>        .memory_allocated = &memory_allocated,
>        .memory_pressure = &memory_pressure,
> -       .orphan_count = &orphan_count,
>         .sysctl_mem             = sysctl_tcp_mem,
>         .sysctl_wmem            = sysctl_tcp_wmem,
>         .sysctl_rmem            = sysctl_tcp_rmem,
> @@ -2540,6 +2538,15 @@ static int __init sdp_init(void)
>        spin_lock_init(&sock_list_lock);
>        spin_lock_init(&sdp_large_sockets_lock);
>
> +       sockets_allocated = kmalloc(sizeof(*sockets_allocated),
> GFP_KERNEL);
> +       orphan_count = kmalloc(sizeof(*orphan_count), GFP_KERNEL);
> +       percpu_counter_init(sockets_allocated, 0);
> +       percpu_counter_init(orphan_count, 0);
> +
> +       sdp_proto.sockets_allocated = sockets_allocated;
> +       sdp_proto.orphan_count = orphan_count;
> +
> +
>        sdp_workqueue = create_singlethread_workqueue("sdp");
>        if (!sdp_workqueue) {
>                return -ENOMEM;
> @@ -2574,9 +2581,9 @@ static void __exit sdp_exit(void)
>        sock_unregister(PF_INET_SDP);
>        proto_unregister(&sdp_proto);
>
> -       if (atomic_read(&orphan_count))
> -               printk(KERN_WARNING "%s: orphan_count %d\n", __func__,
> -                      atomic_read(&orphan_count));
> +       if (percpu_counter_read_positive(orphan_count))
> +               printk(KERN_WARNING "%s: orphan_count %lld\n", __func__,
> +                      percpu_counter_read_positive(orphan_count));
>        destroy_workqueue(sdp_workqueue);
>        flush_scheduled_work();
>
> @@ -2589,6 +2596,8 @@ static void __exit sdp_exit(void)
>        sdp_proc_unregister();
>
>        ib_unregister_client(&sdp_client);
> +       kfree(orphan_count);
> +       kfree(sockets_allocated);
>  }
>
>  module_init(sdp_init);
> --
> 1.6.2.GIT
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
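
The quoted patch does not check the kmalloc() results, and the existing
early return when the workqueue cannot be created would leave both
allocations behind. A minimal userspace sketch of the usual goto-unwind
shape is below. Everything prefixed sketch_ is hypothetical, including the
fault-injection counter; it models only the allocation and cleanup order,
not the kernel's percpu_counter API.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct percpu_counter. */
struct sketch_counter {
    long long count;
};

static int sketch_fail_step;   /* fault injection: fail the Nth allocation */
static int sketch_step;
static int sketch_live;        /* live allocations, to check for leaks */

static void *sketch_alloc(size_t size)
{
    void *p;

    if (++sketch_step == sketch_fail_step)
        return NULL;           /* simulated allocation failure */
    p = calloc(1, size);
    if (p)
        sketch_live++;
    return p;
}

static void sketch_free(void *p)
{
    if (p) {
        sketch_live--;
        free(p);
    }
}

static struct sketch_counter *sketch_sockets_allocated;
static struct sketch_counter *sketch_orphan_count;

/* Allocates both counters; on failure, releases whatever was acquired. */
int sketch_module_init(int fail_step)
{
    sketch_step = 0;
    sketch_fail_step = fail_step;

    sketch_sockets_allocated = sketch_alloc(sizeof(*sketch_sockets_allocated));
    if (!sketch_sockets_allocated)
        goto err;
    sketch_orphan_count = sketch_alloc(sizeof(*sketch_orphan_count));
    if (!sketch_orphan_count)
        goto err_free_sockets;
    return 0;

err_free_sockets:
    sketch_free(sketch_sockets_allocated);
    sketch_sockets_allocated = NULL;
err:
    return -ENOMEM;
}

/* Releases the counters in reverse order of allocation. */
void sketch_module_exit(void)
{
    sketch_free(sketch_orphan_count);
    sketch_orphan_count = NULL;
    sketch_free(sketch_sockets_allocated);
    sketch_sockets_allocated = NULL;
}

int sketch_live_allocations(void)
{
    return sketch_live;
}
```

In the real module the same labels would presumably also have to cover the
workqueue failure path and the return value of percpu_counter_init().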

From michael.heinz at qlogic.com  Mon May  4 06:19:04 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Mon, 4 May 2009 08:19:04 -0500
Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org>
	<f0e08f230812181133o4d4d89d6g69b83a9ee0a9c9ee@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org>
	<f0e08f230812181215t69d74222na072c8b52ab68270@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org>
	<f0e08f230812181231n397d9666s7342508fb41601b@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org>
	<f0e08f230812181302m266c20d7o38f0413ef7784f35@mail.gmail.com> 
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>

Hey, all -

I submitted this patch back in December; there's some question on my end about whether or not it was accepted for the next release of OFED.

Can anyone set me straight?

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Mike Heinz 
Sent: Thursday, December 18, 2008 4:05 PM
To: 'Hal Rosenstock'
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] Patch for libvendor incompatibility with QLogic SM

No problem. I figured it had to be something like that. 


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: Thursday, December 18, 2008 4:02 PM
To: Mike Heinz
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] Patch for libvendor incompatibility with QLogic SM

Mike,

On Thu, Dec 18, 2008 at 3:49 PM, Mike Heinz <michael.heinz at qlogic.com> wrote:
> Hal,
>
> You've got me really confused now - there are only two cases that need changing, OSMV_QUERY_PATH_REC_BY_GIDS and OSMV_QUERY_PATH_REC_BY_PORT_GUIDS;  OSMV_QUERY_PATH_REC_BY_LIDS does *not* need to be changed because it uses the GET method. Thus, this should be the correct patch. (I'm re-including it for clarity).

The below looks right to me. The previous one with osm_vendor_mlx_sa.c was truncated somehow in my gmail and appeared to only have 1 of the 2 cases and I didn't look at the attachment. Sorry for the confusion.

-- Hal

>
> Signed-off-by: Michael Heinz <mheinz at qlogic.com>
> --------------------------------
> --- osm_vendor_ibumad_sa.c.orig 2008-10-20 01:00:09.000000000 -0400
> +++ osm_vendor_ibumad_sa.c      2008-12-18 14:50:49.000000000 -0500
> @@ -615,7 +615,8 @@
>                sa_mad_data.attr_offset =
>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>                sa_mad_data.comp_mask =
> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
> +               path_rec.num_path = 0x7f;
>                sa_mad_data.p_attr = &path_rec;
>                ib_gid_set_default(&path_rec.dgid,
>                                   ((osmv_guid_pair_t *) (p_query_req-> 
> @@ -634,7 +635,8 @@
>                sa_mad_data.attr_offset =
>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>                sa_mad_data.comp_mask =
> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
> +               path_rec.num_path = 0x7f;
>                sa_mad_data.p_attr = &path_rec;
>                memcpy(&path_rec.dgid,
>                       &((osmv_gid_pair_t *) (p_query_req->p_query_input))->
> --- osm_vendor_mlx_sa.c.orig    2008-10-20 01:00:09.000000000 -0400
> +++ osm_vendor_mlx_sa.c 2008-12-18 14:51:34.000000000 -0500
> @@ -743,7 +743,8 @@
>                sa_mad_data.attr_offset =
>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>                sa_mad_data.comp_mask =
> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
> +               path_rec.num_path = 0x7f;
>                sa_mad_data.p_attr = &path_rec;
>                ib_gid_set_default(&path_rec.dgid,
>                                   ((osmv_guid_pair_t *) (p_query_req-> 
> @@ -763,7 +764,8 @@
>                sa_mad_data.attr_offset =
>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>                sa_mad_data.comp_mask =
> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
> +               path_rec.num_path = 0x7f;
>                sa_mad_data.p_attr = &path_rec;
>                memcpy(&path_rec.dgid,
>                       &((osmv_gid_pair_t *) 
> (p_query_req->p_query_input))->
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Thursday, December 18, 2008 3:32 PM
> To: Mike Heinz
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Patch for libvendor incompatibility with 
> QLogic SM
>
> On Thu, Dec 18, 2008 at 3:22 PM, Mike Heinz <michael.heinz at qlogic.com> wrote:
>>
>>> Right and it wouldn't need num_paths either (as get assumes 1) so I don't think the changes for OSMV_QUERY_PATH_REC_BY_LIDS in both these patches are needed.
>>
>> Sorry if I was unclear, the last patch submission neither sets the num_path field nor the attribute mask for OSMV_QUERY_PATH_REC_BY_LIDS queries.
>
> Right; I didn't see the updated patch was for both sa files. In the new patch, one case was missed in terms of the needed change though unless I missed that too...
>


From hal.rosenstock at gmail.com  Mon May  4 06:36:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 4 May 2009 09:36:54 -0400
Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org>
	<f0e08f230812181133o4d4d89d6g69b83a9ee0a9c9ee@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org>
	<f0e08f230812181215t69d74222na072c8b52ab68270@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org>
	<f0e08f230812181231n397d9666s7342508fb41601b@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org>
	<f0e08f230812181302m266c20d7o38f0413ef7784f35@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>
Message-ID: <f0e08f230905040636v47f3e2b2y344156f3bd6df550@mail.gmail.com>

On 5/4/09, Mike Heinz <michael.heinz at qlogic.com> wrote:
> Hey, all -
>
> I submitted this patch back in December; there's some question on my end
> about whether or not it was accepted for the next release of OFED.
>
> Can anyone set me straight?

It is commit fa905120f9971bf1601cc3fed4a7900fe9814892 on the master.

It depends on what you mean by next release of OFED as to whether it
will be there. If you mean OFED 1.4.1, then the answer appears to be
not currently. See opensm-3.2 branch.

-- Hal



From michael.heinz at qlogic.com  Mon May  4 06:37:51 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Mon, 4 May 2009 08:37:51 -0500
Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM
In-Reply-To: <f0e08f230905040636v47f3e2b2y344156f3bd6df550@mail.gmail.com>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org>
	<f0e08f230812181133o4d4d89d6g69b83a9ee0a9c9ee@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org>
	<f0e08f230812181215t69d74222na072c8b52ab68270@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org>
	<f0e08f230812181231n397d9666s7342508fb41601b@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org>
	<f0e08f230812181302m266c20d7o38f0413ef7784f35@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>
	<f0e08f230905040636v47f3e2b2y344156f3bd6df550@mail.gmail.com>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org>

Thanks for the quick response, Hal. Will that branch be folded into 1.5?

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: Monday, May 04, 2009 9:37 AM
To: Mike Heinz
Cc: general at lists.openfabrics.org; Bob Jaworski; Todd Rimmer
Subject: Re: [ofa-general] Patch for libvendor incompatibility with QLogic SM

On 5/4/09, Mike Heinz <michael.heinz at qlogic.com> wrote:
> Hey, all -
>
> I submitted this patch back in December; there's some question on my end
> about whether or not it was accepted for the next release of OFED.
>
> Can anyone set me straight?

It is commit fa905120f9971bf1601cc3fed4a7900fe9814892 on the master.

It depends on what you mean by next release of OFED as to whether it
will be there. If you mean OFED 1.4.1, then the answer appears to be
not currently. See opensm-3.2 branch.

-- Hal

> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> -----Original Message-----
> From: Mike Heinz
> Sent: Thursday, December 18, 2008 4:05 PM
> To: 'Hal Rosenstock'
> Cc: general at lists.openfabrics.org
> Subject: RE: [ofa-general] Patch for libvendor incompatibility with QLogic
> SM
>
> No problem. I figured it had to be something like that.
>
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Thursday, December 18, 2008 4:02 PM
> To: Mike Heinz
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Patch for libvendor incompatibility with QLogic
> SM
>
> Mike,
>
> On Thu, Dec 18, 2008 at 3:49 PM, Mike Heinz <michael.heinz at qlogic.com>
> wrote:
>> Hal,
>>
>> You've got me really confused now - there are only two cases that need
>> changing, OSMV_QUERY_PATH_REC_BY_GIDS and
>> OSMV_QUERY_PATH_REC_BY_PORT_GUIDS;  OSMV_QUERY_PATH_REC_BY_LIDS does *not*
>> need to be changed because it uses the GET method. Thus, this should be
>> the correct patch. (I'm re-including it for clarity).
>
> The below looks right to me. The previous one with osm_vendor_mlx_sa.c was
> truncated somehow in my gmail and appeared to only have 1 of the 2 cases and
> I didn't look at the attachment. Sorry for the confusion.
>
> -- Hal
>
>>
>> Signed-off-by: Michael Heinz <mheinz at qlogic.com>
>> --------------------------------
>> --- osm_vendor_ibumad_sa.c.orig 2008-10-20 01:00:09.000000000 -0400
>> +++ osm_vendor_ibumad_sa.c      2008-12-18 14:50:49.000000000 -0500
>> @@ -615,7 +615,8 @@
>>                sa_mad_data.attr_offset =
>>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>>                sa_mad_data.comp_mask =
>> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
>> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
>> +               path_rec.num_path = 0x7f;
>>                sa_mad_data.p_attr = &path_rec;
>>                ib_gid_set_default(&path_rec.dgid,
>>                                   ((osmv_guid_pair_t *) (p_query_req->
>> @@ -634,7 +635,8 @@
>>                sa_mad_data.attr_offset =
>>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>>                sa_mad_data.comp_mask =
>> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
>> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
>> +               path_rec.num_path = 0x7f;
>>                sa_mad_data.p_attr = &path_rec;
>>                memcpy(&path_rec.dgid,
>>                       &((osmv_gid_pair_t *) (p_query_req->p_query_input))->
>> --- osm_vendor_mlx_sa.c.orig    2008-10-20 01:00:09.000000000 -0400
>> +++ osm_vendor_mlx_sa.c 2008-12-18 14:51:34.000000000 -0500
>> @@ -743,7 +743,8 @@
>>                sa_mad_data.attr_offset =
>>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>>                sa_mad_data.comp_mask =
>> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
>> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
>> +               path_rec.num_path = 0x7f;
>>                sa_mad_data.p_attr = &path_rec;
>>                ib_gid_set_default(&path_rec.dgid,
>>                                   ((osmv_guid_pair_t *) (p_query_req->
>> @@ -763,7 +764,8 @@
>>                sa_mad_data.attr_offset =
>>                    ib_get_attr_offset(sizeof(ib_path_rec_t));
>>                sa_mad_data.comp_mask =
>> -                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID);
>> +                   (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH);
>> +               path_rec.num_path = 0x7f;
>>                sa_mad_data.p_attr = &path_rec;
>>                memcpy(&path_rec.dgid,
>>                       &((osmv_gid_pair_t *) (p_query_req->p_query_input))->
>>
>> --
>> Michael Heinz
>> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>> Sent: Thursday, December 18, 2008 3:32 PM
>> To: Mike Heinz
>> Cc: general at lists.openfabrics.org
>> Subject: Re: [ofa-general] Patch for libvendor incompatibility with
>> QLogic SM
>>
>> On Thu, Dec 18, 2008 at 3:22 PM, Mike Heinz <michael.heinz at qlogic.com>
>> wrote:
>>>
>>>> Right and it wouldn't need num_paths either (as get assumes 1) so I
>>>> don't think the changes for OSMV_QUERY_PATH_REC_BY_LIDS in both these
>>>> patches are needed.
>>>
>>> Sorry if I was unclear, the last patch submission neither sets the
>>> num_path field nor the attribute mask for OSMV_QUERY_PATH_REC_BY_LIDS
>>> queries.
>>
>> Right; I didn't see the updated patch was for both sa files. In the new
>> patch, one case was missed in terms of the needed change though unless I
>> missed that too...
>>
>
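
For readers following the patch quoted above: the change only touches the two
GetTable-style queries (by GIDs and by port GUIDs), which now ask for up to
0x7f paths and flag that field in the component mask, while the LID query is
left alone because it uses the GET method. Below is a minimal userspace sketch
of that decision. All demo_/DEMO_ names are hypothetical stand-ins and the bit
positions are assumptions; the real definitions are in OpenSM's headers (where
the mask is kept in network byte order), not here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins only; bit positions are assumed, not taken
 * from the thread. */
#define DEMO_PR_COMPMASK_DGID     (UINT64_C(1) << 2)
#define DEMO_PR_COMPMASK_SGID     (UINT64_C(1) << 3)
#define DEMO_PR_COMPMASK_DLID     (UINT64_C(1) << 4)
#define DEMO_PR_COMPMASK_SLID     (UINT64_C(1) << 5)
#define DEMO_PR_COMPMASK_NUMBPATH (UINT64_C(1) << 12)

enum demo_query_type {
	DEMO_QUERY_BY_PORT_GUIDS,
	DEMO_QUERY_BY_GIDS,
	DEMO_QUERY_BY_LIDS
};

/* Hypothetical, heavily reduced path record: only the field the patch sets. */
struct demo_path_rec {
	uint8_t num_path;
};

/* Fill in the record and return the component mask for one query type. */
static uint64_t demo_build_path_query(enum demo_query_type type,
				      struct demo_path_rec *rec)
{
	uint64_t mask = 0;

	assert(rec != NULL);
	memset(rec, 0, sizeof(*rec));

	switch (type) {
	case DEMO_QUERY_BY_PORT_GUIDS:
	case DEMO_QUERY_BY_GIDS:
		/* Table query: request up to 0x7f paths and say so in the mask. */
		mask = DEMO_PR_COMPMASK_DGID | DEMO_PR_COMPMASK_SGID |
		       DEMO_PR_COMPMASK_NUMBPATH;
		rec->num_path = 0x7f;
		break;
	case DEMO_QUERY_BY_LIDS:
		/* GET method returns a single record; num_path stays unset. */
		mask = DEMO_PR_COMPMASK_DLID | DEMO_PR_COMPMASK_SLID;
		break;
	}
	return mask;
}
```

The point of keeping the mask bit and the field assignment together is that an
SA which checks the component mask strictly sees a consistent request.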

From tziporet at mellanox.co.il  Mon May  4 06:39:12 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 4 May 2009 16:39:12 +0300
Subject: [ofa-general] EWG/OFED meeting agenda for today (May 4)
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>


This is the agenda for today's EWG/OFED meeting

1. OFED 1.4.1 status
RC4 was done on Thursday, but we still have some open bugs. 
We must decide which bugs are really critical for this release and
decide when we are doing RC5 (should be final release)

ID    Sev  OS    Assignee                      Summary
1607  blo  SLES  Jeffrey.C.Becker at nasa.gov  kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
1616  cri  RHEL  jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
1620  cri  Othe  jon at opengridcomputing.com  backport definition of struct hash_desc doesn't match the...
1571  cri  RHEL  vu at mellanox.com            nfsrdma server crash @test5 connectathon basic test,
1287  maj  RHEL  bugzilla at openib.org        IPoIB datagram mode initial packet loss - decided to hold now
1596  maj  Othe  Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
1621  maj  RHEL  vu at mellanox.com            RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ...  - not sure if this is a showstopper


2. OFED 1.5
a. Schedule: Since OFED 1.4.1 is delayed by more than a month, I think we
need to consider its influence on the 1.5 schedule.
BTW: If we delay the release we may want to change the kernel base to
2.6.31 too.

b. Status: We opened a git tree that is based on 2.6.30, and for now it
compiles only on 2.6.30. We need to start the backports.
Mellanox will be able to work on the backports only a few weeks from
now.
Is there another company that can start earlier?

3. MPI new memory API
   If Jeff S. joins, we can discuss the next steps

4. Open discussion


Tziporet


From hal.rosenstock at gmail.com  Mon May  4 06:42:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 4 May 2009 09:42:36 -0400
Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org>
	<f0e08f230812181215t69d74222na072c8b52ab68270@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org>
	<f0e08f230812181231n397d9666s7342508fb41601b@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org>
	<f0e08f230812181302m266c20d7o38f0413ef7784f35@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>
	<f0e08f230905040636v47f3e2b2y344156f3bd6df550@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org>
Message-ID: <f0e08f230905040642h6da19161ycebedb7d03f11c72@mail.gmail.com>

On Mon, May 4, 2009 at 9:37 AM, Mike Heinz <michael.heinz at qlogic.com> wrote:
> Thanks for the quick response, Hal. Will that branch be folded into 1.5?

I was saying the patch is _not_ on that branch.

I would expect OFED 1.5 to be based off the current master but this is
up to Sasha. The master is currently the 3.3 series whereas OFED 1.4
is the 3.2 series.

-- Hal



From jon at opengridcomputing.com  Mon May  4 07:56:42 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Mon, 4 May 2009 09:56:42 -0500
Subject: [ofa-general] OFED, the backported <linux/scatterlist.h>
	header and sg_init_table()
In-Reply-To: <e2e108260905030836r108c8578rfb2ea5ffad6df50e@mail.gmail.com>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<200905031604.05907.jackm@dev.mellanox.co.il>
	<e2e108260905030836r108c8578rfb2ea5ffad6df50e@mail.gmail.com>
Message-ID: <20090504145641.GA19565@opengridcomputing.com>

On Sun, May 03, 2009 at 05:36:53PM +0200, Bart Van Assche wrote:
> On Sun, May 3, 2009 at 3:04 PM, Jack Morgenstein
> <jackm at dev.mellanox.co.il> wrote:
> > On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> >> Hello,
> >>
> >> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
> >> looking at the backported kernel headers. I found the following in the
> >> header file
> >> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:
> >>
> >> #define sg_init_table(a, b)
> >>
> >> Or: sg_init_table() is defined to do nothing. I was expecting the following
> >> however:
> >>
> >> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);
> >>
> >> The sg_init_table() function is implemented in e.g. 2.6.29 as follows:
> >>
> >> void sg_init_table(struct scatterlist *sgl, unsigned int nents)
> >> {
> >>         memset(sgl, 0, sizeof(*sgl) * nents);
> >> #ifdef CONFIG_DEBUG_SG
> >>         {
> >>                 unsigned int i;
> >>                 for (i = 0; i < nents; i++)
> >>                         sgl[i].sg_magic = SG_MAGIC;
> >>         }
> >> #endif
> >>         sg_mark_end(&sgl[nents - 1]);
> >> }
> >>
> >> Does anyone know why sg_init_table() is defined such that it does nothing in
> >> the backported OFED headers ?
> >
> > I checked this more carefully.
> > Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit
> > 45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c)
> >
> > Before this, no initialization was done on the sg page_list, and we had no
> > problems.  When doing the backport, then, I simply made this a NOP.
> > I'm not convinced that sg_init_table needs to be implemented in kernels earlier
> > than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and
> > the page list was not previously zeroed out before usage.
> >
> > What do you think?
> 
> My opinion is that it is really dangerous and confusing to have one
> version of the sg_init_table() macro that performs initialization and
> another version that does not. As an example, the OFED source file
> net/sunrpc/xdr.c invokes sg_init_table(). When this code is compiled
> against e.g. a 2.6.27 kernel, invoking sg_init_table() will
> initialize the sg-list properly because in this case the
> sg_init_table() included with the 2.6.27 kernel is used. When this
> code is compiled against e.g. an RHEL 5.3 kernel, invoking the
> sg_init_table() macro will have no effect because the sg_init_table()
> macro from OFED's backported header files is used. Is this effect
> really desired ?

What's even worse is that sg_init_table is already defined in the
RHEL5.3 headers.  When coding up a header cleanup patch for RHEL5.3, I
noticed it was already defined in linux/ncrypto.h.  Also, it's there for
RHEL5.2 (and a few older kernels).

I should have the patch out today for review.

Thanks,
Jon


> 
> Bart.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
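
A side note on the one-line macro form quoted earlier in this thread: as
written it does not parenthesize nents and carries a trailing semicolon, so
passing an expression such as "n + 1" clears the wrong number of bytes. Below
is a minimal userspace sketch of the difference, assuming a stand-in struct in
place of the kernel's struct scatterlist. The demo_ names are hypothetical, and
the function form deliberately omits the kernel's sg_mark_end() and debug magic
handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for struct scatterlist (illustrative only). */
struct demo_scatterlist {
	unsigned long page_link;
	unsigned int offset;
	unsigned int length;
};

/* The macro form from the thread, under a demo name. With nents given as
 * "n + 1" the size expands to sizeof(*sgl) * n + 1, which is too small. */
#define demo_sg_init_macro(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);

/* Function form: the argument is evaluated once, before the multiply. */
static inline void demo_sg_init_table(struct demo_scatterlist *sgl,
				      unsigned int nents)
{
	assert(sgl != NULL);
	memset(sgl, 0, sizeof(*sgl) * nents);
}
```

Calling demo_sg_init_table(list, n + 1) zeroes all n + 1 entries, whereas the
macro form leaves most of the last entry untouched. An inline function in the
backport header would avoid that class of surprise.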


From jon at opengridcomputing.com  Mon May  4 08:20:37 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Mon, 4 May 2009 10:20:37 -0500
Subject: [ofa-general] Build failures on current 1.4.1 dailies
In-Reply-To: <49FD7680.1060508@mellanox.co.il>
References: <ab66a8180904282237k19366778j21359a61112e4399@mail.gmail.com>
	<49FD7680.1060508@mellanox.co.il>
Message-ID: <20090504152037.GC19565@opengridcomputing.com>

On Sun, May 03, 2009 at 01:48:32PM +0300, Tziporet Koren wrote:
> Jon/Steve
> I see the issue is with nfs - please look at this

I do not think anyone has backported 2.6.27 (as I do not see a
kernel_addons/backport/2.6.27 backport dir).  The fix is a simple
one-liner in pagemap.h consisting of:
#define __grab_cache_page	grab_cache_page

Since there is not a backport dir for this kernel, do we really want to
add support for it this late in the OFED 1.4.1 release?  I have not done
any NFSRDMA testing for this kernel, so this could end up delaying the
1.4.1 release further.

Thanks,
Jon

>
> Thanks
> Tziporet
>
> Gennadiy Nerubayev wrote:
>> Hi all,
>>
>> Running on 2.6.27.21 x64. ofa_kernel build error as follows:
>>
>> -I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \
>>  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing  
>> -fno-common -Werror-implicit-function-declaration -Os -m64  
>> -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args
>> -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe  
>> -Wno-sign-compare -fno-asynchronous-unwind-tables
>> -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default  
>> -fno-stack-protector -fomit-frame-pointer -g  
>> -Wdeclaration-after-statement -Wno-pointer-sign
>>  -fwrapv -DMODULE -D"KBUILD_STR(s)=#s"  
>> -D"KBUILD_BASENAME=KBUILD_STR(file)"   
>> -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o  
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/.tmp_file.o
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function  
>> 'nfs_write_begin':
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error:  
>> implicit declaration of function '__grab_cache_page'
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354:  
>> warning: assignment makes pointer from integer without a cast
>> make[3]: ***  
>> [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1
>> make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2
>> make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2
>> make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21'
>> make: *** [kernel] Error 2
>> error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build)
>>
>> Assuming we turn off nfs stuff to go further, error number two is from  
>> infiniband-diags:
>> <snip>
>> checking whether to build shared libraries... yes
>> checking whether to build static libraries... yes
>> checking for sys_read_string in -libcommon... yes
>> checking for umad_init in -libumad... yes
>> checking for mad_dump_int in -libmad... no
>> configure: error: mad_dump_int() not found. diags require libibmad.
>> error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build)
>>
>> I confirmed that pulling management git and compiling libs and diags  
>> from there does not have this issue, and that the libibmad.so.1 that  
>> gets compiled in the daily OFED does not have mad_dump_int().
>>
>>
>


From monis at Voltaire.COM  Mon May  4 08:32:57 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 04 May 2009 18:32:57 +0300
Subject: [ofa-general] EWG/OFED meeting agenda for today (May 4)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>
Message-ID: <49FF0AA9.6000006@Voltaire.COM>

Tziporet Koren wrote:
> This is the agenda for today's EWG/OFED meeting
> 
> 1. OFED 1.4.1 status
> RC4 was done on Thursday, but we still have some open bugs. 
> We must decide which bugs are really critical for this release and
> decide when we are doing RC5 (should be final release)
> 
> ID    Sev  OS    Assignee                      Summary
> 1607  blo  SLES  Jeffrey.C.Becker at nasa.gov  kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
> 1616  cri  RHEL  jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
> 1620  cri  Othe  jon at opengridcomputing.com  backport definition of struct hash_desc doesn't match the...
> 1571  cri  RHEL  vu at mellanox.com            nfsrdma server crash @test5 connectathon basic test,
> 1287  maj  RHEL  bugzilla at openib.org        IPoIB datagram mode initial packet loss - decided to hold now
> 1596  maj  Othe  Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
> 1621  maj  RHEL  vu at mellanox.com            RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ...  - not sure if this is a showstopper
> 
> 
> 2. OFED 1.5
> a. Schedule: Since OFED 1.4.1 is delayed by more than a month, I think we
> need to consider its influence on the 1.5 schedule.
> BTW: If we delay the release we may want to change the kernel base to
> 2.6.31 too.
> 
> b. Status: We opened a git tree that is based on 2.6.30, and for now it
> compiles only on 2.6.30. We need to start the backports.
> Mellanox will be able to work on the backports only a few weeks from
> now.
> Is there another company that can start earlier?
> 
> 3. MPI new memory API
>    If Jeff S. joins, we can discuss the next steps
> 
> 4. Open discussion
> 
> 
> Tziporet
Please add 1623, it was opened by mistake on gen2


From Jeffrey.C.Becker at nasa.gov  Mon May  4 10:08:03 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Mon, 04 May 2009 10:08:03 -0700
Subject: [ofa-general] Build failures on current 1.4.1 dailies
In-Reply-To: <20090504152037.GC19565@opengridcomputing.com>
References: <ab66a8180904282237k19366778j21359a61112e4399@mail.gmail.com>	<49FD7680.1060508@mellanox.co.il>
	<20090504152037.GC19565@opengridcomputing.com>
Message-ID: <49FF20F3.1090808@nasa.gov>

Hi Jon

Jon Mason wrote:
> On Sun, May 03, 2009 at 01:48:32PM +0300, Tziporet Koren wrote:
>   
>> Jon/Steve
>> I see the issue is with nfs - please look at this
>>     
>
> I do not think anyone has backported 2.6.27 (as I do not see a
> kernel_addons/backport/2.6.27 backport dir).  The fix is a simple
> one-liner in pagemap.h consisting of:
> #define __grab_cache_page	grab_cache_page
>
> Since there is not a backport dir for this kernel, do we really want to
> add support for it this late in the OFED 1.4.1 release?  I have not done
> any NFSRDMA testing for this kernel, so this could end up delaying the
> 1.4.1 release further.
>   
I originally verified that NFSRDMA built against 2.6.27 for OFED 1.4.
Since OFED 1.4 was based on the 2.6.27 kernel, there was no reason to
have a backport. I believe this is still the case for 1.4.1, but it's
possible that one of the upstream fixes caused this breakage.

-jeff




From hnrose at comcast.net  Mon May  4 12:17:32 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 4 May 2009 15:17:32 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover.c: Cosmetic
	formatting changes
Message-ID: <20090504191732.GA29650@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 810b8db..1799618 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -144,7 +144,7 @@ list_node(ibnd_node_t *node, void *user_data)
 {
 	char *node_type;
 	char *nodename = remap_node_name(node_name_map, node->guid,
-					      node->nodedesc);
+					 node->nodedesc);
 
 	switch(node->type) {
 	case IB_NODE_SWITCH:
@@ -161,8 +161,7 @@ list_node(ibnd_node_t *node, void *user_data)
 		break;
 	}
 	fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n",
-		node_type,
-		node->guid, node->numports,
+		node_type, node->guid, node->numports,
 		mad_get_field(node->info, 0, IB_NODE_DEVID_F),
 		mad_get_field(node->info, 0, IB_NODE_VENDORID_F),
 		nodename);
@@ -173,15 +172,12 @@ list_node(ibnd_node_t *node, void *user_data)
 void
 list_nodes(ibnd_fabric_t *fabric, int list)
 {
-	if (list & LIST_CA_NODE) {
+	if (list & LIST_CA_NODE)
 		ibnd_iter_nodes_type(fabric, list_node, IB_NODE_CA, NULL);
-	}
-	if (list & LIST_SWITCH_NODE) {
+	if (list & LIST_SWITCH_NODE)
 		ibnd_iter_nodes_type(fabric, list_node, IB_NODE_SWITCH, NULL);
-	}
-	if (list & LIST_ROUTER_NODE) {
+	if (list & LIST_ROUTER_NODE)
 		ibnd_iter_nodes_type(fabric, list_node, IB_NODE_ROUTER, NULL);
-	}
 }
 
 void
@@ -194,14 +190,12 @@ out_ids(ibnd_node_t *node, int group, char *chname)
 		mad_get_field(node->info, 0, IB_NODE_DEVID_F));
 	if (sysimgguid)
 		fprintf(f, "sysimgguid=0x%" PRIx64, sysimgguid);
-	if (group
-	    && node->chassis && node->chassis->chassisnum) {
+	if (group && node->chassis && node->chassis->chassisnum) {
 		fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum);
 		if (chname)
 			fprintf(f, " (%s)", clean_nodedesc(chname));
-		if (ibnd_is_xsigo_tca(node->guid)
-				&& node->ports[1]
-				&& node->ports[1]->remoteport)
+		if (ibnd_is_xsigo_tca(node->guid) && node->ports[1] &&
+		    node->ports[1]->remoteport)
 			fprintf(f, " slot %d", node->ports[1]->remoteport->portnum);
 	}
 	fprintf(f, "\n");
@@ -242,8 +236,7 @@ out_switch(ibnd_node_t *node, int group, char *chname)
 	nodename = remap_node_name(node_name_map, node->guid, node->nodedesc);
 
 	fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n",
-		node->numports, node_name(node),
-		nodename,
+		node->numports, node_name(node), nodename,
 		node->smaenhsp0 ? "enhanced" : "base",
 		node->smalid, node->smalmc);
 
@@ -314,13 +307,12 @@ out_switch_port(ibnd_port_t *port, int group)
 		fprintf(f, "%s", ext_port_str);
 
 	rem_nodename = remap_node_name(node_name_map,
-				port->remoteport->node->guid,
-				port->remoteport->node->nodedesc);
+				       port->remoteport->node->guid,
+				       port->remoteport->node->nodedesc);
 
 	ext_port_str = out_ext_port(port->remoteport, group);
 	fprintf(f, "\t%s[%d]%s",
-		node_name(port->remoteport->node),
-		port->remoteport->portnum,
+		node_name(port->remoteport->node), port->remoteport->portnum,
 		ext_port_str ? ext_port_str : "");
 	if (port->remoteport->node->type != IB_NODE_SWITCH)
 		fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid);
@@ -355,8 +347,7 @@ out_ca_port(ibnd_port_t *port, int group)
 	if (port->node->type != IB_NODE_SWITCH)
 		fprintf(f, "(%" PRIx64 ") ", port->guid);
 	fprintf(f, "\t%s[%d]",
-		node_name(port->remoteport->node),
-		port->remoteport->portnum);
+		node_name(port->remoteport->node), port->remoteport->portnum);
 	str = out_ext_port(port->remoteport, group);
 	if (str)
 		fprintf(f, "%s", str);
@@ -364,8 +355,8 @@ out_ca_port(ibnd_port_t *port, int group)
 		fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid);
 
 	rem_nodename = remap_node_name(node_name_map,
-				port->remoteport->node->guid,
-				port->remoteport->node->nodedesc);
+				       port->remoteport->node->guid,
+				       port->remoteport->node->nodedesc);
 
 	fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n",
 		port->base_lid, port->lmc, rem_nodename,
@@ -513,7 +504,7 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 
 			fprintf(f, "\n# Chassis Switches");
 			for (node = ch->nodes; node;
-					node = node->next_chassis_node) {
+			     node = node->next_chassis_node) {
 				if (node->type == IB_NODE_SWITCH) {
 					out_switch(node, group, chname);
 					for (p = 1; p <= node->numports; p++) {
@@ -527,7 +518,7 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 
 			fprintf(f, "\n# Chassis CAs");
 			for (node = ch->nodes; node;
-					node = node->next_chassis_node) {
+			     node = node->next_chassis_node) {
 				if (node->type == IB_NODE_CA) {
 					out_ca(node, group, chname);
 					for (p = 1; p <= node->numports; p++) {
@@ -545,7 +536,7 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 		iter_user_data.group = group;
 		iter_user_data.skip_chassis_nodes = 0;
 		ibnd_iter_nodes_type(fabric, switch_iter_func,
-				IB_NODE_SWITCH, &iter_user_data);
+				     IB_NODE_SWITCH, &iter_user_data);
 	}
 
 	chname = NULL;
@@ -556,18 +547,17 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 		fprintf(f, "\nNon-Chassis Nodes\n");
 
 		ibnd_iter_nodes_type(fabric, switch_iter_func,
-				IB_NODE_SWITCH, &iter_user_data);
+				     IB_NODE_SWITCH, &iter_user_data);
 	}
 
 	iter_user_data.group = group;
 	iter_user_data.skip_chassis_nodes = 0;
 	/* Make pass on CAs */
-	ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA,
-			&iter_user_data);
+	ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA, &iter_user_data);
 
-	/* make pass on routers */
+	/* Make pass on routers */
 	ibnd_iter_nodes_type(fabric, router_iter_func, IB_NODE_ROUTER,
-			&iter_user_data);
+			     &iter_user_data);
 
 	return i;
 }
@@ -578,8 +568,7 @@ void dump_ports_report (ibnd_node_t *node, void *user_data)
 	ibnd_port_t *port = NULL;
 
 	/* for each port */
-	for (p = node->numports, port = node->ports[p];
-	     p > 0;
+	for (p = node->numports, port = node->ports[p]; p > 0;
 	     port = node->ports[--p]) {
 		uint32_t iwidth, ispeed;
 		if (port == NULL)
@@ -591,8 +580,7 @@ void dump_ports_report (ibnd_node_t *node, void *user_data)
 			ports_nt_str_compat(node),
 			node->type == IB_NODE_SWITCH ?
 				node->smalid : port->base_lid,
-			port->portnum,
-			port->guid,
+			port->portnum, port->guid,
 			dump_linkwidth_compat(iwidth),
 			dump_linkspeed_compat(ispeed));
 		if (port->remoteport)
@@ -604,12 +592,10 @@ void dump_ports_report (ibnd_node_t *node, void *user_data)
 					port->remoteport->node->smalid :
 					port->remoteport->base_lid,
 				port->remoteport->portnum,
-				port->remoteport->guid,
-				port->node->nodedesc,
+				port->remoteport->guid, port->node->nodedesc,
 				port->remoteport->node->nodedesc);
 		else
-			fprintf(stdout, "%36s'%s'\n", "",
-				port->node->nodedesc);
+			fprintf(stdout, "%36s'%s'\n", "", port->node->nodedesc);
 	}
 }
 

From hnrose at comcast.net  Mon May  4 13:00:18 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 4 May 2009 16:00:18 -0400
Subject: [ofa-general] [PATCH] opensm/PerfMgr DB: Remove leading underscores
	from internal names
Message-ID: <20090504200018.GA4590@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h
index 9598d02..d0eff73 100644
--- a/opensm/include/opensm/osm_perfmgr_db.h
+++ b/opensm/include/opensm/osm_perfmgr_db.h
@@ -120,32 +120,32 @@ typedef enum {
  * Port counter object.
  * Store all the port counters for a single port.
  */
-typedef struct _db_port {
+typedef struct db_port {
 	perfmgr_db_err_reading_t err_total;
 	perfmgr_db_err_reading_t err_previous;
 	perfmgr_db_data_cnt_reading_t dc_total;
 	perfmgr_db_data_cnt_reading_t dc_previous;
 	time_t last_reset;
-} _db_port_t;
+} db_port_t;
 
 /** =========================================================================
  * group port counters for ports into the nodes
  */
 #define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE << 1)
-typedef struct _db_node {
+typedef struct db_node {
 	cl_map_item_t map_item;	/* must be first */
 	uint64_t node_guid;
 	boolean_t esp0;
-	_db_port_t *ports;
+	db_port_t *ports;
 	uint8_t num_ports;
 	char node_name[NODE_NAME_SIZE];
-} _db_node_t;
+} db_node_t;
 
 /** =========================================================================
- * all nodes in the system.
+ * all nodes in the subnet.
  */
-typedef struct _db {
-	cl_qmap_t pc_data;	/* stores type (_db_node_t *) */
+typedef struct perfmgr_db {
+	cl_qmap_t pc_data;	/* stores type (db_node_t *) */
 	cl_plock_t lock;
 	struct osm_perfmgr *perfmgr;
 } perfmgr_db_t;
diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c
index 8be0b6f..b0bfd36 100644
--- a/opensm/opensm/osm_perfmgr_db.c
+++ b/opensm/opensm/osm_perfmgr_db.c
@@ -77,17 +77,17 @@ void perfmgr_db_destroy(perfmgr_db_t * db)
 /**********************************************************************
  * Internal call db->lock should be held when calling
  **********************************************************************/
-static inline _db_node_t *_get(perfmgr_db_t * db, uint64_t guid)
+static inline db_node_t *get(perfmgr_db_t * db, uint64_t guid)
 {
 	cl_map_item_t *rc = cl_qmap_get(&db->pc_data, guid);
 	const cl_map_item_t *end = cl_qmap_end(&db->pc_data);
 
 	if (rc == end)
 		return (NULL);
-	return ((_db_node_t *) rc);
+	return ((db_node_t *) rc);
 }
 
-static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port)
+static inline perfmgr_db_err_t bad_node_port(db_node_t * node, uint8_t port)
 {
 	if (!node)
 		return (PERFMGR_EVENT_DB_GUIDNOTFOUND);
@@ -98,16 +98,16 @@ static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port)
 
 /** =========================================================================
  */
-static _db_node_t *__malloc_node(uint64_t guid, boolean_t esp0,
-				 uint8_t num_ports, char *name)
+static db_node_t *malloc_node(uint64_t guid, boolean_t esp0,
+			      uint8_t num_ports, char *name)
 {
 	int i = 0;
 	time_t cur_time = 0;
-	_db_node_t *rc = malloc(sizeof(*rc));
+	db_node_t *rc = malloc(sizeof(*rc));
 	if (!rc)
 		return (NULL);
 
-	rc->ports = calloc(num_ports, sizeof(_db_port_t));
+	rc->ports = calloc(num_ports, sizeof(db_port_t));
 	if (!rc->ports)
 		goto free_rc;
 	rc->num_ports = num_ports;
@@ -131,7 +131,7 @@ free_rc:
 
 /** =========================================================================
  */
-static void __free_node(_db_node_t * node)
+static void free_node(db_node_t * node)
 {
 	if (!node)
 		return;
@@ -141,7 +141,7 @@ static void __free_node(_db_node_t * node)
 }
 
 /* insert nodes to the database */
-static perfmgr_db_err_t __insert(perfmgr_db_t * db, _db_node_t * node)
+static perfmgr_db_err_t insert(perfmgr_db_t * db, db_node_t * node)
 {
 	cl_map_item_t *rc = cl_qmap_insert(&db->pc_data, node->node_guid,
 					   (cl_map_item_t *) node);
@@ -160,15 +160,15 @@ perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, boolean_t esp0,
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_excl_acquire(&db->lock);
-	if (!_get(db, guid)) {
-		_db_node_t *pc_node = __malloc_node(guid, esp0, num_ports,
-						    name);
+	if (!get(db, guid)) {
+		db_node_t *pc_node = malloc_node(guid, esp0, num_ports,
+						 name);
 		if (!pc_node) {
 			rc = PERFMGR_EVENT_DB_NOMEM;
 			goto Exit;
 		}
-		if (__insert(db, pc_node)) {
-			__free_node(pc_node);
+		if (insert(db, pc_node)) {
+			free_node(pc_node);
 			rc = PERFMGR_EVENT_DB_FAIL;
 			goto Exit;
 		}
@@ -183,7 +183,7 @@ Exit:
  **********************************************************************/
 static inline void
 debug_dump_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num,
-		       _db_port_t * port, perfmgr_db_err_reading_t * cur)
+		       db_port_t * port, perfmgr_db_err_reading_t * cur)
 {
 	osm_log_t *log = db->perfmgr->log;
 
@@ -250,14 +250,14 @@ perfmgr_db_err_t
 perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid,
 			   uint8_t port, perfmgr_db_err_reading_t * reading)
 {
-	_db_port_t *p_port = NULL;
-	_db_node_t *node = NULL;
+	db_port_t *p_port = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_err_reading_t *previous = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 	osm_epi_pe_event_t epi_pe_data;
 
 	cl_plock_excl_acquire(&db->lock);
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -323,12 +323,12 @@ perfmgr_db_err_t perfmgr_db_get_prev_err(perfmgr_db_t * db, uint64_t guid,
 					 uint8_t port,
 					 perfmgr_db_err_reading_t * reading)
 {
-	_db_node_t *node = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_acquire(&db->lock);
 
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -342,12 +342,12 @@ Exit:
 perfmgr_db_err_t
 perfmgr_db_clear_prev_err(perfmgr_db_t * db, uint64_t guid, uint8_t port)
 {
-	_db_node_t *node = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_err_reading_t *previous = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_excl_acquire(&db->lock);
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -363,7 +363,7 @@ Exit:
 
 static inline void
 debug_dump_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num,
-		      _db_port_t * port, perfmgr_db_data_cnt_reading_t * cur)
+		      db_port_t * port, perfmgr_db_data_cnt_reading_t * cur)
 {
 	osm_log_t *log = db->perfmgr->log;
 	if (!osm_log_is_active(log, OSM_LOG_DEBUG))
@@ -392,14 +392,14 @@ perfmgr_db_err_t
 perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid,
 			  uint8_t port, perfmgr_db_data_cnt_reading_t * reading)
 {
-	_db_port_t *p_port = NULL;
-	_db_node_t *node = NULL;
+	db_port_t *p_port = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_data_cnt_reading_t *previous = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 	osm_epi_dc_event_t epi_dc_data;
 
 	cl_plock_excl_acquire(&db->lock);
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -448,12 +448,12 @@ perfmgr_db_err_t perfmgr_db_get_prev_dc(perfmgr_db_t * db, uint64_t guid,
 					uint8_t port,
 					perfmgr_db_data_cnt_reading_t * reading)
 {
-	_db_node_t *node = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_acquire(&db->lock);
 
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -467,12 +467,12 @@ Exit:
 perfmgr_db_err_t
 perfmgr_db_clear_prev_dc(perfmgr_db_t * db, uint64_t guid, uint8_t port)
 {
-	_db_node_t *node = NULL;
+	db_node_t *node = NULL;
 	perfmgr_db_data_cnt_reading_t *previous = NULL;
 	perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS;
 
 	cl_plock_excl_acquire(&db->lock);
-	node = _get(db, guid);
+	node = get(db, guid);
 	if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS)
 		goto Exit;
 
@@ -486,9 +486,9 @@ Exit:
 	return (rc);
 }
 
-static void __clear_counters(cl_map_item_t * const p_map_item, void *context)
+static void clear_counters(cl_map_item_t * const p_map_item, void *context)
 {
-	_db_node_t *node = (_db_node_t *) p_map_item;
+	db_node_t *node = (db_node_t *) p_map_item;
 	int i = 0;
 	time_t ts = time(NULL);
 
@@ -527,7 +527,7 @@ static void __clear_counters(cl_map_item_t * const p_map_item, void *context)
 void perfmgr_db_clear_counters(perfmgr_db_t * db)
 {
 	cl_plock_excl_acquire(&db->lock);
-	cl_qmap_apply_func(&db->pc_data, __clear_counters, (void *)db);
+	cl_qmap_apply_func(&db->pc_data, clear_counters, (void *)db);
 	cl_plock_release(&db->lock);
 #if 0
 	if (db->db_impl->clear_counters)
@@ -538,7 +538,7 @@ void perfmgr_db_clear_counters(perfmgr_db_t * db)
 /**********************************************************************
  * Output a tab delimited output of the port counters
  **********************************************************************/
-static void __dump_node_mr(_db_node_t * node, FILE * fp)
+static void dump_node_mr(db_node_t * node, FILE * fp)
 {
 	int i = 0;
 
@@ -605,7 +605,7 @@ static void __dump_node_mr(_db_node_t * node, FILE * fp)
 /**********************************************************************
  * Output a human readable output of the port counters
  **********************************************************************/
-static void __dump_node_hr(_db_node_t * node, FILE * fp)
+static void dump_node_hr(db_node_t * node, FILE * fp)
 {
 	int i = 0;
 
@@ -670,19 +670,19 @@ typedef struct {
 
 /**********************************************************************
  **********************************************************************/
-static void __db_dump(cl_map_item_t * const p_map_item, void *context)
+static void db_dump(cl_map_item_t * const p_map_item, void *context)
 {
-	_db_node_t *node = (_db_node_t *) p_map_item;
+	db_node_t *node = (db_node_t *) p_map_item;
 	dump_context_t *c = (dump_context_t *) context;
 	FILE *fp = c->fp;
 
 	switch (c->dump_type) {
 	case PERFMGR_EVENT_DB_DUMP_MR:
-		__dump_node_mr(node, fp);
+		dump_node_mr(node, fp);
 		break;
 	case PERFMGR_EVENT_DB_DUMP_HR:
 	default:
-		__dump_node_hr(node, fp);
+		dump_node_hr(node, fp);
 		break;
 	}
 }
@@ -694,16 +694,16 @@ void
 perfmgr_db_print_by_name(perfmgr_db_t * db, char *nodename, FILE *fp)
 {
 	cl_map_item_t *item = NULL;
-	_db_node_t *node = NULL;
+	db_node_t *node = NULL;
 
 	cl_plock_acquire(&db->lock);
 
 	/* find the node */
 	item = cl_qmap_head(&db->pc_data);
 	while (item != cl_qmap_end(&db->pc_data)) {
-		node = (_db_node_t *)item;
+		node = (db_node_t *)item;
 		if (strcmp(node->node_name, nodename) == 0) {
-			__dump_node_hr(node, fp);
+			dump_node_hr(node, fp);
 			goto done;
 		}
 		item = cl_qmap_next(item);
@@ -726,7 +726,7 @@ perfmgr_db_print_by_guid(perfmgr_db_t * db, uint64_t nodeguid, FILE *fp)
 
 	node = cl_qmap_get(&db->pc_data, nodeguid);
 	if (node != cl_qmap_end(&db->pc_data))
-		__dump_node_hr((_db_node_t *)node, fp);
+		dump_node_hr((db_node_t *)node, fp);
 	else
 		fprintf(fp, "Node 0x%" PRIx64 " not found...\n", nodeguid);
 
@@ -747,7 +747,7 @@ perfmgr_db_dump(perfmgr_db_t * db, char *file, perfmgr_db_dump_t dump_type)
 	context.dump_type = dump_type;
 
 	cl_plock_acquire(&db->lock);
-	cl_qmap_apply_func(&db->pc_data, __db_dump, (void *)&context);
+	cl_qmap_apply_func(&db->pc_data, db_dump, (void *)&context);
 	cl_plock_release(&db->lock);
 	fclose(context.fp);
 	return (PERFMGR_EVENT_DB_SUCCESS);


From gregkh at suse.de  Mon May  4 13:00:22 2009
From: gregkh at suse.de (Greg Kroah-Hartman)
Date: Mon, 4 May 2009 13:00:22 -0700
Subject: [ofa-general] [PATCH] infiniband: ehca: remove driver_data direct
	access of struct device
Message-ID: <20090504200022.GA22746@kroah.com>

From: Greg Kroah-Hartman <gregkh at suse.de>

In the near future, the driver core is going to not allow direct access
to the driver_data pointer in struct device.  Instead, the functions
dev_get_drvdata() and dev_set_drvdata() should be used.  These functions
have been around since the beginning, so are backwards compatible with
all older kernel versions.
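
For readers unfamiliar with the accessor pattern, here is a minimal userspace
sketch of the idea. It is not kernel code: struct mock_device, struct
mock_shca and every mock_* function are hypothetical stand-ins invented for
illustration, loosely modelled on the ehca call sites touched below. The
point it tries to show is that once all callers go through the two
accessors, the storage behind them can move without touching any driver.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct device.
 * Only the field relevant to this change is modelled. */
struct mock_device {
	void *driver_data;	/* owned by the "core"; drivers should not touch it */
};

/* Stand-ins for dev_set_drvdata() and dev_get_drvdata(). */
static void mock_dev_set_drvdata(struct mock_device *dev, void *data)
{
	assert(dev != NULL);
	dev->driver_data = data;
}

static void *mock_dev_get_drvdata(const struct mock_device *dev)
{
	assert(dev != NULL);
	return dev->driver_data;
}

/* Hypothetical driver-private state, loosely modelled on struct ehca_shca. */
struct mock_shca {
	unsigned long long handle;
};

/* Probe path: attach the private state to the device via the setter. */
static void mock_probe(struct mock_device *dev, struct mock_shca *shca,
		       unsigned long long handle)
{
	shca->handle = handle;
	mock_dev_set_drvdata(dev, shca);
}

/* Attribute "show" path: recover the private state via the getter. */
static unsigned long long mock_show_handle(const struct mock_device *dev)
{
	const struct mock_shca *shca = mock_dev_get_drvdata(dev);

	return shca->handle;
}
```

A caller would run mock_probe() once and then use mock_show_handle()
wherever the old code dereferenced the field directly.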

Cc: Sean Hefty <sean.hefty at intel.com>
Cc: Roland Dreier <rolandd at cisco.com>
Cc: Hal Rosenstock <hal.rosenstock at gmail.com>
Cc: general at lists.openfabrics.org
Cc: Christoph Raisch <raisch at de.ibm.com>
Cc: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 drivers/infiniband/hw/ehca/ehca_main.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -636,7 +636,7 @@ static ssize_t  ehca_show_##name(struct 
 	struct hipz_query_hca *rblock;				           \
 	int data;                                                          \
 									   \
-	shca = dev->driver_data;					   \
+	shca = dev_get_drvdata(dev);					   \
 									   \
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);			   \
 	if (!rblock) {						           \
@@ -680,7 +680,7 @@ static ssize_t ehca_show_adapter_handle(
 					struct device_attribute *attr,
 					char *buf)
 {
-	struct ehca_shca *shca = dev->driver_data;
+	struct ehca_shca *shca = dev_get_drvdata(dev);
 
 	return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle);
 
@@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct o
 
 	shca->ofdev = dev;
 	shca->ipz_hca_handle.handle = *handle;
-	dev->dev.driver_data = shca;
+	dev_set_drvdata(&dev->dev, shca);
 
 	ret = ehca_sense_attributes(shca);
 	if (ret < 0) {
@@ -878,7 +878,7 @@ probe1:
 
 static int __devexit ehca_remove(struct of_device *dev)
 {
-	struct ehca_shca *shca = dev->dev.driver_data;
+	struct ehca_shca *shca = dev_get_drvdata(&dev->dev);
 	unsigned long flags;
 	int ret;
 

From sean.hefty at intel.com  Mon May  4 15:49:49 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 4 May 2009 15:49:49 -0700
Subject: [ofa-general] [PATCH] ib-mgmt: fixup ibsendtrap for windows
Message-ID: <F9CF6702599B420CA6B34588BAB7323F@amr.corp.intel.com>

Fix some typecast issues.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---

 infiniband-diags/src/ibsendtrap.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 469bc39..7ad588e 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port)
 static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
-	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+	n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
 	n->g_or_v.generic.trap_num = cl_hton16(144);
-	n->issuer_lid = cl_hton16(port->lid);
-	n->data_details.ntc_144.lid = cl_hton16(port->lid);
+	n->issuer_lid = cl_hton16((uint16_t) port->lid);
+	n->data_details.ntc_144.lid = n->issuer_lid;
 	n->data_details.ntc_144.local_changes =
 	    TRAP_144_MASK_OTHER_LOCAL_CHANGES;
 	n->data_details.ntc_144.change_flgs =
@@ -79,10 +79,10 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
 static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port)
 {
 	n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT;
-	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
+	n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));
 	n->g_or_v.generic.trap_num = cl_hton16(129);
-	n->issuer_lid = cl_hton16(port->lid);
-	n->data_details.ntc_129_131.lid = cl_hton16(port->lid);
+	n->issuer_lid = cl_hton16((uint16_t) port->lid);
+	n->data_details.ntc_129_131.lid = n->issuer_lid;
 	n->data_details.ntc_129_131.pad = 0;
 	n->data_details.ntc_129_131.port_num = (uint8_t) error_port;
 }


From messenger at webex.com  Mon May  4 17:24:46 2009
From: messenger at webex.com (Jeff Squyres)
Date: Tue, 5 May 2009 00:24:46 GMT
Subject: [ofa-general] ***SPAM*** Meeting invitation: Verbs memory
	registration
Message-ID: <94632617.1241483086001.JavaMail.nobody@jsj6wl002.webex.com>

Hello,

Jeff Squyres invites you to attend this online meeting.

Topic: Verbs memory registration
Date: Monday, May 11, 2009
Time: 12:00 pm, Eastern Daylight Time (GMT -04:00, New York)
Meeting Number: 203 642 533
Meeting Password: verbs

Please click the link below to see more information, or to join the meeting.

----------------------------------------------------------------
ALERT:Toll-Free Dial Restrictions for (408) and (919) Area Codes
----------------------------------------------------------------

As of April 9th, 2009, you can no longer dial toll free in the 408 or 919 area codes in the United States.  The affected toll free numbers are: (866) 432-9903 for the San Jose/Milpitas area and (866) 349-3520 for the RTP area.

Please dial the local access number for your area from the list below:
-  San Jose/Milpitas (408) area:  525-6800
-  RTP (919) area:  392-3330

-------------------------------------------------------
To join the online meeting
-------------------------------------------------------
1. Go to https://cisco.webex.com/cisco/j.php?ED=119193612&UID=1123387277&PW=5ef2c01d4e5c171043
2. Enter your name and email address.
3. Enter the meeting password: verbs
4. Click "Join Now".

------------------------------------------------------- 
To join the teleconference only 
------------------------------------------------------- 
1. Dial into Cisco WebEx (view all Global Access Numbers at
http://cisco.com/en/US/about/doing_business/conferencing/index.html).
2. Press 3 to attend the meeting. 
3. Follow the prompts to enter the Meeting Number (listed above) or Access Code followed by the # sign. 

San Jose, CA: +1.408.525.6800  RTP: +1.919.392.3330 

US/Canada: +1.866.432.9903  United Kingdom: +44.20.8824.0117 

India: +91.80.4350.1111  Germany: +49.619.6773.9002 

Japan: +81.3.5763.9394  China: +86.10.8515.5666 


------------------------------------------------------- 
To join the meeting on iPhone
-------------------------------------------------------
Go to wbx://cisco.webex.com/ciscosales?MK=203642533&MPW=768e0fa81cb639e8ff44dd29522061f0bb154512eeedc42495588659cb1bf790

Don't have the iPhone WebEx application yet? 
Go to http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=298844386 


-------------------------------------------------------
For assistance
-------------------------------------------------------
1. Go to https://cisco.webex.com/cisco/mc
2. On the left navigation bar, click "Support".

You can contact me at:
jsquyres at cisco.com
1-408-525 0971

To add this meeting to your calendar program (for example Microsoft Outlook), click this link:
https://cisco.webex.com/cisco/j.php?ED=119193612&UID=1123387277&ICS=MI&LD=1&RD=2&ST=1&SHA2=PQM0FncQOp/M461AFXiuPStSZyv8DeZiipMItYw7884=

The playback of UCF (Universal Communications Format) rich media files requires appropriate players. To view this type of rich media files in the meeting, please check whether you have the players installed on your computer by going to  https://cisco.webex.com/cisco/systemdiagnosis.php


IMPORTANT NOTICE: This WebEx service includes a feature that allows audio and any documents and other materials exchanged or viewed during the session to be recorded. By joining this session, you automatically consent to such recordings. If you do not consent to the recording, do not join the session.

From jsquyres at cisco.com  Mon May  4 17:25:23 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 4 May 2009 20:25:23 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <adaljpgyckk.fsf@cisco.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com><C61F6F4D.4AB3%bwbarre@sandia.gov><382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com><8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
	<adaljpgyckk.fsf@cisco.com>
Message-ID: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com>

I think that this thread has gotten to the point where people are no  
longer reading each post carefully and are therefore re-hashing points  
that have already been discussed.  It has therefore reached the end of  
its usefulness.

It was suggested today that a teleconference to discuss these issues
might be much more useful (an hour-long teleconference can save a
week's worth of emails!).  This will be a technical call to discuss
memory registration issues; it will not be an EWG call.  I've set up a
WebEx call for next Monday at the "normal" time: noon US Eastern, 9am
US Pacific, 7pm Israel.  The invite will be coming to the ewg and
general lists shortly.

*** PLEASE USE THE WEBEX URL TO JOIN THE TELECONFERENCE (vs. just dialing in)
     (when you log on, it'll prompt you for a phone number to call you back;
     yes, non-US phone numbers are supported)

I will make up a small number of slides that attempt to summarize all  
the arguments (on both sides) so far.  Hopefully, they can serve as a  
starting point for discussion.

Thanks; see you next Monday.


On May 1, 2009, at 1:09 PM, Roland Dreier (rdreier) wrote:

>  > You mentioned that doing this stuff is a choice; the choice that
>  > MPI's/ ULPs/applications therefore have is:
>  >
>  > - don't use registration caches/memory allocation hooking, have
>  > terrible performance
>  > - use registration caches/memory allocation hooking, have good
>  > performance
>
> I think it's a bit of a stretch to suggest that all or even most
> userspace RDMA applications have the same need for registration caching
> as MPI.  In fact my feeling is that the fact that MPI must deal with
> RDMA to arbitrary memory allocated by an application out of MPI's
> control is the exception.  My most recent experience was with Cisco's
> RAB library, and in that case we simply designed the library so that all
> RDMA was done to memory allocated by the library -- so no need for a
> registration cache, and in fact no need for registration in any fast
> path.  I suspect that the majority of code written to use RDMA natively
> will be designed with similar properties.
>
> So this proposal is very much an MPI-specific interface.  Which leads to
> my next point.  I have no doubt that the MPI community has a very good
> idea of a memory registration interface that would make MPI
> implementations simpler and more robust.  However I don't think there's
> quite as much expertise about what the best way to implement such an
> interface is.
>
> My initial reaction is that I don't want to extend the kernel ABI with
> a set of new MPI-specific verbs if there's a way around it.  We've been
> told over and over that the registration cache is complex and fragile
> code -- but moving complex and fragile code into the kernel doesn't
> magically make it any simpler or more robust, it just means that bugs
> now crash the whole system instead of just affecting one process.
>
> Now, of course MMU notifiers allow the kernel to know reliably when a
> process's page tables change, which means that all the complicated
> malloc hooking etc is not needed.  So that complexity is avoided in the
> kernel.  But suppose I give userspace the same MMU notifier capability
> (eg I add a system call like "if any mappings in the virtual address
> range X ... Y change, then write a 1 to virtual address Z") -- then what
> do I gain from having the rest of the registration caching in the
> kernel?  (And avoiding the duplication of caching code between multiple
> MPI implementations is not an answer -- it's quite feasible to put the
> caching code into libibverbs if that's the best place for it)
>
>  - R.
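
To make the hypothetical "write a 1 to virtual address Z" facility concrete,
here is a rough userspace sketch of how a registration cache might consume
such a flag. Nothing here is an existing kernel or libibverbs interface:
struct reg_entry, struct reg_cache, fake_register() (a stand-in for an
expensive call such as memory registration) and fake_mapping_changed() (a
stand-in for the proposed notification) are all assumptions made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

/* One cached registration.  'stale' plays the role of "virtual address Z":
 * the hypothetical facility would write 1 here when any mapping in
 * [start, start + len) changes. */
struct reg_entry {
	uintptr_t start;
	size_t len;
	uint32_t key;			/* stand-in for a memory region key */
	volatile unsigned char stale;
	int in_use;
};

struct reg_cache {
	struct reg_entry slot[CACHE_SLOTS];
	uint32_t next_key;
	unsigned registrations;		/* count of expensive registrations */
};

/* Stand-in for the real (expensive) registration call. */
static uint32_t fake_register(struct reg_cache *c)
{
	c->registrations++;
	return ++c->next_key;
}

/* Stand-in for the proposed notification: mark overlapping entries stale. */
static void fake_mapping_changed(struct reg_cache *c, uintptr_t start,
				 size_t len)
{
	for (int i = 0; i < CACHE_SLOTS; i++) {
		struct reg_entry *e = &c->slot[i];

		if (e->in_use && start < e->start + e->len &&
		    e->start < start + len)
			e->stale = 1;
	}
}

/* Return a key covering [start, start + len), reusing a cached
 * registration unless it has been marked stale. */
static uint32_t cache_get_key(struct reg_cache *c, uintptr_t start, size_t len)
{
	struct reg_entry *free_slot = NULL;

	assert(c != NULL && len > 0);
	for (int i = 0; i < CACHE_SLOTS; i++) {
		struct reg_entry *e = &c->slot[i];

		if (e->in_use && e->stale)
			e->in_use = 0;	/* drop invalidated entries */
		if (!e->in_use) {
			if (!free_slot)
				free_slot = e;
			continue;
		}
		if (start >= e->start && start + len <= e->start + e->len)
			return e->key;	/* hit: no registration needed */
	}
	if (!free_slot)
		free_slot = &c->slot[0];	/* naive eviction, for brevity */
	free_slot->start = start;
	free_slot->len = len;
	free_slot->key = fake_register(c);
	free_slot->stale = 0;
	free_slot->in_use = 1;
	return free_slot->key;
}
```

In this sketch the fast path is a flag check plus a range comparison; the
malloc hooking that the thread describes as fragile is replaced by whoever
sets the flag.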


-- 
Jeff Squyres
Cisco Systems


From HNGUYEN at de.ibm.com  Mon May  4 22:13:16 2009
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Tue, 5 May 2009 07:13:16 +0200
Subject: [ofa-general] Re: [PATCH] infiniband: ehca: remove driver_data
 direct access of	struct device
In-Reply-To: <20090504200022.GA22746@kroah.com>
References: <20090504200022.GA22746@kroah.com>
Message-ID: <OFD6B3383E.4B7D18D2-ONC12575AD.001C9A7B-C12575AD.001CADBA@de.ibm.com>

Hi,
This patch looks fine to me. Thanks!
Nam

Greg Kroah-Hartman <gregkh at suse.de> wrote on 04.05.2009 22:00:22:

> From: Greg Kroah-Hartman <gregkh at suse.de>
>
> In the near future, the driver core is going to not allow direct access
> to the driver_data pointer in struct device.  Instead, the functions
> dev_get_drvdata() and dev_set_drvdata() should be used.  These functions
> have been around since the beginning, so are backwards compatible with
> all older kernel versions.
>
> Cc: Sean Hefty <sean.hefty at intel.com>
> Cc: Roland Dreier <rolandd at cisco.com>
> Cc: Hal Rosenstock <hal.rosenstock at gmail.com>
> Cc: general at lists.openfabrics.org
> Cc: Christoph Raisch <raisch at de.ibm.com>
> Cc: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>
>
> ---
>  drivers/infiniband/hw/ehca/ehca_main.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> --- a/drivers/infiniband/hw/ehca/ehca_main.c
> +++ b/drivers/infiniband/hw/ehca/ehca_main.c
> @@ -636,7 +636,7 @@ static ssize_t  ehca_show_##name(struct
>     struct hipz_query_hca *rblock;                       \
>     int data;                                                          \
>                                \
> -   shca = dev->driver_data;                  \
> +   shca = dev_get_drvdata(dev);                  \
>                                \
>     rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);            \
>     if (!rblock) {                             \
> @@ -680,7 +680,7 @@ static ssize_t ehca_show_adapter_handle(
>                 struct device_attribute *attr,
>                 char *buf)
>  {
> -   struct ehca_shca *shca = dev->driver_data;
> +   struct ehca_shca *shca = dev_get_drvdata(dev);
>
>     return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle);
>
> @@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct o
>
>     shca->ofdev = dev;
>     shca->ipz_hca_handle.handle = *handle;
> -   dev->dev.driver_data = shca;
> +   dev_set_drvdata(&dev->dev, shca);
>
>     ret = ehca_sense_attributes(shca);
>     if (ret < 0) {
> @@ -878,7 +878,7 @@ probe1:
>
>  static int __devexit ehca_remove(struct of_device *dev)
>  {
> -   struct ehca_shca *shca = dev->dev.driver_data;
> +   struct ehca_shca *shca = dev_get_drvdata(&dev->dev);
>     unsigned long flags;
>     int ret;
>


From jackm at dev.mellanox.co.il  Tue May  5 00:21:36 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 5 May 2009 10:21:36 +0300
Subject: [ofa-general] OFED,
	the backported <linux/scatterlist.h> header and sg_init_table()
In-Reply-To: <20090504145641.GA19565@opengridcomputing.com>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<e2e108260905030836r108c8578rfb2ea5ffad6df50e@mail.gmail.com>
	<20090504145641.GA19565@opengridcomputing.com>
Message-ID: <200905051021.36725.jackm@dev.mellanox.co.il>

On Monday 04 May 2009 17:56, Jon Mason wrote:
> What's even worse is that sg_init_table is already defined in the
> RHEL5.3 headers.  When coding up a header cleanup patch for RHEL5.3, I
> noticed it was already defined in linux/ncrypto.h.  Also, it's there for
> RHEL5.2 (and a few older kernels).
> 
I do not see that as "worse". ncrypto is the cryptographic scatterlist API, which is not used anywhere in OFED.
Do we include this only because of its base scatterlist additions? 
ncrypto.h itself has a list of includes.

I guess, though, you could do the following for scatterlist.h in the RHEL5.3 backport:
==============================================================================
#ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__
#define __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__

/* crypto.h includes scatterlist.h */
#include<linux/ncrypto.h>

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
        sg->page = page;
}

#define for_each_sg(sglist, sg, nr, __i)        \
        for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)

static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
        if (!sg) {
                BUG();
                return NULL;
        }
        return sg + 1;
}

#endif
==============================================================================

linux/ncrypto.h, though, is not part of, say, kernel 2.6.23.  We need to check whether the above is a Red Hat-only solution.

- Jack
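
For illustration, the two helpers in the proposed header can be exercised in
userspace with a stand-in structure. This is only a sketch: the struct below is
NOT the kernel's struct scatterlist, sg_total_length() is a hypothetical helper,
and the BUG() call is replaced by a plain NULL return. Both helpers step with
plain pointer arithmetic, so the sketch covers flat (unchained) arrays only.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct scatterlist; only the fields the
 * sketch needs. This is not the kernel definition. */
struct page;                            /* opaque, as in the kernel */
struct scatterlist {
        struct page *page;
        unsigned int offset;
        unsigned int length;
};

/* Same shape as the macro in the proposed backport header. */
#define for_each_sg(sglist, sg, nr, __i)        \
        for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)

/* Same shape as the proposed sg_next(), minus the kernel-only BUG(). */
static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
        if (!sg)
                return NULL;
        return sg + 1;
}

/* Hypothetical helper: total bytes described by a flat sg array. */
static unsigned int sg_total_length(struct scatterlist *sgl, int nents)
{
        struct scatterlist *sg;
        unsigned int total = 0;
        int i;

        for_each_sg(sgl, sg, nents, i)
                total += sg->length;
        return total;
}
```

With three entries of length 512, 1024 and 100, sg_total_length() adds them up
to 1636, and sg_next() on the first entry returns the address of the second.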


From vlad at lists.openfabrics.org  Tue May  5 03:25:02 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue,  5 May 2009 03:25:02 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090505-0200 daily build status
Message-ID: <20090505102502.7E36AE61024@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at mellanox.co.il  Tue May  5 04:01:51 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 5 May 2009 14:01:51 +0300
Subject: [ofa-general] EWG/OFED meeting meeting minutes for May 4, 09
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0291246E@mtlexch01.mtl.com>

These are the OFED meeting minutes for May 4 09:

Summary:
========
1. OFED 1.4.1 is delayed: 
    RC5 is planned for next Monday May 11.
    GA for May 14
2. OFED 1.5 schedule: will be delayed to October since 1.4.1 is delayed
3. MPI memory registration API request: We decided to have a special
meeting on this subject next week on the same slot.

Reminder: 
OFED roadmap on the web:
http://www.openfabrics.org/txt/woody/roadmap.txt
EWG meeting minutes:
http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/

Details:
======

1. OFED 1.4.1 bugs status

These are the open critical bugs:

ID    Sev  OS     Assignee                      Summary
1616  cri  RHEL   jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
      - We see a similar problem in SDP with bug 1612 on PPC. IBM will try to help debug.
1571  cri  RHEL   vu at mellanox.com            nfsrdma server crash @test5 connectathon basic test
      - Related to the mlx4 implementation; a fix is under test.

Other bugs:
1620  cri  Other  jon at opengridcomputing.com  backport definition of struct hash_desc doesn't match the...
      - There is a fix; waiting for an OK from Brian.
1287  maj  RHEL   bugzilla at openib.org        IPoIB datagram mode initial packet loss
      - Document in the release notes and say there is a workaround.
1596  maj  Other  Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
      - We will document it for 1.4.1 and do a better fix for 1.5.
1621  maj  RHEL   vu at mellanox.com            RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ...
      - Was a usage issue only.
1623  maj  Other  shirif at voltaire.com        IB Devices not found on SLES11, ia64 (HP Blade)
      - Seems to be a FW configuration issue.

We decided to wait for another week to fix the PPC issue.
New schedule is:
* RC5 next Monday - May 11
* GA Thu May 14


2. OFED 1.5 schedule
Release is delayed by a month.
We do not wish to delay any further, since we want one major OFED
release each year and we want the new OFED release out before SC09.
Kernel base will stay 2.6.30.

The new schedule is:
	Feature Freeze: Jun 7, 09
	Alpha Release:  Jun 12, 09
	Beta Release:	Jun 9, 09
	RC1:                 Jul 25, 09
	RC2-RCx: About every 2 weeks as needed
	We usually have ~6 RCs
	Release:            Oct 15, 09

Note: Jeff S. suggests that we drop MPI from the OFED package. We will
discuss this after 1.4.1 release.


3. MPI new memory API
We will have a special meeting on this subject next week. Jeff will
prepare the meeting.

Tziporet


From ossrosch at linux.vnet.ibm.com  Tue May  5 04:23:51 2009
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Tue, 5 May 2009 13:23:51 +0200
Subject: [ofa-general] Queue pair state for multicast group attachment
Message-ID: <200905051323.52192.ossrosch@linux.vnet.ibm.com>

Hi,

During testing with the ib_diag tools (ib_send_lat/bw) we noticed some problems
when using the multicast option. The tools attach queue pairs to multicast
groups while the queue pairs are in the RESET state. The IB standard does not
specifically define when the attach operation should be processed. In our
opinion a QP should not be attached until it is in RTR and able to receive
data from the multicast group.
What is the community's opinion on that? Is it possible to change the diag tools?

regards Stefan


From dotanba at gmail.com  Tue May  5 04:36:03 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 5 May 2009 14:36:03 +0300
Subject: [ofa-general] Queue pair state for multicast group attachment
In-Reply-To: <200905051323.52192.ossrosch@linux.vnet.ibm.com>
References: <200905051323.52192.ossrosch@linux.vnet.ibm.com>
Message-ID: <2f3bf9a60905050436m2569fafbj7a2a1d49c806bc3b@mail.gmail.com>

I believe that the right QP state in which to attach it to a multicast
group is INIT, since in this state you can post receive requests too.
As soon as you modify the QP state to RTR, the multicast messages
will be received by this QP.

Dotan
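
The ordering argument above can be written down as a small state model. This
is only a sketch under stated assumptions: the enum is a simplified subset of
the QP states (error and drain states are left out), and all three function
names are hypothetical, not part of any verbs API.

```c
#include <stdbool.h>

/* Simplified subset of the QP states, in the order a QP moves through them. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

/* Receive requests can be posted from INIT onward, but not in RESET. */
static bool can_post_recv(enum qp_state s)
{
        return s >= QP_INIT;
}

/* An attached QP only starts receiving multicast messages at RTR. */
static bool receives_multicast(enum qp_state s, bool attached)
{
        return attached && s >= QP_RTR;
}

/* A state in which receive buffers can already be posted, but no
 * multicast message can arrive yet: in this model that is INIT only. */
static bool good_attach_point(enum qp_state s)
{
        return can_post_recv(s) && !receives_multicast(s, true);
}
```

In this model RESET is too early (nothing can be posted) and RTR is too late
(messages may arrive before buffers are posted), which leaves INIT.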

On Tue, May 5, 2009 at 2:23 PM, Stefan Roscher
<ossrosch at linux.vnet.ibm.com> wrote:
> Hi,
>
> during testing with ib_diag tools (ib_send_lat/bw) we noticed some problems by using multicast
> option. The tools attach queue pairs to multicast groups while the queue
> pairs are in RESET state. In the IB standard there is no specific definition
> when the attach operation should be processed. In our opinion a QP should not be
> attached until it's in  RTR to receive data from the multicast group.
> What is the communitys opinion about that? Is it possible to change the diag tools?
>
> regards Stefan


From sokar6012 at hotmail.com  Tue May  5 05:53:39 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Tue, 5 May 2009 12:53:39 +0000
Subject: [ofa-general] SDP error
Message-ID: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>


Hello,

I'm running Debian 5.0 with OFED 1.4. RDMA works very well, but when I try
to use the SDP protocol with ssh, Netperf, or a simple client-server program
in C, I get socket errors like these:

NetPIPE: can't open stream socket! errno=97   (for NetPIPE)

Address family not supported by protocol ssh      (for ssh)

Address family not supported by protocol   (for client-server)

Does anyone recognize these errors?


From dorons at voltaire.com  Tue May  5 06:00:30 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Tue, 05 May 2009 16:00:30 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
Message-ID: <4A00386E.2050300@voltaire.com>

When max_op_vls is set to 0, do not force it to 1.
0 is a valid value which means "No change".

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_port.c   |    6 ------
 opensm/opensm/osm_subnet.c |    8 ++++++++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..db0c27e 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 	if (op_vls > p_subn->opt.max_op_vls)
 		op_vls = p_subn->opt.max_op_vls;
 
-	if (op_vls == 0) {
-		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
-			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
-		op_vls = 1;
-	}
-
 	OSM_LOG_EXIT(p_log);
 	return op_vls;
 }
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..71fc7a0 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# switch port connected to a CA or router port\n"
 		"leaf_head_of_queue_lifetime 0x%02x\n\n"
 		"# Limit the maximal operational VLs\n"
+		"# Virtual Lanes operational on this port\n"
+		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
+		"#    0: No change; valid only on Set()\n"
+		"#    1: VL0\n"
+		"#    2: VL0, VL1\n"
+		"#    3: VL0 - VL3\n"
+		"#    4: VL0 - VL7\n"
+		"#    5: VL0 - VL14\n"
 		"max_op_vls %u\n\n"
 		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
 		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
-- 
1.5.4
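
The value table added to the config comment above can also be read as a
mapping from the max_op_vls code to a number of data VLs. A minimal sketch
follows; op_vls_to_data_vl_count() is a hypothetical name, not an OpenSM
function, and the return values 0 and -1 are conventions chosen for this
sketch.

```c
/* Number of data VLs for a max_op_vls code, following the table in the
 * comment added by the patch. Returns 0 for "no change" and -1 for
 * codes outside the table. */
static int op_vls_to_data_vl_count(unsigned int code)
{
        switch (code) {
        case 0: return 0;       /* no change; valid only on Set() */
        case 1: return 1;       /* VL0 */
        case 2: return 2;       /* VL0, VL1 */
        case 3: return 4;       /* VL0 - VL3 */
        case 4: return 8;       /* VL0 - VL7 */
        case 5: return 15;      /* VL0 - VL14 */
        default: return -1;     /* not in the table */
        }
}
```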


From gmpc at sanger.ac.uk  Tue May  5 06:13:34 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Tue, 05 May 2009 14:13:34 +0100
Subject: [ofa-general] SDP error
In-Reply-To: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
References: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
Message-ID: <4A003B7E.3070604@sanger.ac.uk>

anthony garnier wrote:
> Hello,
> 
> i`m running a  debian 5.0 OS with ofed 1.4, RDMA work very well, but
> when I`m trying to use the SDP protocol with ssh, Netperf or a simple
> Client-Server programming in C, I got socket error like that :
> 
> NetPIPE: can't open stream socket! errno=97   (for Netpipe)
> 
> Address family not supported by protocol ssh      (for ssh)
> 
> Address family not supported by protocol   (for clent-server)
> 
> Someone knows those errors?

Is the ib_sdp module loaded?

Cheers,

Guy
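
On Linux, errno 97 is EAFNOSUPPORT, whose message is "Address family not
supported by protocol", so the three reports look like the same failure. A
quick way to check from C whether the kernel accepts the SDP address family
is sketched below. The family number 27 is an assumption about how OFED's SDP
defines AF_INET_SDP and should be checked against the installed headers;
probe_sdp_socket() is a hypothetical helper.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: OFED's SDP registers address family 27 (AF_INET_SDP).
 * Verify this against the libsdp / ib_sdp headers on the system. */
#define AF_INET_SDP_ASSUMED 27

/* Returns 0 if a stream socket of the assumed SDP family can be opened,
 * otherwise the errno reported by socket(). */
static int probe_sdp_socket(void)
{
        int fd = socket(AF_INET_SDP_ASSUMED, SOCK_STREAM, 0);

        if (fd < 0)
                return errno;
        close(fd);
        return 0;
}
```

If the call returns EAFNOSUPPORT, no loaded module has registered that
address family, which is consistent with ib_sdp not being loaded.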

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802




From hal.rosenstock at gmail.com  Tue May  5 06:14:11 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 5 May 2009 09:14:11 -0400
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 
	1
In-Reply-To: <4A00386E.2050300@voltaire.com>
References: <4A00386E.2050300@voltaire.com>
Message-ID: <f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>

On Tue, May 5, 2009 at 9:00 AM, Doron Shoham <dorons at voltaire.com> wrote:
> when setting max_op_vls = 0
> do not force it to 1.
> 0 is valid value which means "No change"
>
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c   |    6 ------
>  opensm/opensm/osm_subnet.c |    8 ++++++++
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..db0c27e 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>        if (op_vls > p_subn->opt.max_op_vls)
>                op_vls = p_subn->opt.max_op_vls;
>
> -       if (op_vls == 0) {
> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
> -               op_vls = 1;
> -       }
> -

Should that only be done when max_op_vls is 0 ?

Something like:
           if (op_vls > p_subn->opt.max_op_vls)
                op_vls = p_subn->opt.max_op_vls;
           else if (op_vls == 0) {
               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
               op_vls = 1;
          }

-- Hal

>        OSM_LOG_EXIT(p_log);
>        return op_vls;
>  }
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..71fc7a0 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>                "# switch port connected to a CA or router port\n"
>                "leaf_head_of_queue_lifetime 0x%02x\n\n"
>                "# Limit the maximal operational VLs\n"
> +               "# Virtual Lanes operational on this port\n"
> +               "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
> +               "#    0: No change; valid only on Set()\n"
> +               "#    1: VL0\n"
> +               "#    2: VL0, VL1\n"
> +               "#    3: VL0 - VL3\n"
> +               "#    4: VL0 - VL7\n"
> +               "#    5: VL0 - VL14\n"
>                "max_op_vls %u\n\n"
>                "# Force PortInfo:LinkSpeedEnabled on switch ports\n"
>                "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
> --
> 1.5.4
>


From Line.Holen at Sun.COM  Tue May  5 06:25:19 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Tue, 05 May 2009 15:25:19 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size of
	the hop table
Message-ID: <4A003E3F.2010100@Sun.COM>

The hops table of ftree_sw_t is too small to hold the hop count
of max_lid. Changed sw_create() to allocate hops[max_lid+1]
not hops[max_lid].

Signed-off-by: Line Holen <Line.Holen at sun.com>

---

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 0c4741a..8ed2f74 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
@@ -554,10 +555,10 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree,
 
 	/* initialize lft buffer */
 	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
-	p_sw->hops = malloc(p_osm_sw->max_lid_ho * sizeof(*(p_sw->hops)));
+	p_sw->hops = malloc((p_osm_sw->max_lid_ho + 1) * sizeof(*(p_sw->hops)));
 	if(p_sw->hops == NULL)
 		return NULL;
-	memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho);
+	memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho + 1);
 
 	return p_sw;
 }				/* sw_create() */
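
The off-by-one can be seen in isolation with a small stand-alone sketch. The
assumptions here: one-byte hop entries, NO_PATH as a stand-in for OSM_NO_PATH,
and hops_create() as a hypothetical helper rather than OpenSM code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_PATH 0xFF    /* stand-in for OSM_NO_PATH */

/* Allocate a hop table that can be indexed by every LID in 0..max_lid
 * inclusive, which takes max_lid + 1 entries. */
static uint8_t *hops_create(uint16_t max_lid)
{
        size_t entries = (size_t)max_lid + 1;
        uint8_t *hops = malloc(entries * sizeof(*hops));

        if (hops == NULL)
                return NULL;
        /* memset() counts bytes, so scale by the element size too. */
        memset(hops, NO_PATH, entries * sizeof(*hops));
        return hops;
}
```

With max_lid = 10 the table has 11 entries, so hops[10] is a valid slot. With
only max_lid entries, that same access would be one element past the end.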


From dorfman.eli at gmail.com  Tue May  5 06:48:32 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 05 May 2009 16:48:32 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
In-Reply-To: <f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
Message-ID: <4A0043B0.3030400@gmail.com>

Hal Rosenstock wrote:
> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham <dorons at voltaire.com> wrote:
>> when setting max_op_vls = 0
>> do not force it to 1.
>> 0 is valid value which means "No change"
>>
>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>> ---
>>  opensm/opensm/osm_port.c   |    6 ------
>>  opensm/opensm/osm_subnet.c |    8 ++++++++
>>  2 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
>> index 2e6c642..db0c27e 100644
>> --- a/opensm/opensm/osm_port.c
>> +++ b/opensm/opensm/osm_port.c
>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>>        if (op_vls > p_subn->opt.max_op_vls)
>>                op_vls = p_subn->opt.max_op_vls;
>>
>> -       if (op_vls == 0) {
>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>> -               op_vls = 1;
>> -       }
>> -
> 
> Should that only be done when max_op_vls is 0 ?
> 
> Something like:
>            if (op_vls > p_subn->opt.max_op_vls)
>                 op_vls = p_subn->opt.max_op_vls;
>            else if (op_vls == 0) {
>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>                op_vls = 1;
>           }

why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?
is there a firmware bug that reports op_vls=0?

Eli


From hal.rosenstock at gmail.com  Tue May  5 06:59:17 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 5 May 2009 09:59:17 -0400
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 
	1
In-Reply-To: <4A0043B0.3030400@gmail.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
Message-ID: <f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>

On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire)
<dorfman.eli at gmail.com> wrote:
> Hal Rosenstock wrote:
>> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>> when setting max_op_vls = 0
>>> do not force it to 1.
>>> 0 is valid value which means "No change"
>>>
>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>> ---
>>>  opensm/opensm/osm_port.c   |    6 ------
>>>  opensm/opensm/osm_subnet.c |    8 ++++++++
>>>  2 files changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
>>> index 2e6c642..db0c27e 100644
>>> --- a/opensm/opensm/osm_port.c
>>> +++ b/opensm/opensm/osm_port.c
>>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>>>        if (op_vls > p_subn->opt.max_op_vls)
>>>                op_vls = p_subn->opt.max_op_vls;
>>>
>>> -       if (op_vls == 0) {
>>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>> -               op_vls = 1;
>>> -       }
>>> -
>>
>> Should that only be done when max_op_vls is 0 ?
>>
>> Something like:
>>            if (op_vls > p_subn->opt.max_op_vls)
>>                 op_vls = p_subn->opt.max_op_vls;
>>            else if (op_vls == 0) {
>>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>                op_vls = 1;
>>           }
>
> why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?

> is there a firmware bug that reports op_vls=0?

There were (still are ?) implementations which returned op_vls 0 which
is why the words "valid on Set()" were added to the IBA spec and why I
don't feel safe removing the code as originally proposed but think my
alternative is safe and accomplishes the stated goal. Is there a
problem with my alternative proposal ?

-- Hal

> Eli
>
>
>


From dorfman.eli at gmail.com  Tue May  5 07:45:09 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 05 May 2009 17:45:09 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
In-Reply-To: <f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
References: <4A00386E.2050300@voltaire.com>	
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
Message-ID: <4A0050F5.2010208@gmail.com>

Hal Rosenstock wrote:
> On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire)
> <dorfman.eli at gmail.com> wrote:
>> Hal Rosenstock wrote:
>>> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>>> when setting max_op_vls = 0
>>>> do not force it to 1.
>>>> 0 is valid value which means "No change"
>>>>
>>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>>> ---
>>>>  opensm/opensm/osm_port.c   |    6 ------
>>>>  opensm/opensm/osm_subnet.c |    8 ++++++++
>>>>  2 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
>>>> index 2e6c642..db0c27e 100644
>>>> --- a/opensm/opensm/osm_port.c
>>>> +++ b/opensm/opensm/osm_port.c
>>>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>>>>        if (op_vls > p_subn->opt.max_op_vls)
>>>>                op_vls = p_subn->opt.max_op_vls;
>>>>
>>>> -       if (op_vls == 0) {
>>>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>>> -               op_vls = 1;
>>>> -       }
>>>> -
>>> Should that only be done when max_op_vls is 0 ?
>>>
>>> Something like:
>>>            if (op_vls > p_subn->opt.max_op_vls)
>>>                 op_vls = p_subn->opt.max_op_vls;
>>>            else if (op_vls == 0) {
>>>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>>                op_vls = 1;
>>>           }
>> why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?
> 
>> is there a firmware bug that reports op_vls=0?
> 
> There were (still are ?) implementations which returned op_vls 0 which
> is why the words "valid on Set()" were added to the IBA spec and why I
> don't feel safe removing the code as originally proposed but think my
> alternative is safe and accomplishes the stated goal. Is there a
> problem with my alternative proposal ?

no, but there are other fields in portinfo that are not validated.
for example link_speed_enabled (which allows 0 value only on Set as well).
also if a node returns op_vl=0 how do you know it supports op_vl=1?


Eli


From jon at opengridcomputing.com  Tue May  5 08:06:36 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Tue, 5 May 2009 10:06:36 -0500
Subject: [ofa-general] OFED, the backported <linux/scatterlist.h>
	header and sg_init_table()
In-Reply-To: <200905051021.36725.jackm@dev.mellanox.co.il>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<e2e108260905030836r108c8578rfb2ea5ffad6df50e@mail.gmail.com>
	<20090504145641.GA19565@opengridcomputing.com>
	<200905051021.36725.jackm@dev.mellanox.co.il>
Message-ID: <20090505150635.GA30788@opengridcomputing.com>

On Tue, May 05, 2009 at 10:21:36AM +0300, Jack Morgenstein wrote:
> On Monday 04 May 2009 17:56, Jon Mason wrote:
> > What's even worse is that sg_init_table is already defined in the
> > RHEL5.3 headers.  When coding up a header cleanup patch for RHEL5.3, I
> > noticed it was already defined in linux/ncrypto.h.  Also, it's there for
> > RHEL5.2 (and a few older kernels).
> > 
> I do not see that as "worse". ncrypto is the cryptographic scatterlist API, which is not used anywhere in OFED.
> Do we include this only because of its base scatterlist additions? 

No, we currently duplicate all the scatterlist functionality.  Including
ncrypto.h would greatly simplify the backport headers, but it is a
RHEL5.2/5.3 only solution.  If this change is needed for all other
backports, then a better solution will be needed.

> ncrypto.h itself has a list of includes.
> 
> I guess, though, you could do the following for scatterlist.h in the RHEL5.3 backport:
> ==============================================================================
> #ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__
> #define __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__
> 
> /* crypto.h includes scatterlist.h */
> #include<linux/ncrypto.h>
> 
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
>         sg->page = page;
> }
> 
> #define for_each_sg(sglist, sg, nr, __i)        \
>         for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)
> 
> static inline struct scatterlist *sg_next(struct scatterlist *sg)
> {
>         if (!sg) {
>                 BUG();
>                 return NULL;
>         }
>         return sg + 1;
> }
> 
> #endif
> ==============================================================================


It is more than just this.  By including ncrypto.h, crypto.h and
scatterlist.h in the RHEL backports are 99% smaller due to the removal
of duplicated functionality.  Obviously, this will need to be tested
heavily.

Thanks,
Jon

> 
> linux/ncrypto.h, though, is not part of, say, kernel 2.6.23.  Need to check if the above is RedHat-only solution.
> 
> - Jack


From hal.rosenstock at gmail.com  Tue May  5 11:30:45 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 5 May 2009 14:30:45 -0400
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 
	1
In-Reply-To: <4A0050F5.2010208@gmail.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<4A0050F5.2010208@gmail.com>
Message-ID: <f0e08f230905051130h2b953de1ldf4e9606ea360c3@mail.gmail.com>

On Tue, May 5, 2009 at 10:45 AM, Eli Dorfman (Voltaire)
<dorfman.eli at gmail.com> wrote:
> Hal Rosenstock wrote:
>> On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire)
>> <dorfman.eli at gmail.com> wrote:
>>> Hal Rosenstock wrote:
>>>> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>>>> when setting max_op_vls = 0
>>>>> do not force it to 1.
>>>>> 0 is valid value which means "No change"
>>>>>
>>>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>>>> ---
>>>>>  opensm/opensm/osm_port.c   |    6 ------
>>>>>  opensm/opensm/osm_subnet.c |    8 ++++++++
>>>>>  2 files changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
>>>>> index 2e6c642..db0c27e 100644
>>>>> --- a/opensm/opensm/osm_port.c
>>>>> +++ b/opensm/opensm/osm_port.c
>>>>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>>>>>        if (op_vls > p_subn->opt.max_op_vls)
>>>>>                op_vls = p_subn->opt.max_op_vls;
>>>>>
>>>>> -       if (op_vls == 0) {
>>>>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>>>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>>>> -               op_vls = 1;
>>>>> -       }
>>>>> -
>>>> Should that only be done when max_op_vls is 0 ?
>>>>
>>>> Something like:
>>>>            if (op_vls > p_subn->opt.max_op_vls)
>>>>                 op_vls = p_subn->opt.max_op_vls;
>>>>            else if (op_vls == 0) {
>>>>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>>>>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>>>>                op_vls = 1;
>>>>           }
>>> why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?
>>
>>> is there a firmware bug that reports op_vls=0?
>>
>> There were (still are ?) implementations which returned op_vls 0 which
>> is why the words "valid on Set()" were added to the IBA spec and why I
>> don't feel safe removing the code as originally proposed but think my
>> alternative is safe and accomplishes the stated goal. Is there a
>> problem with my alternative proposal ?
>
> no, but there are other fields in portinfo that are not validated.

Yes, there's some inconsistency here but it's based on field experience.

> for example link_speed_enabled (which allows 0 value only on Set as well).

Yes, but this field had the specific issue I noted and 0 being
returned on get was never observed on any of the other fields where 0
is valid on set (added there as well).

> also if a node returns op_vl=0 how do you know it supports op_vl=1?

op_vls 1 is always safe as at least 1 data VL must be supported. It's
just that possibly more op vls could have been supported if things had
been compliant.

-- Hal

> Eli
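
For reference, the alternative discussed in this thread can be written as a
stand-alone function and checked case by case. This is a sketch only:
clamp_op_vls() is a hypothetical name, port_op_vls stands for the value
computed from the link, and max_op_vls for the configured option.

```c
#include <stdint.h>

/* Clamp to the configured maximum first; only when no clamping happened
 * is a reported 0 corrected to 1 (VL0). */
static uint8_t clamp_op_vls(uint8_t port_op_vls, uint8_t max_op_vls)
{
        uint8_t op_vls = port_op_vls;

        if (op_vls > max_op_vls)
                op_vls = max_op_vls;    /* yields 0 ("no change") when max_op_vls is 0 */
        else if (op_vls == 0)
                op_vls = 1;             /* the port itself reported 0: use VL0 */
        return op_vls;
}
```

With max_op_vls = 0, a port value of 3 yields 0 ("No change"), so the stated
goal of the patch is met. A port value of 0 is still corrected to 1, also
when max_op_vls is 0.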


From sashak at voltaire.com  Tue May  5 12:05:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 22:05:46 +0300
Subject: [ofa-general] Re: Issues with combined routing in smpquery
In-Reply-To: <20090429160438.db62cde1.weiny2@llnl.gov>
References: <20090428202736.0ff049e5.weiny2@llnl.gov>
	<20090428205525.4ffdd778.weiny2@llnl.gov>
	<20090429145355.704fb2f5.weiny2@llnl.gov>
	<20090429160438.db62cde1.weiny2@llnl.gov>
Message-ID: <20090505190546.GA31846@sashak.voltaire.com>

Hi Ira,

On 16:04 Wed 29 Apr     , Ira Weiny wrote:
> 
> I know what changed but there appears to be a discrepancy between ib_mad_f 
> and the spec.
> 
> Commit 2dbb8b95d9dc27423a6fdb85d88ef385ecee0005
>    "libibmad: remove c99 definitions within the ib_mad_f structure"
> removed the designated initializers from ib_mad_f.  Applying the patch below
> aligns the MAD_FIELDS with ib_mad_f.

Thanks for looking into this.

> However, if you look at the offsets specified in ib_mad_f they are wrong.
> According to 14.2.1.2, DrSLID is at offset 32 bytes (256 bits).  ib_mad_f
> places the offset at 272.  I have verified the bytes using a debugger and byte
> 32 is the DrSLID.  I hesitate to say there is a bug in mad_set_field however
> there does appear to be something amiss.  :-/

I think everything is ok there. 14.2.1.2 says: at offset 32 bytes (256
bits) DrDLID - bits 0-15, DrSLID - bits 16-31.
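
To spell out the arithmetic, here is a minimal sketch in C. The names
(DR_LID_WORD_BYTE_OFFSET, field_bit_offset and so on) are made up for
illustration and are not libibmad definitions; the only inputs are the
byte offset and bit ranges quoted from 14.2.1.2 above.

```c
#include <assert.h>

/* Hypothetical names for illustration; not taken from libibmad. */
enum {
	DR_LID_WORD_BYTE_OFFSET = 32,	/* 14.2.1.2: word holding both LIDs */
	DRDLID_FIRST_BIT = 0,		/* DrDLID: bits 0-15 of that word */
	DRSLID_FIRST_BIT = 16		/* DrSLID: bits 16-31 of that word */
};

/* Bit offset of a field = byte offset of its word * 8 + first bit in word. */
static int field_bit_offset(int word_byte_offset, int first_bit)
{
	assert(first_bit >= 0 && first_bit < 32);
	return word_byte_offset * 8 + first_bit;
}
```

With these inputs DrDLID comes out at bit 256 and DrSLID at bit 272,
which is the 272 that ib_mad_f uses.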

Sasha


From sashak at voltaire.com  Tue May  5 12:08:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 22:08:10 +0300
Subject: [ofa-general] Re: [PATCH 1/3] Fix reversal of DRSLID and DRDLID in
	MAD_FIELDS enum
In-Reply-To: <20090430142950.85ef6368.weiny2@llnl.gov>
References: <20090430142950.85ef6368.weiny2@llnl.gov>
Message-ID: <20090505190810.GB31846@sashak.voltaire.com>

On 14:29 Thu 30 Apr     , Ira Weiny wrote:
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 30 Apr 2009 11:19:26 -0700
> Subject: [PATCH] Fix reversal of DRSLID and DRDLID in MAD_FIELDS enum.
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From devel-ofed at morey-chaisemartin.com  Tue May  5 12:33:38 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Tue, 05 May 2009 21:33:38 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size
	of	the hop table
In-Reply-To: <4A003E3F.2010100@Sun.COM>
References: <4A003E3F.2010100@Sun.COM>
Message-ID: <4A009492.2060408@morey-chaisemartin.com>

Le 05/05/2009 15:25, Line.Holen at Sun.COM a écrit :
> The hops table of ftree_sw_t is too small to hold the hop count
> of max_lid. Changed sw_create() to allocate hops[max_lid+1]
> not hops[max_lid].
> 
> Signed-off-by: Line Holen <Line.Holen at sun.com>


This patch seems right to me (at least agrees with other checks).
However, I've been using the ftree algorithm without this fix in thousands of tests and never had any seg fault problem and valgrind showed nothing either...
Would it be possible that the actual value is always < max_lid_ho ? 

Nicolas


From sashak at voltaire.com  Tue May  5 13:15:39 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:15:39 +0300
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090430142958.5811218f.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
Message-ID: <20090505201539.GC31846@sashak.voltaire.com>

On 14:29 Thu 30 Apr     , Ira Weiny wrote:
>  
>  static int
> -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
> +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
>  {
> -	int rc = add_port_to_dpath(path, nextport);
> -	if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
> -		f->fabric.maxhops_discovered = path->cnt;
> +	int rc = 0;
> +
> +	if (portid->lid && !portid->drpath.drslid) {
> +		/* If we were LID routed
> +		 * AND have not done so already
> +		 * we need to set up the drslid
> +		 */
> +		ib_portid_t selfportid = { 0 };
> +		if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0)
> +			return -1;
> +		portid->drpath.drslid = selfportid.lid;
> +		portid->drpath.drdlid = 0xFFFF;

How does it work? Shouldn't it be portid->drpath.drslid = portid->lid? What
am I missing?

Sasha

> +	}
> +
> +	rc = add_port_to_dpath(&portid->drpath, nextport);
> +
> +	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
> +		f->fabric.maxhops_discovered = portid->drpath.cnt;
>  	return (rc);
>  }
>  
> @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
>  			!= IB_PORT_PHYS_STATE_LINKUP)
>  		return -1;
>  
> -	if (extend_dpath(fabric, &path->drpath, portnum) < 0)
> +	if (extend_dpath(fabric, path, portnum) < 0)
>  		return -1;
>  
>  	if (query_node(fabric, &node_buf, &port_buf, path)) {
> @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms,
>  	if (!port)
>  		IBPANIC("out of memory");
>  
> -	if (node->node.type != IB_NODE_SWITCH &&
> -	    get_remote_node(fabric, node, port, from,
> +	if(get_remote_node(fabric, node, port, from,
>  				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
>  				0) < 0)
>  		return ((ibnd_fabric_t *)fabric);
> -- 
> 1.5.4.5
> 


From sashak at voltaire.com  Tue May  5 13:16:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:16:59 +0300
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Remove some underbars from
	internal names
In-Reply-To: <20090501214724.GA30974@comcast.net>
References: <20090501214724.GA30974@comcast.net>
Message-ID: <20090505201659.GD31846@sashak.voltaire.com>

On 17:47 Fri 01 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May  5 13:18:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:18:26 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags: Added libibnetdiscover
	to .spec file
In-Reply-To: <49FECA41.7060200@ext.bull.net>
References: <49FECA41.7060200@ext.bull.net>
Message-ID: <20090505201826.GE31846@sashak.voltaire.com>

On 12:58 Mon 04 May     , Nicolas Morey-Chaisemartin wrote:
> 
> Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May  5 13:19:32 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:19:32 +0300
Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM
In-Reply-To: <f0e08f230905040642h6da19161ycebedb7d03f11c72@mail.gmail.com>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org>
	<f0e08f230812181215t69d74222na072c8b52ab68270@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org>
	<f0e08f230812181231n397d9666s7342508fb41601b@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org>
	<f0e08f230812181302m266c20d7o38f0413ef7784f35@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org>
	<f0e08f230905040636v47f3e2b2y344156f3bd6df550@mail.gmail.com>
	<4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org>
	<f0e08f230905040642h6da19161ycebedb7d03f11c72@mail.gmail.com>
Message-ID: <20090505201932.GF31846@sashak.voltaire.com>

On 09:42 Mon 04 May     , Hal Rosenstock wrote:
> 
> I would expect OFED 1.5 to be based off the current master

Yes, it will be based on the current master.

Sasha


From sashak at voltaire.com  Tue May  5 13:21:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:21:33 +0300
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover.c: Cosmetic
	formatting changes
In-Reply-To: <20090504191732.GA29650@comcast.net>
References: <20090504191732.GA29650@comcast.net>
Message-ID: <20090505202133.GG31846@sashak.voltaire.com>

On 15:17 Mon 04 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May  5 13:24:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 5 May 2009 23:24:21 +0300
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr DB: Remove leading
	underscores from internal names
In-Reply-To: <20090504200018.GA4590@comcast.net>
References: <20090504200018.GA4590@comcast.net>
Message-ID: <20090505202421.GH31846@sashak.voltaire.com>

On 16:00 Mon 04 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From Line.Holen at Sun.COM  Tue May  5 13:26:36 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Tue, 05 May 2009 22:26:36 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size
	of	the hop table
In-Reply-To: <4A009492.2060408@morey-chaisemartin.com>
References: <4A003E3F.2010100@Sun.COM>
	<4A009492.2060408@morey-chaisemartin.com>
Message-ID: <4A00A0FC.6050606@Sun.COM>

On 05/ 5/09 09:33 PM, Nicolas Morey-Chaisemartin wrote:
> Le 05/05/2009 15:25, Line.Holen at Sun.COM a écrit :
>> The hops table of ftree_sw_t is too small to hold the hop count
>> of max_lid. Changed sw_create() to allocate hops[max_lid+1]
>> not hops[max_lid].
>>
>> Signed-off-by: Line Holen <Line.Holen at sun.com>
> 
> 
> This patch seems right to me (at least agrees with other checks).
> However, I've been using the ftree algorithm without this fix in thousands of tests and never had any seg fault problem and valgrind showed nothing either...
> Would it be possible that the actual value is always < max_lid_ho ? 
> 
> Nicolas
> 

I haven't experienced any seg fault either. But I have seen lack of connectivity
to the node having lid = max_lid. This was because hop[max_lid] contained a value 
of 0 rather than 0xff (for some of the switches) which made the routing stop too
early.
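
A minimal sketch of the sizing issue, in C. This is not the sw_create()
code; alloc_hops(), hops_get() and NO_PATH are hypothetical names, and
filling the table with memset() is an assumption made for the example.
The point is only that a table indexed by LID needs max_lid + 1 slots
for hops[max_lid] to be a valid, initialized entry.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_PATH 0xff	/* "no route known yet" marker, the 0xff above */

/* Allocate a hop table that can be indexed by every LID in 0..max_lid. */
static uint8_t *alloc_hops(uint16_t max_lid)
{
	size_t n = (size_t)max_lid + 1;	/* +1 so hops[max_lid] is valid */
	uint8_t *hops = malloc(n);

	if (hops)
		memset(hops, NO_PATH, n);
	return hops;
}

/* Bounds-checked read; lid == max_lid is a legal index. */
static uint8_t hops_get(const uint8_t *hops, uint16_t max_lid, uint16_t lid)
{
	assert(lid <= max_lid);
	return hops[lid];
}
```

With only max_lid slots, the entry for lid = max_lid lies one byte past
the allocation, so whatever is read there (0 in the case described
above) is not a value the routing code ever stored.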

Line


From jsquyres at cisco.com  Tue May  5 13:57:09 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 5 May 2009 16:57:09 -0400
Subject: [ofa-general] Memory registration redux
Message-ID: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>

Roland and I chatted on the phone today; I think I now understand  
Roland's counter-proposal (I clearly didn't before).  Let me try to  
summarize:

1. Add a new verb for "set this userspace flag to 1 if mr X ever  
becomes invalid"
2. Add a new verb for "no longer tell me if mr X ever becomes  
invalid" (i.e., remove the effects of #1)
3. Add run-time query indicating whether #1 works
4. Add [optional] memory registration caching to libibverbs

Prior to talking to Roland, I had envisioned *one* flag in userspace  
that indicated whether any memory registrations had become invalid.   
Roland's idea is that there is one flag *per registration* -- you can  
instantly tell whether a specific registration is valid.

Given this, let's keep the discussion going here in email -- perhaps  
the teleconference next Monday may become moot.

---------------------------------------------

More detail...

Here's a sample scenario:

- userspace registers memory buffer A
- userspace adds this registration to its cache
   (note: the cache could be in libibverbs; more on this below)
- userspace calls a [new] verb that says "tell me if mr X ever becomes  
invalid" and passes a pointer to a flag *in this registration's entry  
in the cache*
- userspace leaves the memory buffer A registered/cached

Some scenarios after the above has run:

1. Userspace uses buffer A again
    - userspace looks up and finds A's cached registration
    - userspace sees that this registration's flag is still 0, and  
therefore can proceed with communication

2. Application frees buffer A and it is returned to the OS (e.g., munmap)
    - IOMMU fires
    - change userspace flag corresponding to this registration to 1
    - memory is unregistered
    - pages are returned

3. Userspace uses buffer A again (after #2)
    - userspace looks up and finds A's cached registration
    - userspace sees that this registration's flag is 1
    - userspace therefore registers this memory again, and re-calls  
the verb saying "tell me if mr X ever becomes invalid" (etc.)
    - userspace proceeds with communication

The kernel has to store a little extra state for each registration  
(the address of the userspace flag to tweak if the registration ever  
becomes invalid), but it's small and bounded by the number of active  
registrations.
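
To make the scenarios concrete, here is a userspace-only simulation in
C. It is a sketch, not an API proposal: none of these names exist in
libibverbs. fake_register() stands in for ibv_reg_mr() plus the
proposed "tell me if mr X ever becomes invalid" verb, and
fake_kernel_invalidate() stands in for what the kernel would do in
scenario 2.

```c
#include <assert.h>
#include <stddef.h>

/* One cache entry per registration; the flag lives inside the entry. */
struct cache_entry {
	void *addr;
	size_t len;
	volatile int invalid;	/* the kernel would set this to 1 */
	int reg_count;		/* how many times we (re)registered */
};

/* Stand-in for registering and arming the per-registration flag. */
static void fake_register(struct cache_entry *e, void *addr, size_t len)
{
	e->addr = addr;
	e->len = len;
	e->invalid = 0;		/* fresh registration, flag re-armed */
	e->reg_count++;
}

/* Stand-in for the kernel side of scenario 2 (munmap, IOMMU fires). */
static void fake_kernel_invalidate(struct cache_entry *e)
{
	e->invalid = 1;
}

/* Scenarios 1 and 3: check the flag, re-register only if needed. */
static struct cache_entry *get_registration(struct cache_entry *e,
					    void *addr, size_t len)
{
	assert(e != NULL);
	if (e->addr != addr || e->len != len || e->invalid)
		fake_register(e, addr, len);
	return e;
}
```

A second get_registration() on the same buffer costs one flag check and
no re-registration; after fake_kernel_invalidate() the next call
registers again and clears the flag.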

 From MPI's perspective, this feature would be a great step forward --  
if we can query verbs at run-time to see if this feature is active, we  
can stop using the memory allocation hooks (yay!).  Obviously, MPIs  
will need to carry the old memory allocation hooks for backwards  
compatibility for a while, but if we can effectively deprecate them,  
that would be great.

**Specifically: it's the memory allocation hooks code in MPI  
implementations that is "fragile", "brittle", etc.  Avoiding the issue  
would be great; the code becomes much more robust because we're not  
subverting the memory allocator.

A secondary feature would be to add memory registration caching to  
libibverbs.  This wouldn't be *required* for MPIs since we all have  
registration caches already, but it might be nice to deprecate/ 
eventually remove that code in an MPI implementation, too.

The use case is similar to what was proposed earlier: add a flag to  
ibv_reg_mr() indicating whether you want the registration cached or  
not.  If the registration is to be cached, libibverbs would also  
invoke the "tell me if this mr ever becomes invalid" functionality.   
The MPI/application then *always* calls ibv_reg_mr() to register  
memory -- if the cache in libibverbs finds a valid matching mr, it can  
just return without a syscall.  As also described previously, calls to  
ibv_dereg_mr() do not necessarily need to actually unregister -- they  
can just mark a cached registration as "able to be evicted if necessary."

The other new verbs discussed in my prior mail would also still be  
useful (ibv_is_reg(), ibv_reg_mr_limits(), ibv_reg_mr_clean()).

**Note: the registration caches in MPIs today are not necessarily  
that complicated.  They're essentially balanced trees (e.g., in OMPI,  
it's a red-black tree).  This is not the "fragile", "brittle" code --  
it's just data structures and accounting.

=================================

I refrained from a specific new API proposal; let's argue over these  
ideas first and see if we can come to consensus.  If so, specific API  
proposals can follow.

-- 
Jeff Squyres
Cisco Systems


From weiny2 at llnl.gov  Tue May  5 14:19:40 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 5 May 2009 14:19:40 -0700
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090505201539.GC31846@sashak.voltaire.com>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090505201539.GC31846@sashak.voltaire.com>
Message-ID: <20090505141940.2f2d57e3.weiny2@llnl.gov>

On Tue, 5 May 2009 23:15:39 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 14:29 Thu 30 Apr     , Ira Weiny wrote:
> >  
> >  static int
> > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
> > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
> >  {
> > -	int rc = add_port_to_dpath(path, nextport);
> > -	if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
> > -		f->fabric.maxhops_discovered = path->cnt;
> > +	int rc = 0;
> > +
> > +	if (portid->lid && !portid->drpath.drslid) {
> > +		/* If we were LID routed
> > +		 * AND have not done so already
> > +		 * we need to set up the drslid
> > +		 */
> > +		ib_portid_t selfportid = { 0 };
> > +		if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0)
> > +			return -1;
> > +		portid->drpath.drslid = selfportid.lid;
> > +		portid->drpath.drdlid = 0xFFFF;
> 
> How does it work? Shouldn't it be portid->drpath.drslid = portid->lid? What
> am I missing?

Using a combined route where we are starting at some remote node.  We have to
use a directed route which does not start at "our" requester node.  From the
spec. C14-6 "bullet 6" states:

   "... If the directed route does not start from the requester node, then
   DrSLID shall be set to the LID of the requester node, which must have been
   assigned."

The requester node is "self" in this case.  If the DRSLID was set to the
portid->lid then the response would not come back to us because portid->lid is
the LID of the remote node we are starting the DR Path at.
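
A minimal sketch of that rule in C, using simplified stand-in structs
rather than the libibmad ib_portid_t and ib_dr_path_t.
start_dr_at_remote() and PERMISSIVE_LID are hypothetical names; the
0xFFFF value is the one the patch assigns to drdlid.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only. */
struct dr_path { uint16_t drslid, drdlid; };
struct port_id { uint16_t lid; struct dr_path drpath; };

#define PERMISSIVE_LID 0xFFFF

/*
 * remote->lid is where the LID-routed part ends and the directed route
 * begins. The response has to travel back to the requester, so DrSLID
 * is the requester's own LID, not remote->lid.
 */
static void start_dr_at_remote(struct port_id *remote, uint16_t self_lid)
{
	assert(self_lid != 0);	/* requester LID "must have been assigned" */
	if (remote->lid && !remote->drpath.drslid) {
		remote->drpath.drslid = self_lid;
		remote->drpath.drdlid = PERMISSIVE_LID;
	}
}
```

For a requester with LID 3 starting a directed route at a remote node
with LID 12, this leaves lid = 12 and sets drslid = 3.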

Ira

> 
> Sasha
> 
> > +	}
> > +
> > +	rc = add_port_to_dpath(&portid->drpath, nextport);
> > +
> > +	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
> > +		f->fabric.maxhops_discovered = portid->drpath.cnt;
> >  	return (rc);
> >  }
> >  
> > @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
> >  			!= IB_PORT_PHYS_STATE_LINKUP)
> >  		return -1;
> >  
> > -	if (extend_dpath(fabric, &path->drpath, portnum) < 0)
> > +	if (extend_dpath(fabric, path, portnum) < 0)
> >  		return -1;
> >  
> >  	if (query_node(fabric, &node_buf, &port_buf, path)) {
> > @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms,
> >  	if (!port)
> >  		IBPANIC("out of memory");
> >  
> > -	if (node->node.type != IB_NODE_SWITCH &&
> > -	    get_remote_node(fabric, node, port, from,
> > +	if(get_remote_node(fabric, node, port, from,
> >  				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
> >  				0) < 0)
> >  		return ((ibnd_fabric_t *)fabric);
> > -- 
> > 1.5.4.5
> > 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From weiny2 at llnl.gov  Tue May  5 14:21:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 5 May 2009 14:21:06 -0700
Subject: [ofa-general] Re: Issues with combined routing in smpquery
In-Reply-To: <20090505190546.GA31846@sashak.voltaire.com>
References: <20090428202736.0ff049e5.weiny2@llnl.gov>
	<20090428205525.4ffdd778.weiny2@llnl.gov>
	<20090429145355.704fb2f5.weiny2@llnl.gov>
	<20090429160438.db62cde1.weiny2@llnl.gov>
	<20090505190546.GA31846@sashak.voltaire.com>
Message-ID: <20090505142106.99d96c01.weiny2@llnl.gov>

On Tue, 5 May 2009 22:05:46 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 16:04 Wed 29 Apr     , Ira Weiny wrote:
> > 
> > I know what changed but there appears to be a discrepancy between ib_mad_f 
> > and the spec.
> > 
> > Commit 2dbb8b95d9dc27423a6fdb85d88ef385ecee0005
> >    "libibmad: remove c99 definitions within the ib_mad_f structure"
> > removed the designated initializers from ib_mad_f.  Applying the patch below
> > aligns the MAD_FIELDS with ib_mad_f.
> 
> Thanks for looking into this.
> 
> > However, if you look at the offsets specified in ib_mad_f they are wrong.
> > According to 14.2.1.2, DrSLID is at offset 32 bytes (256 bits).  ib_mad_f
> > places the offset at 272.  I have verified the bytes using a debugger and byte
> > 32 is the DrSLID.  I hesitate to say there is a bug in mad_set_field however
> > there does appear to be something amiss.  :-/
> 
> I think everything is ok there. 14.2.1.2 says: at offset 32 bytes (256
> bits) DrDLID - bits 0-15, DrSLID - bits 16-31.

Ah, ok, I see now.  I mixed up my bits...  ;-)

Ira

> 
> Sasha


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From jackm at dev.mellanox.co.il  Tue May  5 22:30:55 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 6 May 2009 08:30:55 +0300
Subject: [ofa-general] OFED,
	the backported <linux/scatterlist.h> header and sg_init_table()
In-Reply-To: <20090505150635.GA30788@opengridcomputing.com>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<200905051021.36725.jackm@dev.mellanox.co.il>
	<20090505150635.GA30788@opengridcomputing.com>
Message-ID: <200905060830.56062.jackm@dev.mellanox.co.il>

On Tuesday 05 May 2009 18:06, Jon Mason wrote:
> No, we currently duplicate all the scatterlist functionality.  Including
> ncrypto.h would greatly simplify the backport headers, but it is a
> RHEL5.2/5.3 only solution.  If this change is needed for all other
> backports, then a better solution will be needed.
> 
Each backport has its OWN directory.  The backports are not identical
for all kernels.  There is absolutely no problem with handling backports
per kernel/per distribution. Therefore, the RHEL 5.2/5.3 solution can be
used for those backports alone, without affecting any of the others.
Other backports will have a different change.

For RHEL5.2/5.3, my concern is that if someone actually writes an ncrypto kernel
application and includes ncrypto.h along with the infiniband headers, there will be
compilation problems because the scatterlist functionality fixes will appear twice.

Specifically, OFED 1.4.1 has the following INDIVIDUAL/independent backports, and
each one is handled differently:
2.6.16
2.6.16_sles10
2.6.16_sles10_sp1
2.6.16_sles10_sp2
2.6.17
2.6.18
2.6.18-EL5.1
2.6.18-EL5.2
2.6.18-EL5.3
2.6.18_FC6 (also for EL5.0)
2.6.18_suse10_2
2.6.19
2.6.20
2.6.21
2.6.22
2.6.22_suse10_3
2.6.23
2.6.24
2.6.25
2.6.26
2.6.9_U4
2.6.9_U5
2.6.9_U6
2.6.9_U7

- Jack


From sashak at voltaire.com  Wed May  6 03:07:44 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 13:07:44 +0300
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090430142958.5811218f.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
Message-ID: <20090506100744.GB10145@sk>

On 14:29 Thu 30 Apr     , Ira Weiny wrote:
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 29 Apr 2009 10:15:55 -0700
> Subject: [PATCH] Add combined routing support to libibnetdisc
> 
>    Also allow a scan to start at a switch.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   28 ++++++++++++++++++------
>  1 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index 0ff5134..fc19633 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport)
>  }
>  
>  static int
> -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
> +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
>  {
> -	int rc = add_port_to_dpath(path, nextport);
> -	if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
> -		f->fabric.maxhops_discovered = path->cnt;
> +	int rc = 0;
> +
> +	if (portid->lid && !portid->drpath.drslid) {
> +		/* If we were LID routed
> +		 * AND have not done so already
> +		 * we need to set up the drslid
> +		 */
> +		ib_portid_t selfportid = { 0 };
> +		if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0)
> +			return -1;

And wouldn't it be better, instead of resolving selfport on each
extend_dpath() call, to keep it already resolved somewhere in the fabric
structure?

Sasha

> +		portid->drpath.drslid = selfportid.lid;
> +		portid->drpath.drdlid = 0xFFFF;
> +	}
> +
> +	rc = add_port_to_dpath(&portid->drpath, nextport);
> +
> +	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
> +		f->fabric.maxhops_discovered = portid->drpath.cnt;
>  	return (rc);
>  }
>  
> @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
>  			!= IB_PORT_PHYS_STATE_LINKUP)
>  		return -1;
>  
> -	if (extend_dpath(fabric, &path->drpath, portnum) < 0)
> +	if (extend_dpath(fabric, path, portnum) < 0)
>  		return -1;
>  
>  	if (query_node(fabric, &node_buf, &port_buf, path)) {
> @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms,
>  	if (!port)
>  		IBPANIC("out of memory");
>  
> -	if (node->node.type != IB_NODE_SWITCH &&
> -	    get_remote_node(fabric, node, port, from,
> +	if(get_remote_node(fabric, node, port, from,
>  				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
>  				0) < 0)
>  		return ((ibnd_fabric_t *)fabric);
> -- 
> 1.5.4.5
> 


From sashak at voltaire.com  Wed May  6 03:08:44 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 13:08:44 +0300
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090505141940.2f2d57e3.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090505201539.GC31846@sashak.voltaire.com>
	<20090505141940.2f2d57e3.weiny2@llnl.gov>
Message-ID: <20090506100844.GC10145@sk>

On 14:19 Tue 05 May     , Ira Weiny wrote:
> > 
> > How does it work? Shouldn't it be portid->drpath.drslid = portid->lid? What
> > am I missing?
> 
> Using a combined route where we are starting at some remote node.  We have to
> use a directed route which does not start at "our" requester node.  From the
> spec. C14-6 "bullet 6" states:
> 
>    "... If the directed route does not start from the requester node, then
>    DrSLID shall be set to the LID of the requester node, which must have been
>    assigned."
> 
> The requester node is "self" in this case.

Makes sense.

Sasha


From vlad at lists.openfabrics.org  Wed May  6 03:21:58 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed,  6 May 2009 03:21:58 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090506-0200 daily build status
Message-ID: <20090506102158.E5ADBE6140F@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From slavas at voltaire.com  Wed May  6 03:24:55 2009
From: slavas at voltaire.com (Slava Strebkov)
Date: Wed, 6 May 2009 13:24:55 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
Message-ID: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>


In addition to the original proposal we suggest allocating special MLID
for the following MGIDs:
 1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
 2. FF12401bxxxx00000000000000000001 - All hosts
 3. FF12401bffff0000000000000000004d  - all Gateways
 4. FF12401bxxxx00000000000000000002 - all routers
 5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM

For all other cases we suggest that the same MLID be assigned to
different MGIDs if they have:
 1. The same P_Key
 2. The same signature - for IPoIB only
 3. The same LSB bits - bitmask configurable by user (default 10 bits)
	for example, the following are the same: 
	MGID1: 	FF12401bABCD000000000000xxxxx755
	MGID2: 	FF12401bABCD000000000000yyyyyB55

 Implementation:
 Since many mgroups will share the same mlid, each mlid-array entry will
 contain a fleximap holding those mgroups.
 An mgroup will be looked up by mlid (the index in the array) and mgid
 (the key in the fleximap).
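
 As an illustration of the LSB rule only, here is a small sketch in C.
 mgid_low_bits() is a hypothetical helper, not OpenSM code; it assumes
 the MGID is given as 16 raw bytes in network order and that the mask
 covers at most the last 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the low `bits` bits of the last two bytes of an MGID.
 * The default proposed above corresponds to bits = 10.
 */
static uint16_t mgid_low_bits(const uint8_t raw[16], unsigned bits)
{
	uint16_t tail = (uint16_t)((raw[14] << 8) | raw[15]);

	assert(bits >= 1 && bits <= 16);
	return (uint16_t)(tail & ((1u << bits) - 1));
}
```

 With the 10-bit default, MGIDs ending in ...755 and ...B55 both reduce
 to 0x355, so they would share an MLID when the P_Key and signature also
 match. With a 12-bit mask they would differ.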


 Slava Strebkov


From sashak at voltaire.com  Wed May  6 03:24:38 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 13:24:38 +0300
Subject: [ofa-general] Re: [PATCH] ib-mgmt: fixup ibsendtrap for windows
In-Reply-To: <F9CF6702599B420CA6B34588BAB7323F@amr.corp.intel.com>
References: <F9CF6702599B420CA6B34588BAB7323F@amr.corp.intel.com>
Message-ID: <20090506102438.GE10145@sk>

On 15:49 Mon 04 May     , Sean Hefty wrote:
> Fix some typecast issues.
> 
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Applied with change noted below. Thanks.

> ---
> 
>  infiniband-diags/src/ibsendtrap.c |   12 ++++++------
>  1 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
> index 469bc39..7ad588e 100644
> --- a/infiniband-diags/src/ibsendtrap.c
> +++ b/infiniband-diags/src/ibsendtrap.c
> @@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port)
>  static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port)
>  {
>  	n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO;
> -	n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port));
> +	n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port));

Instead of this casting I converted get_node_type() to return uint16_t.

Sasha


From sashak at voltaire.com  Wed May  6 04:07:19 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 14:07:19 +0300
Subject: [ofa-general] Re: [PATCH] osm_port.c: do not force max_op_vls = 0 to
	1
In-Reply-To: <4A00386E.2050300@voltaire.com>
References: <4A00386E.2050300@voltaire.com>
Message-ID: <20090506110719.GF10145@sk>

Hi Doron,

On 16:00 Tue 05 May     , Doron Shoham wrote:
> when setting max_op_vls = 0
> do not force it to 1.
> 0 is valid value which means "No change"
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c   |    6 ------
>  opensm/opensm/osm_subnet.c |    8 ++++++++
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..db0c27e 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  	if (op_vls > p_subn->opt.max_op_vls)
>  		op_vls = p_subn->opt.max_op_vls;
>  
> -	if (op_vls == 0) {
> -		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> -			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
> -		op_vls = 1;
> -	}
> -

I think that originally it was done as a workaround for some old and buggy
device. Personally I don't remember such cases in practice, but maybe the
Mellanox guys could say more. Yevgeny?

Basically if this is not needed anymore I'm fine with removing it (although
it doesn't seem to be the direct purpose of the patch).

>  	OSM_LOG_EXIT(p_log);
>  	return op_vls;
>  }
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..71fc7a0 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>  		"# switch port connected to a CA or router port\n"
>  		"leaf_head_of_queue_lifetime 0x%02x\n\n"
>  		"# Limit the maximal operational VLs\n"
> +		"# Virtual Lanes operational on this port\n"
> +		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
> +		"#    0: No change; valid only on Set()\n"
> +		"#    1: VL0\n"
> +		"#    2: VL0, VL1\n"
> +		"#    3: VL0 - VL3\n"
> +		"#    4: VL0 - VL7\n"
> +		"#    5: VL0 - VL14\n"
>  		"max_op_vls %u\n\n"

Using 'max_op_vls = 0' will force a PortInfo update (see how
osm_physp_calc_link_op_vls() is used in osm_link_mgr.c and
osm_lid_mgr.c) with a "No change" request, which is obviously not desired.
So the max_op_vls = 0 case should be either handled properly or not permitted.

Sasha

>  		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
>  		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
> -- 
> 1.5.4
> 


From sashak at voltaire.com  Wed May  6 04:21:35 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 14:21:35 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
In-Reply-To: <f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
Message-ID: <20090506112135.GG10145@sk>

On 09:59 Tue 05 May     , Hal Rosenstock wrote:
> >>
> >> Should that only be done when max_op_vls is 0 ?
> >>
> >> Something like:
> >> 	if (op_vls > p_subn->opt.max_op_vls)
> >> 		op_vls = p_subn->opt.max_op_vls;
> >> 	else if (op_vls == 0) {
> >> 		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> >> 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
> >> 		op_vls = 1;
> >> 	}
> >
> > why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?
> 
> > is there a firmware bug that reports op_vls=0?
> 
> There were (still are ?) implementations which returned op_vls 0 which
> is why the words "valid on Set()" were added to the IBA spec and why I
> don't feel safe removing the code as originally proposed but think my
> alternative is safe and accomplishes the stated goal. Is there a
> problem with my alternative proposal ?

Assuming that all this was done as a workaround for buggy OperVLs reports,
its relevance shouldn't be a function of the max_op_vls configuration value.

I see two independent issues here: (1) removing (or keeping) the zero
OperVLs report workaround and (2) supporting and properly handling the
max_op_vls = 0 configuration value.

Sasha


From hal.rosenstock at gmail.com  Wed May  6 04:29:37 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 6 May 2009 07:29:37 -0400
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 
	1
In-Reply-To: <20090506112135.GG10145@sk>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
Message-ID: <f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>

On Wed, May 6, 2009 at 7:21 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 09:59 Tue 05 May     , Hal Rosenstock wrote:
>> >>
>> >> Should that only be done when max_op_vls is 0 ?
>> >>
>> >> Something like:
>> >> 	if (op_vls > p_subn->opt.max_op_vls)
>> >> 		op_vls = p_subn->opt.max_op_vls;
>> >> 	else if (op_vls == 0) {
>> >> 		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>> >> 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>> >> 		op_vls = 1;
>> >> 	}
>> >
>> > why do you suggest a special case for op_vls=0 (and not for other portinfo fields)?
>>
>> > is there a firmware bug that reports op_vls=0?
>>
>> There were (still are ?) implementations which returned op_vls 0 which
>> is why the words "valid on Set()" were added to the IBA spec and why I
>> don't feel safe removing the code as originally proposed but think my
>> alternative is safe and accomplishes the stated goal. Is there a
>> problem with my alternative proposal ?
>
> Assuming that all this was done as a workaround for buggy OperVLs reports,
> its relevance shouldn't be a function of the max_op_vls configuration value.
>
> I see two independent issues here: (1) removing (or keeping) the zero
> OperVLs report workaround and (2) supporting and properly handling the
> max_op_vls = 0 configuration value.

Agreed, and IMO (as I stated in previous emails) the workaround should
be kept, as I don't think there is a way of knowing for sure that those
non-compliant implementations are not in the field anymore. If the
push is to remove this, then maybe another option for this workaround
should be added, with the default being to have the workaround off.
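
A rough sketch of what such an option could look like. The structure,
the field name fix_zero_op_vls and the helper are hypothetical and do
not exist in OpenSM; the point is only that the zero-OperVLs fixup is
keyed on its own switch and not on max_op_vls:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option block; the field name is an assumption */
struct vl_opts {
	bool fix_zero_op_vls;	/* workaround for devices reporting OperVLs = 0 */
};

/* Correct a reported OperVLs of 0 to 1 (VL0) only if the option is on */
static uint8_t fixup_reported_op_vls(const struct vl_opts *opts,
				     uint8_t reported)
{
	if (reported == 0 && opts->fix_zero_op_vls)
		return 1;	/* VL0 only */
	return reported;
}
```

A zero-initialized option block gives the "workaround off" default.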

-- Hal

>
> Sasha
>


From hnrose at comcast.net  Wed May  6 04:35:52 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 6 May 2009 07:35:52 -0400
Subject: [ofa-general] [PATCH] opensm/osm_perfmgr_db.c: Remove unneeded
	initialization in perfmgr_db_print_by_name
Message-ID: <20090506113552.GA32102@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c
index b0bfd36..3034894 100644
--- a/opensm/opensm/osm_perfmgr_db.c
+++ b/opensm/opensm/osm_perfmgr_db.c
@@ -693,8 +693,8 @@ static void db_dump(cl_map_item_t * const p_map_item, void *context)
 void
 perfmgr_db_print_by_name(perfmgr_db_t * db, char *nodename, FILE *fp)
 {
-	cl_map_item_t *item = NULL;
-	db_node_t *node = NULL;
+	cl_map_item_t *item;
+	db_node_t *node;
 
 	cl_plock_acquire(&db->lock);
 

From sashak at voltaire.com  Wed May  6 04:46:43 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 14:46:43 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c Increase the size
	of the hop table
In-Reply-To: <4A003E3F.2010100@Sun.COM>
References: <4A003E3F.2010100@Sun.COM>
Message-ID: <20090506114643.GI10145@sk>

On 15:25 Tue 05 May     , Line.Holen at Sun.COM wrote:
> The hops table of ftree_sw_t is too small to hold the hop count
> of max_lid. Changed sw_create() to allocate hops[max_lid+1]
> not hops[max_lid].
> 
> Signed-off-by: Line Holen <Line.Holen at sun.com>

Applied. Thanks.
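
For reference, the off-by-one being fixed looks roughly like this. This
is a minimal sketch with stand-in names (struct hop_table and its
helpers are illustrative, not the ftree_sw_t code): a table indexed
directly by LID needs max_lid + 1 entries for hops[max_lid] to be valid.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-switch hop table */
struct hop_table {
	uint16_t max_lid;
	uint8_t *hops;		/* indexed directly by LID: 0..max_lid */
};

static int hop_table_init(struct hop_table *t, uint16_t max_lid)
{
	/* max_lid + 1 entries so that hops[max_lid] is a valid element */
	t->hops = calloc((size_t)max_lid + 1, sizeof(*t->hops));
	if (!t->hops)
		return -1;
	t->max_lid = max_lid;
	return 0;
}

static void hop_table_set(struct hop_table *t, uint16_t lid, uint8_t hops)
{
	/* ignore LIDs outside the table instead of writing past the end */
	if (lid <= t->max_lid)
		t->hops[lid] = hops;
}
```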

Sasha


From sashak at voltaire.com  Wed May  6 04:51:06 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 14:51:06 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: Remove unneeded
	initialization in perfmgr_db_print_by_name
In-Reply-To: <20090506113552.GA32102@comcast.net>
References: <20090506113552.GA32102@comcast.net>
Message-ID: <20090506115106.GJ10145@sk>

On 07:35 Wed 06 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Wed May  6 05:29:52 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 15:29:52 +0300
Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc
	including converting ibqueryerrors.pl
In-Reply-To: <20090501165334.59bf72a9.weiny2@llnl.gov>
References: <20090422185441.6f8601dc.weiny2@llnl.gov>
	<20090425175710.GI28604@sk>
	<20090427150409.9c10e479.weiny2@llnl.gov>
	<20090501173806.GF14714@sk.iol.unh.edu>
	<20090501165334.59bf72a9.weiny2@llnl.gov>
Message-ID: <20090506122952.GA28975@sk>

On 16:53 Fri 01 May     , Ira Weiny wrote:
> 
> I did not attempt to preserve any switch or HCA order printing.  I don't know
> of any utils which require this.  Am I wrong?

I don't think any utils need it (unless there are bugs); I just diffed
the old and new outputs.

Sasha


From tziporet at dev.mellanox.co.il  Wed May  6 07:09:00 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 06 May 2009 17:09:00 +0300
Subject: [ofa-general] Memory registration redux
In-Reply-To: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
Message-ID: <4A0199FC.4000006@mellanox.co.il>

Jeff Squyres wrote:
> Roland and I chatted on the phone today; I think I now understand 
> Roland's counter-proposal (I clearly didn't before).  Let me try to 
> summarize:
>
> 1. Add a new verb for "set this userspace flag to 1 if mr X ever 
> becomes invalid"
> 2. Add a new verb for "no longer tell me if mr X ever becomes invalid" 
> (i.e., remove the effects of #1)
> 3. Add run-time query indicating whether #1 works
> 4. Add [optional] memory registration caching to libibverbs
>
> Prior to talking to Roland, I had envisioned *one* flag in userspace 
> that indicated whether any memory registrations had become invalid.  
> Roland's idea is that there is one flag *per registration* -- you can 
> instantly tell whether a specific registration is valid.
>
> Given this, let's keep the discussion going here in email -- perhaps 
> the teleconference next Monday may become moot.
I think the new proposal is good (but I am not an MPI expert).
If we implement it soon we will be able to enable it in OFED 1.5 too.
I think the cache in libibverbs can be delayed since it can be added
after the API with the kernel is available.
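
To make the per-registration flag idea concrete, here is a rough
userspace sketch. Everything in it is hypothetical: the proposed verbs
do not exist yet, and struct mr_cache_entry and mr_cache_lookup() are
made-up names. The 'invalid' field stands for the flag that would be
set to 1 when a registration stops being valid.

```c
#include <stddef.h>

/*
 * Hypothetical cache entry: 'invalid' models the per-registration
 * userspace flag from item 1 of the summary above.
 */
struct mr_cache_entry {
	void *addr;
	size_t len;
	volatile int invalid;
};

/* Return the entry covering [addr, addr + len) if it is still valid */
static struct mr_cache_entry *
mr_cache_lookup(struct mr_cache_entry *cache, size_t n, void *addr, size_t len)
{
	char *start = addr;
	size_t i;

	for (i = 0; i < n; i++) {
		char *base = cache[i].addr;

		if (cache[i].invalid)
			continue;	/* registration went stale; skip it */
		if (start >= base && start + len <= base + cache[i].len)
			return &cache[i];
	}
	return NULL;
}
```

With one flag per registration the lookup can skip just the stale
entry; a single global flag would force the whole cache to be
revalidated.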

Tziporet


From hnrose at comcast.net  Wed May  6 07:13:26 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Wed, 6 May 2009 10:13:26 -0400
Subject: [ofa-general] [PATCH] opensm/PerfMgr: Cosmetic changes
Message-ID: <20090506141326.GA29542@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h
index e6a1cfe..855a2ff 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -96,7 +96,7 @@ typedef struct redir {
 	ib_net32_t redir_qp;
 } redir_t;
 
-/* Node to store information about which nodes we are monitoring */
+/* Node to store information about nodes being monitored */
 typedef struct monitored_node {
 	cl_map_item_t map_item;
 	struct monitored_node *next;
@@ -108,6 +108,7 @@ typedef struct monitored_node {
 } monitored_node_t;
 
 struct osm_opensm;
+
 /****s* OpenSM: PerfMgr/osm_perfmgr_t
 *  This object should be treated as opaque and should
 *  be manipulated only through the provided functions.
@@ -130,9 +131,9 @@ typedef struct osm_perfmgr {
 	uint16_t sweep_time_s;
 	perfmgr_db_t *db;
 	atomic32_t outstanding_queries;	/* this along with sig_query */
-	cl_event_t sig_query;	/* will throttle our querys */
+	cl_event_t sig_query;	/* will throttle our queries */
 	uint32_t max_outstanding_queries;
-	cl_qmap_t monitored_map;	/* map the nodes we are tracking */
+	cl_qmap_t monitored_map;	/* map the nodes being tracked */
 	monitored_node_t *remove_list;
 } osm_perfmgr_t;
 /*
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 93644a0..ecfdbda 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -47,7 +47,6 @@
 #endif				/* HAVE_CONFIG_H */
 
 #ifdef ENABLE_OSM_PERF_MGR
-
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
@@ -66,7 +65,7 @@
 #include <opensm/osm_node.h>
 #include <opensm/osm_opensm.h>
 
-#define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe
+#define PERFMGR_INITIAL_TID_VALUE 0xcafe
 
 #if ENABLE_OSM_PERF_MGR_PROFILE
 struct {
@@ -114,8 +113,6 @@ static inline void diff_time(struct timeval *before, struct timeval *after,
 }
 #endif
 
-extern int wait_for_pending_transactions(osm_stats_t * stats);
-
 /**********************************************************************
  * Internal helper functions.
  **********************************************************************/
@@ -200,8 +197,9 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 
 	OSM_LOG_ENTER(pm->log);
 
-	/* go ahead and get the monitored node struct to have the printable
-	 * name if needed in messages
+	/*
+	 * get the monitored node struct to have the printable name
+	 * for log messages
 	 */
 	if ((p_node = cl_qmap_get(&pm->monitored_map, node_guid)) ==
 	    cl_qmap_end(&pm->monitored_map)) {
@@ -290,7 +288,7 @@ Exit:
 /**********************************************************************
  * Unbind the PerfMgr from the vendor layer for MAD sends/receives
  **********************************************************************/
-static void osm_perfmgr_mad_unbind(osm_perfmgr_t * pm)
+static void perfmgr_mad_unbind(osm_perfmgr_t * pm)
 {
 	OSM_LOG_ENTER(pm->log);
 	if (pm->bind_handle == OSM_BIND_INVALID_HANDLE) {
@@ -307,7 +305,7 @@ Exit:
  **********************************************************************/
 static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port)
 {
-	ib_net32_t qp = cl_ntoh32(1);
+	ib_net32_t qp = IB_QP1;
 
 	if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
 	    mon_node->redir_port[port].redir_lid &&
@@ -396,7 +394,7 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr,
 	status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE);
 
 	if (status == IB_SUCCESS) {
-		/* pause this thread if we have too many outstanding requests */
+		/* pause thread if there are too many outstanding requests */
 		cl_atomic_inc(&(perfmgr->outstanding_queries));
 		if (perfmgr->outstanding_queries >
 		    perfmgr->max_outstanding_queries) {
@@ -426,7 +424,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context)
 
 	if (cl_qmap_get(&pm->monitored_map, node_guid)
 	    == cl_qmap_end(&pm->monitored_map)) {
-		/* if not already in our map add it */
+		/* if not already in map add it */
 		num_ports = osm_node_get_num_physp(node);
 		mon_node = malloc(sizeof(*mon_node) +
 				  sizeof(redir_t) * num_ports);
@@ -484,7 +482,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context)
 	num_ports = osm_node_get_num_physp(node);
 	node_guid = cl_ntoh64(node->node_info.node_guid);
 
-	/* make sure we have a database object ready to store this information */
+	/* make sure there is a database object ready to store this info */
 	if (perfmgr_db_create_entry(pm->db, node_guid, mon_node->esp0,
 				    num_ports, node->print_desc) !=
 	    PERFMGR_EVENT_DB_SUCCESS) {
@@ -538,8 +536,9 @@ Exit:
 
 /**********************************************************************
  * Discovery stuff.
- * Basically this code should not be here, but merged with main OpenSM
+ * This code should not be here, but merged with main OpenSM
  **********************************************************************/
+extern int wait_for_pending_transactions(osm_stats_t * stats);
 extern void osm_drop_mgr_process(IN osm_sm_t * sm);
 
 static int sweep_hop_1(osm_sm_t * sm)
@@ -680,7 +679,7 @@ static int sweep_hop_0(osm_sm_t * sm)
 
 	h_bind = osm_sm_mad_ctrl_get_bind_handle(&sm->mad_ctrl);
 	if (h_bind == OSM_BIND_INVALID_HANDLE) {
-		OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "No bound ports.\n");
+		OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "No bound ports\n");
 		return -1;
 	}
 
@@ -773,7 +772,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 	gettimeofday(&before, NULL);
 #endif
 	pm->sweep_state = PERFMGR_SWEEP_ACTIVE;
-	/* With the global lock held collect the node guids */
+	/* With the global lock held, collect the node guids */
 	/* FIXME we should be able to track SA notices
 	 * and not have to sweep the node_guid_tbl each pass
 	 */
@@ -785,9 +784,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 	/* then for each node query their counters */
 	cl_qmap_apply_func(&pm->monitored_map, perfmgr_query_counters, pm);
 
-	/* Clean out any nodes found to be removed during the
-	 * sweep
-	 */
+	/* clean out any nodes found to be removed during the sweep */
 	remove_marked_nodes(pm);
 
 #if ENABLE_OSM_PERF_MGR_PROFILE
@@ -812,7 +809,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 
 /**********************************************************************
  * PerfMgr timer - loop continuously and signal SM to run PerfMgr
- * processor.
+ * processor if enabled.
  **********************************************************************/
 static void perfmgr_sweep(void *arg)
 {
@@ -830,7 +827,7 @@ void osm_perfmgr_shutdown(osm_perfmgr_t * pm)
 	OSM_LOG_ENTER(pm->log);
 	cl_timer_stop(&pm->sweep_timer);
 	cl_disp_unregister(pm->pc_disp_h);
-	osm_perfmgr_mad_unbind(pm);
+	perfmgr_mad_unbind(pm);
 	OSM_LOG_EXIT(pm->log);
 }
 
@@ -846,12 +843,12 @@ void osm_perfmgr_destroy(osm_perfmgr_t * pm)
 
 /**********************************************************************
  * Detect if someone else on the network could have cleared the counters
- * without us knowing.  This is easy to detect because the counters never wrap
- * but are "sticky"
+ * without us knowing.  This is easy to detect because the counters never
+ * wrap but are "sticky"
  *
- * The one time this will not work is if the port is getting errors fast enough
- * to have the reading overtake the previous reading.  In this case counters
- * will be missed.
+ * The one time this will not work is if the port is getting errors fast
+ * enough to have the reading overtake the previous reading.  In this case,
+ * counters will be missed.
  **********************************************************************/
 static void perfmgr_check_oob_clear(osm_perfmgr_t * pm,
 				    monitored_node_t * mon_node, uint8_t port,
@@ -1051,9 +1048,9 @@ static void perfmgr_log_events(osm_perfmgr_t * pm,
 
 /**********************************************************************
  * The dispatcher uses a thread pool which will call this function when
- * we have a thread available to process our mad received from the wire.
+ * there is a thread available to process the mad received on the wire.
  **********************************************************************/
-static void pc_rcv_process(void *context, void *data)
+static void pc_recv_process(void *context, void *data)
 {
 	osm_perfmgr_t *pm = context;
 	osm_madw_t *p_madw = data;
@@ -1070,8 +1067,9 @@ static void pc_rcv_process(void *context, void *data)
 
 	OSM_LOG_ENTER(pm->log);
 
-	/* go ahead and get the monitored node struct to have the printable
-	 * name if needed in messages
+	/*
+	 * get the monitored node struct to have the printable name
+	 * for log messages
 	 */
 	if ((p_node = cl_qmap_get(&pm->monitored_map, node_guid)) ==
 	    cl_qmap_end(&pm->monitored_map)) {
@@ -1207,7 +1205,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm,
 	pm->log = &osm->log;
 	pm->mad_pool = &osm->mad_pool;
 	pm->vendor = osm->p_vendor;
-	pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE;
+	pm->trans_id = PERFMGR_INITIAL_TID_VALUE;
 	pm->lock = &osm->lock;
 	pm->state =
 	    p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE;
@@ -1227,7 +1225,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm,
 	}
 
 	pm->pc_disp_h = cl_disp_register(&osm->disp, OSM_MSG_MAD_PORT_COUNTERS,
-					 pc_rcv_process, pm);
+					 pc_recv_process, pm);
 	if (pm->pc_disp_h == CL_DISP_INVALID_HANDLE) {
 		perfmgr_db_destroy(pm->db);
 		goto Exit;
@@ -1256,7 +1254,7 @@ void osm_perfmgr_clear_counters(osm_perfmgr_t * pm)
 }
 
 /*******************************************************************
- * Have the DB dump its information to the file specified
+ * Dump the DB information to the file specified
  *******************************************************************/
 void osm_perfmgr_dump_counters(osm_perfmgr_t * pm, perfmgr_db_dump_t dump_type)
 {
@@ -1276,7 +1274,7 @@ void osm_perfmgr_dump_counters(osm_perfmgr_t * pm, perfmgr_db_dump_t dump_type)
 }
 
 /*******************************************************************
- * Have the DB print its information to the fp specified
+ * Print the DB information to the fp specified
  *******************************************************************/
 void osm_perfmgr_print_counters(osm_perfmgr_t * pm, char *nodename, FILE * fp)
 {
diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c
index 3034894..e5dfc19 100644
--- a/opensm/opensm/osm_perfmgr_db.c
+++ b/opensm/opensm/osm_perfmgr_db.c
@@ -247,8 +247,8 @@ debug_dump_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num,
  * perfmgr_db_err_reading_t functions
  **********************************************************************/
 perfmgr_db_err_t
-perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid,
-			   uint8_t port, perfmgr_db_err_reading_t * reading)
+perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port,
+			   perfmgr_db_err_reading_t * reading)
 {
 	db_port_t *p_port = NULL;
 	db_node_t *node = NULL;
@@ -389,8 +389,8 @@ debug_dump_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num,
  * perfmgr_db_data_cnt_reading_t functions
  **********************************************************************/
 perfmgr_db_err_t
-perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid,
-			  uint8_t port, perfmgr_db_data_cnt_reading_t * reading)
+perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port,
+			  perfmgr_db_data_cnt_reading_t * reading)
 {
 	db_port_t *p_port = NULL;
 	db_node_t *node = NULL;


From sashak at voltaire.com  Wed May  6 08:00:31 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 6 May 2009 18:00:31 +0300
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Cosmetic changes
In-Reply-To: <20090506141326.GA29542@comcast.net>
References: <20090506141326.GA29542@comcast.net>
Message-ID: <20090506150031.GA29470@sk>

On 10:13 Wed 06 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From weiny2 at llnl.gov  Wed May  6 08:45:33 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 08:45:33 -0700
Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc
 including converting ibqueryerrors.pl
In-Reply-To: <20090506122952.GA28975@sk>
References: <20090422185441.6f8601dc.weiny2@llnl.gov>
	<20090425175710.GI28604@sk>
	<20090427150409.9c10e479.weiny2@llnl.gov>
	<20090501173806.GF14714@sk.iol.unh.edu>
	<20090501165334.59bf72a9.weiny2@llnl.gov>
	<20090506122952.GA28975@sk>
Message-ID: <20090506084533.2d182a1d.weiny2@llnl.gov>

On Wed, 6 May 2009 15:29:52 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 16:53 Fri 01 May     , Ira Weiny wrote:
> > 
> > I did not attempt to preserve any switch or HCA order printing.  I don't know
> > of any utils which require this.  Am I wrong?
> 
> I don't think any utils need it (unless there are bugs); I just diffed
> the old and new outputs.

Ok, we are in agreement then.  I have done extensive testing by diffing the output and I agree it is a pain...  ;-)  Sorry about that.

Ira


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


From jon at opengridcomputing.com  Wed May  6 08:56:42 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Wed, 6 May 2009 10:56:42 -0500
Subject: [ofa-general] OFED, the backported <linux/scatterlist.h>
	header and sg_init_table()
In-Reply-To: <200905060830.56062.jackm@dev.mellanox.co.il>
References: <e2e108260905020446j41996c85o35d41f09dc4ff64d@mail.gmail.com>
	<200905051021.36725.jackm@dev.mellanox.co.il>
	<20090505150635.GA30788@opengridcomputing.com>
	<200905060830.56062.jackm@dev.mellanox.co.il>
Message-ID: <20090506155641.GA4935@opengridcomputing.com>

On Wed, May 06, 2009 at 08:30:55AM +0300, Jack Morgenstein wrote:
> On Tuesday 05 May 2009 18:06, Jon Mason wrote:
> > No, we currently duplicate all the scatterlist functionality.  Including
> > ncrypto.h would greatly simplify the backport headers, but it is a
> > RHEL5.2/5.3 only solution.  If this change is needed for all other
> > backports, then a better solution will be needed.
> > 
> Each backport has its OWN directory.  The backports are not identical
> for all kernels.  There is absolutely no problem with handling backports
> per kernel/per distribution. Therefore, the RHEL 5.2/5.3 solution can be
> used for those backports alone, without affecting any of the others.
> Other backports will have a different change.

Yes, the point I was trying to make is that the fix I have will only
apply to RHEL5.  If a more sweeping change is needed, then something
else will need to be done.  I believe the root issue as reported
was on RHEL5.3, so this will probably solve their problem unless they
need it for all OFED supported releases.
 
> For RHEL5.2/5.3, my concern is that if someone will actually write an ncrypto kernel
> application, and include ncrypto.h along with the infiniband headers, there will be
> compilation problems because the scatterlist functionality fixes will appear twice.

Excellent point.  The patch I just sent out should prevent this from
happening as well.

> Specifically, OFED 1.4.1 has the following INDIVIDUAL/independent backports, and
> each one is handled differently:
> 2.6.16
> 2.6.16_sles10
> 2.6.16_sles10_sp1
> 2.6.16_sles10_sp2
> 2.6.17
> 2.6.18
> 2.6.18-EL5.1
> 2.6.18-EL5.2
> 2.6.18-EL5.3
> 2.6.18_FC6 (also for EL5.0)
> 2.6.18_suse10_2
> 2.6.19
> 2.6.20
> 2.6.21
> 2.6.22
> 2.6.22_suse10_3
> 2.6.23
> 2.6.24
> 2.6.25
> 2.6.26
> 2.6.9_U4
> 2.6.9_U5
> 2.6.9_U6
> 2.6.9_U7
> 
> - Jack


From monis at Voltaire.COM  Wed May  6 09:06:52 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 06 May 2009 19:06:52 +0300
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to
	monitor rdma_cm connections
In-Reply-To: <20090501213652.GO32114@obsidianresearch.com>
References: <49F05AAE.4020606@Voltaire.COM>	<90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com>	<49F42D40.5000200@Voltaire.COM>
	<49F5A2EC.3050807@Voltaire.com>	<49F5AED6.4070208@Voltaire.COM>
	<49F5AFEA.5090003@voltaire.com>	<20090427162349.GI4431@obsidianresearch.com>	<49F9A729.3090904@voltaire.com>
	<20090501213652.GO32114@obsidianresearch.com>
Message-ID: <4A01B59C.5080301@Voltaire.COM>

Jason Gunthorpe wrote:
> On Thu, Apr 30, 2009 at 04:27:05PM +0300, Or Gerlitz wrote:
>> Jason Gunthorpe wrote:
>>> including a PID is not best, you should include enough information to 
>>> figure out the pid(s) from proc/xx/fd, and vice versa.
> 
>> maybe its not the best solution but it seems to me good enough
> 
> Well, we have to live with these interfaces literally forever,
> shortcuts ultimately just cause more problems down the road..
> 
> Really, the thinking should be 'I want to make lsof work usefully' not
> 'I want some random and different hack to let me see something'. And
> yes, that is harder. But the IB stack is now at the point where these
> small hard things are the sort of work that is needed to get parity
> with the other stuff in linux..
> 
> Jason
> 
debugfs is a common way to export data from the kernel to users (IPoIB uses it) and it has its advantages. On the other hand, netlink has its disadvantages, so I don't think that debugfs is the wrong way. It's just another way.

Besides, remember that rdmacm is only aware of some of the open QPs on the host, which may confuse someone reading the output of lsof ("I know that there is an open QP but I don't see it in the list").


From weiny2 at llnl.gov  Wed May  6 09:33:47 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 09:33:47 -0700
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090506100744.GB10145@sk>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090506100744.GB10145@sk>
Message-ID: <20090506093347.bb1b56be.weiny2@llnl.gov>

On Wed, 6 May 2009 13:07:44 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 14:29 Thu 30 Apr     , Ira Weiny wrote:
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Wed, 29 Apr 2009 10:15:55 -0700
> > Subject: [PATCH] Add combined routing support to libibnetdisc
> > 
> >    Also allow a scan to start at a switch.
> > 
> > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> > ---
> >  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   28 ++++++++++++++++++------
> >  1 files changed, 21 insertions(+), 7 deletions(-)
> > 
> > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> > index 0ff5134..fc19633 100644
> > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> > @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport)
> >  }
> >  
> >  static int
> > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
> > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
> >  {
> > -	int rc = add_port_to_dpath(path, nextport);
> > -	if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
> > -		f->fabric.maxhops_discovered = path->cnt;
> > +	int rc = 0;
> > +
> > +	if (portid->lid && !portid->drpath.drslid) {
> > +		/* If we were LID routed
> > +		 * AND have not done so already
> > +		 * we need to set up the drslid
> > +		 */
> > +		ib_portid_t selfportid = { 0 };
> > +		if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0)
> > +			return -1;
> 
> And wouldn't it be better instead of resolving selfport on each
> extend_path() call to keep it already resolved somewhere in fabric
> structure?

This will only happen 1 time for each fabric being scanned because the path is
reused...

Oh wait a minute, I just reviewed the code...  For the current use case the
path is reused since I am only scanning 1 node.  However, in the general case
this is not true.  Sorry about that.  A new patch is below.
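
The caching Sasha suggests boils down to "resolve once, keep the result
in the fabric structure". A minimal sketch of that pattern with
stand-in types (struct fabric_ctx, resolve_self() and the placeholder
LID are all made up for illustration; the real resolver is
ib_resolve_self_via()):

```c
#include <stdint.h>

/* Hypothetical stand-ins for the fabric structure and the resolver */
struct fabric_ctx {
	uint16_t self_lid;	/* 0 means "not resolved yet" */
	int resolve_calls;	/* counts resolver invocations, for illustration */
};

static int resolve_self(struct fabric_ctx *f, uint16_t *lid)
{
	f->resolve_calls++;
	*lid = 0x12;		/* placeholder value */
	return 0;
}

/* Resolve the local LID on first use, then reuse the cached value */
static int get_self_lid(struct fabric_ctx *f, uint16_t *lid)
{
	if (!f->self_lid && resolve_self(f, &f->self_lid) < 0)
		return -1;
	*lid = f->self_lid;
	return 0;
}
```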

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 29 Apr 2009 10:15:55 -0700
Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   27 ++++++++++++++++++------
 infiniband-diags/libibnetdisc/src/internal.h  |    1 +
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 0ff5134..1e93ff8 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -177,11 +177,25 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport)
 }
 
 static int
-extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
+extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
 {
-	int rc = add_port_to_dpath(path, nextport);
-	if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
-		f->fabric.maxhops_discovered = path->cnt;
+	int rc = 0;
+
+	if (portid->lid) {
+		/* If we were LID routed we need to set up the drslid */
+		if (!f->selfportid.lid)
+			if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
+					f->fabric.ibmad_port) < 0)
+				return -1;
+
+		portid->drpath.drslid = f->selfportid.lid;
+		portid->drpath.drdlid = 0xFFFF;
+	}
+
+	rc = add_port_to_dpath(&portid->drpath, nextport);
+
+	if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
+		f->fabric.maxhops_discovered = portid->drpath.cnt;
 	return (rc);
 }
 
@@ -447,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
 			!= IB_PORT_PHYS_STATE_LINKUP)
 		return -1;
 
-	if (extend_dpath(fabric, &path->drpath, portnum) < 0)
+	if (extend_dpath(fabric, path, portnum) < 0)
 		return -1;
 
 	if (query_node(fabric, &node_buf, &port_buf, path)) {
@@ -546,8 +560,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms,
 	if (!port)
 		IBPANIC("out of memory");
 
-	if (node->node.type != IB_NODE_SWITCH &&
-	    get_remote_node(fabric, node, port, from,
+	if(get_remote_node(fabric, node, port, from,
 				mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
 				0) < 0)
 		return ((ibnd_fabric_t *)fabric);
diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
index 4e6bb18..5785e33 100644
--- a/infiniband-diags/libibnetdisc/src/internal.h
+++ b/infiniband-diags/libibnetdisc/src/internal.h
@@ -88,6 +88,7 @@ struct ibnd_fabric {
 	struct ibnd_node *switches;
 	struct ibnd_node *ch_adapters;
 	struct ibnd_node *routers;
+	ib_portid_t selfportid;
 };
 #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Wed May  6 09:51:14 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 09:51:14 -0700
Subject: [ofa-general] [PATCH] Fix 2 formatting diff's from old
	ibqueryerrors.
Message-ID: <20090506095114.3893f4aa.weiny2@llnl.gov>

2 changes I noted in the output from ibqueryerrors.

"Link Info:" was not being printed when "-r" was used.

The "header": Errors for 0x<guid> "<node name>"

Should only be printed when errors are found.

The following patch cleans those up.

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Tue, 28 Apr 2009 14:39:11 -0700
Subject: [PATCH] Fix 2 formatting diff's from old ibqueryerrors.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibqueryerrors.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c
index 09861be..70c3d48 100644
--- a/infiniband-diags/src/ibqueryerrors.c
+++ b/infiniband-diags/src/ibqueryerrors.c
@@ -123,7 +123,6 @@ print_port_config(ibnd_node_t *node, int portnum)
 	char speed_msg[256];
 	char ext_port_str[256];
 	int iwidth, ispeed, istate, iphystate;
-	int n = 0;
 
 	ibnd_port_t *port = node->ports[portnum];
 
@@ -140,7 +139,7 @@ print_port_config(ibnd_node_t *node, int portnum)
 	width_msg[0] = '\0';
 	speed_msg[0] = '\0';
 
-	n = snprintf(link_str, 256, "(%3s %s %6s/%8s)",
+	snprintf(link_str, 256, "(%3s %s %6s/%8s)",
 		mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth),
 		mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed),
 		mad_dump_val(IB_PORT_STATE_F, state, 64, &istate),
@@ -177,9 +176,9 @@ print_port_config(ibnd_node_t *node, int portnum)
 		ext_port_str[0] = '\0';
 
 	if (node->type == IB_NODE_SWITCH)
-		printf("       %6d", node->smalid);
+		printf("       Link info: %6d", node->smalid);
 	else
-		printf("       %6d", port->base_lid);
+		printf("       Link info: %6d", port->base_lid);
 
 	printf("%4d[%2s] ==%s==>  %s",
 		port->portnum, ext_port_str, link_str, remote_str);
@@ -211,7 +210,7 @@ report_suppressed(void)
 }
 
 static void
-print_results(ibnd_node_t *node, uint8_t *pc, int portnum)
+print_results(ibnd_node_t *node, uint8_t *pc, int portnum, int *header_printed)
 {
 	char buf[1024];
 	char *str = buf;
@@ -237,7 +236,6 @@ print_results(ibnd_node_t *node, uint8_t *pc, int portnum)
 
 	/* if we found errors. */
 	if (n != 0) {
-		char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc);
 		if (data_counters)
 			for (i = IB_PC_XMT_BYTES_F; i <= IB_PC_RCV_PKTS_F; i++) {
 				uint64_t val64 = 0;
@@ -247,17 +245,21 @@ print_results(ibnd_node_t *node, uint8_t *pc, int portnum)
 						mad_field_name(i), val64);
 			}
 
-		printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename);
-		printf("   GUID 0x%" PRIx64 " port %d:%s\n",
-			node->guid, portnum, str);
+		if (!*header_printed) {
+			char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc);
+			printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename);
+			*header_printed = 1;
+			free(nodename);
+		}
+
+		printf("   GUID 0x%" PRIx64 " port %d:%s\n", node->guid, portnum, str);
 		if (port_config)
 			print_port_config(node, portnum);
-		free(nodename);
 	}
 }
 
 static void
-print_port(ibnd_node_t *node, int portnum)
+print_port(ibnd_node_t *node, int portnum, int *header_printed)
 {
 	uint8_t pc[1024];
 	uint16_t cap_mask;
@@ -291,7 +293,7 @@ print_port(ibnd_node_t *node, int portnum)
 		uint32_t foo = 0;
 		mad_encode_field(pc, IB_PC_XMT_WAIT_F, &foo);
 	}
-	print_results(node, pc, portnum);
+	print_results(node, pc, portnum, header_printed);
 
 cleanup:
 	free(nodename);
@@ -300,6 +302,7 @@ cleanup:
 void
 print_node(ibnd_node_t *node, void *user_data)
 {
+	int header_printed = 0;
 	int p = 0;
 	int startport = 1;
 
@@ -311,7 +314,7 @@ print_node(ibnd_node_t *node, void *user_data)
 
 	for (p = startport; p <= node->numports; p++) {
 		if (node->ports[p]) {
-			print_port(node, p);
+			print_port(node, p, &header_printed);
 		}
 	}
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Wed May  6 09:53:03 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 09:53:03 -0700
Subject: [ofa-general] [PATCH] Clean up printing of switch heading when
 printing "down links" only.
Message-ID: <20090506095303.f11659f1.weiny2@llnl.gov>

Another corner case:  If there are no down links on a switch and "-d" is selected then the header for that switch should not be printed.

Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 30 Apr 2009 13:41:38 -0700
Subject: [PATCH] Clean up printing of switch heading when printing "down links" only.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/iblinkinfo.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 2454bf2..cf38ecb 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -205,13 +205,8 @@ void
 print_switch(ibnd_node_t *node, void *user_data)
 {
 	int i = 0;
-
-	if (!line_mode) {
-		char *remap = remap_node_name(node_name_map, node->guid,
-					node->nodedesc);
-		printf("Switch 0x%016"PRIx64" %s:\n", node->guid, remap);
-		free(remap);
-	}
+	int head_print = 0;
+	char *remap = remap_node_name(node_name_map, node->guid, node->nodedesc);
 
 	for (i = 1; i <= node->numports; i++) {
 		ibnd_port_t *port = node->ports[i];
@@ -219,9 +214,14 @@ print_switch(ibnd_node_t *node, void *user_data)
 			continue;
 		if (!down_links_only ||
 				mad_get_field(port->info, 0, IB_PORT_STATE_F) == IB_LINK_DOWN) {
+			if (!head_print && !line_mode) {
+				printf("Switch 0x%016"PRIx64" %s:\n", node->guid, remap);
+				head_print = 1;
+			}
 			print_port(node, port);
 		}
 	}
+	free(remap);
 }
 
 void
-- 
1.5.4.5


From jgunthorpe at obsidianresearch.com  Wed May  6 10:38:50 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 11:38:50 -0600
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to
	monitor rdma_cm connections
In-Reply-To: <4A01B59C.5080301@Voltaire.COM>
References: <49F05AAE.4020606@Voltaire.COM>
	<90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com>
	<49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com>
	<49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com>
	<20090427162349.GI4431@obsidianresearch.com>
	<49F9A729.3090904@voltaire.com>
	<20090501213652.GO32114@obsidianresearch.com>
	<4A01B59C.5080301@Voltaire.COM>
Message-ID: <20090506173850.GJ2590@obsidianresearch.com>

On Wed, May 06, 2009 at 07:06:52PM +0300, Moni Shoua wrote:
> > Really the thinking should be 'I want to make lsof work usefully' not
> > 'I want some random and different hack to let me see something'. And
> > yes, that is harder. But the IB stack is now at the point where these
> > small hard things are the sort of work that is needed to get parity
> > with the other stuff in linux..

> debugfs is a common way to export data from kernel to users (IPoIB
> uses it) and it has its advantages. On the other hand, netlink has
> its disadvantages, so I don't think that debugfs is the wrong
> way. It's just another way.

Gah, no! Debugfs is NOT meant to be used for users, it is for kernel
debugging. It is specifically not a stable API and commonly used user
space apps should not rely on it. This is why the distros don't mount
it by default.

Viewing the active QPs and RDMA CM connections is not kernel
debugging, it is necessary data for end user app debugging.

> Besides, remember that rdmacm is only aware of part of the opened
> QPs on the host which may lead to a confusion for one who reads the
> output of lsof ("I know that there is an open QP but I don't see it
> in the list")

Sure, but clearly the end user desire is to make lsof work properly
with all the new objects the IB and verbs APIs introduce to the
kernel. I don't think your patch advances that goal at all.

Jason


From jsquyres at cisco.com  Wed May  6 10:42:37 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 6 May 2009 13:42:37 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <4A0199FC.4000006@mellanox.co.il>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<4A0199FC.4000006@mellanox.co.il>
Message-ID: <F81BCAD0-EA30-4D48-AB36-B2A76E47EDC0@cisco.com>

On May 6, 2009, at 10:09 AM, Tziporet Koren wrote:

> I think the new proposal is good (but I am not MPI expert)
> If we implement it soon we will be able to enable it in OFED 1.5 too
>

That sounds good, as long as we don't diverge from upstream (like what  
happened with XRC).

> I think the cache in libibverbs can be delayed since it can be added
> after the API in the kernel is available
>


Fair enough.

-- 
Jeff Squyres
Cisco Systems


From rdreier at cisco.com  Wed May  6 13:08:31 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 13:08:31 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com> (Jeff Squyres's
	message of "Tue, 5 May 2009 16:57:09 -0400")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
Message-ID: <adafxfit2o0.fsf@cisco.com>

 > Roland and I chatted on the phone today; I think I now understand
 > Roland's counter-proposal (I clearly didn't before).  Let me try to
 > summarize:
 > 
 > 1. Add a new verb for "set this userspace flag to 1 if mr X ever
 > becomes invalid"
 > 2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
 > (i.e., remove the effects of #1)
 > 3. Add run-time query indicating whether #1 works
 > 4. Add [optional] memory registration caching to libibverbs

Looking closer at how to actually implement this, I see that the MMU
notifiers (cf <linux/mmu_notifier.h>) may be called with locks held, so
the kernel can't do a put_user() or the equivalent from the notifier.
Therefore I think the interface we would expose to userspace would be
something more like mmap() on some special file to get some kernel
memory mapped into userspace, and then ioctl() to register/unregister a
"set this flag if address range X...Y is affected."

To be honest I don't really love this idea -- the kernel still needs a
fairly complicated data structure to efficiently track the address
ranges being tracked, the size of the mmap() limits the number of ranges
being tracked based on a static limit set at initialization time (or
handling multiple maps gets still more complex), and there is some
careful thinking required to make sure there are no memory ordering or
cache aliasing issues.

So then I thought some about how to implement the full MR cache in the
kernel.  And that fairly quickly gets into some complex stuff as well --
for example, since we can't take sleeping locks from MMU notifiers, but
we can't hold non-sleeping locks across MR register operations, we need
to drop our MR cache lock while registering things, which forces us to
deal with rolling back registrations if we miss the cache initially but
then find that another thread has already added a registration to the
cache while we were trying to register the same memory.  Keeping the
actual MR caching in userspace does seem to make things simpler because
the locking is much easier without having to worry about sleeping
vs. non-sleeping locks.

Also doing the cache in userspace with my flag idea above has the nice
property that the fast path of hitting the cache on memory registration
has no system call and in fact testing the flag may even be a CPU cache
hit if memory registration is a hot enough path.  Doing it in the kernel
means even the best case has a system call -- which is very cheap with
current CPUs but still a non-zero cost.

So I'm really not sure what the right way to go is yet.  Further
opinions would be helpful.

 - R.


From rdreier at cisco.com  Wed May  6 13:10:47 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 13:10:47 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com> (Jeff Squyres's
	message of "Tue, 5 May 2009 16:57:09 -0400")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
Message-ID: <adabpq6t2k8.fsf@cisco.com>

By the way, what's the desired behavior of the cache if a process
registers, say, address range 0x1000 ... 0x3fff, and then the same
process registers address range 0x2000 ... 0x2fff (with all the same
permissions, etc)?

The initial registration creates an MR that is still valid for the
smaller virtual address range, so the second registration is much
cheaper if we used the cached registration; but if we use the cache for
the second registration, and then deregister the first one, we're stuck
with a too-big range pinned in the cache because of the second
registration.

 - R.


From jgunthorpe at obsidianresearch.com  Wed May  6 14:46:28 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 15:46:28 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adabpq6t2k8.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
Message-ID: <20090506214628.GM2590@obsidianresearch.com>

On Wed, May 06, 2009 at 01:10:47PM -0700, Roland Dreier wrote:
> By the way, what's the desired behavior of the cache if a process
> registers, say, address range 0x1000 ... 0x3fff, and then the same
> process registers address range 0x2000 ... 0x2fff (with all the same
> permissions, etc)?
> 
> The initial registration creates an MR that is still valid for the
> smaller virtual address range, so the second registration is much
> cheaper if we used the cached registration; but if we use the cache for
> the second registration, and then deregister the first one, we're stuck
> with a too-big range pinned in the cache because of the second
> registration.

Yuk, doesn't this problem pretty much doom this method entirely? You
can't tear down the entire registration of 0x1000 ... 0x3fff if the app
does something to change 0x2000 .. 0x2fff because it may have active
RDMAs going on in 0x1000 ... 0x1fff.

The above could happen through strange use of brk.

What about a slightly different twist.. Instead of trying to make
everything synchronous in the mmu_notifier, just have a counter mapped
to user space. Increment the counter whenever the mms change from the
notifier. Pin the user page that contains the single counter upon
starting the process so access is lockless.

In user space, check the counter before every cache lookup and if it
has changed call back into the kernel to resynchronize the MR tables in
the HCA to the current VM.

Avoids the locking and racing problems, keeps the fast path in the
user space and avoids the above question about how to deal with
arbitrary actions?
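
A minimal sketch of the user-space side of this counter check, in C. Everything here is an assumption made for illustration: vm_generation stands in for the pinned, kernel-mapped counter page, simulate_vm_change() stands in for the MMU notifier, and mr_cache_resync() is a placeholder for whatever call would resynchronize the MR tables. None of these names exist in libibverbs or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the pinned page the kernel would map into user space. */
static volatile uint64_t vm_generation;

/* Hypothetical per-process cache state. */
struct mr_cache {
    uint64_t seen_generation;  /* counter value at the last resync */
    unsigned resync_calls;     /* how many times the slow path ran */
};

/* Simulates the MMU notifier bumping the counter on a VM change. */
static void simulate_vm_change(void)
{
    vm_generation++;
}

/* Placeholder slow path: a real one would call into the kernel. */
static void mr_cache_resync(struct mr_cache *c, uint64_t gen)
{
    c->resync_calls++;
    c->seen_generation = gen;
}

/*
 * Run before every cache lookup. Returns 1 when the fast path applies
 * (counter unchanged, no system call) and 0 when a resync was needed.
 */
static int mr_cache_check(struct mr_cache *c)
{
    assert(c != NULL);
    uint64_t now = vm_generation;

    if (now == c->seen_generation)
        return 1;
    mr_cache_resync(c, now);
    return 0;
}
```

The fast path is a single load and compare, which is the property the proposal relies on; how the slow path actually resynchronizes is left open here.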

Jason


From rdreier at cisco.com  Wed May  6 14:56:25 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 14:56:25 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090506214628.GM2590@obsidianresearch.com> (Jason Gunthorpe's
	message of "Wed, 6 May 2009 15:46:28 -0600")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
Message-ID: <adatz3xsxo6.fsf@cisco.com>

 > Yuk, doesn't this problem pretty much doom this method entirely? You
 > can't tear down the entire registration of 0x1000 ... 0x3fff if the app
 > does something to change 0x2000 .. 0x2fff because it may have active
 > RDMAs going on in 0x1000 ... 0x1fff.

Yes, I guess if we try to reuse registrations like this then we run into
trouble.  I think your example points to a problem if an app registers
0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.

But we can get around this just by not ever reusing registrations that
way -- only treat something as a cache hit if it matches the start and
length exactly.
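
A minimal sketch of the exact-match rule, in C. The structure and function names (mr_cache_entry, mr_cache_insert, mr_cache_lookup) are hypothetical and only illustrate the idea; a real cache would also store the MR handle and access flags and would need locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry; a real one would also hold the MR handle. */
struct mr_cache_entry {
    uintptr_t start;
    size_t    length;
    int       in_use;
};

#define MR_CACHE_SLOTS 16
static struct mr_cache_entry mr_cache[MR_CACHE_SLOTS];

/* Record a registration; returns NULL if the table is full. */
static struct mr_cache_entry *mr_cache_insert(uintptr_t start, size_t length)
{
    assert(length > 0);
    for (int i = 0; i < MR_CACHE_SLOTS; i++) {
        struct mr_cache_entry *e = &mr_cache[i];
        if (!e->in_use) {
            e->start = start;
            e->length = length;
            e->in_use = 1;
            return e;
        }
    }
    return NULL;
}

/* A hit requires the same start AND the same length; a request that is
 * merely contained in a cached range is treated as a miss. */
static struct mr_cache_entry *mr_cache_lookup(uintptr_t start, size_t length)
{
    for (int i = 0; i < MR_CACHE_SLOTS; i++) {
        struct mr_cache_entry *e = &mr_cache[i];
        if (e->in_use && e->start == start && e->length == length)
            return e;
    }
    return NULL;
}
```

With 0x1000...0x3fff cached (start 0x1000, length 0x3000), a lookup for 0x2000...0x2fff (start 0x2000, length 0x1000) misses and gets its own registration, so deregistering the larger range never leaves a smaller user holding it.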

 > What about a slightly different twist.. Instead of trying to make
 > everything synchronous in the mmu_notifier, just have a counter mapped
 > to user space. Increment the counter whenever the mms change from the
 > notifier. Pin the user page that contains the single counter upon
 > starting the process so access is lockless.
 > 
 > In user space, check the counter before every cache lookup and if it
 > has changed call back into the kernel to resynchronize the MR tables in
 > the HCA to the current VM.
 > 
 > Avoids the locking and racing problems, keeps the fast path in the
 > user space and avoids the above question about how to deal with
 > arbitrary actions?

I like the simplicity of the fast path.  But it seems the slow path
would be hard -- how exactly did you envision resynchronizing the MR
tables?  (Considering that RDMAs might be in flight for MRs that weren't
changed by the MM operations)

 - R.


From landman at scalableinformatics.com  Wed May  6 15:26:40 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 06 May 2009 18:26:40 -0400
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't
	mount	NFS/RDMA	partition]]
In-Reply-To: <49F1FCEF.3030305@mellanox.com>
References: <49F02280.7010005@ext.bull.net>	<49F07710.3070002@opengridcomputing.com>	<49F19ECE.9080007@ext.bull.net>	<49F1CF6C.3090703@opengridcomputing.com>
	<49F1FCEF.3030305@mellanox.com>
Message-ID: <4A020EA0.5030605@scalableinformatics.com>

Vu Pham wrote:
> Hi Celine,
> 
> What HCA do you have on your system? Is it ConnectX? If yes, what is its 
> firmware version?

I am seeing this also on a server with ConnectX and a client with mthca.

My mount hangs:

  /sbin/mount.nfs 10.1.1.2:/data /data -o rdma,intr,port=2050
^C

Leaving this in the logs:

May  6 18:14:03 dv3 kernel: [ 9997.015209] rpcrdma: connection to 10.1.1.2:2050 on mthca0, memreg 6 slots 32 ird 4
May  6 18:14:03 dv3 kernel: [ 9997.015582] rpcrdma: connection to 10.1.1.2:2050 closed (-103)

rdma seems to work

root at dv3:~# ib_rdma_bw -b -i 2
6222: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=1 | cma=0 |
6222: Local address:  ...
6222: Remote address: ...

6222: Bandwidth peak (#0 to #245): 1765.83 MB/sec
6222: Bandwidth average: 1724.45 MB/sec
6222: Service Demand peak (#0 to #245): 884 cycles/KB
6222: Service Demand Avg  : 906 cycles/KB

root at dv3:~# showmount -e 10.1.1.2
Export list for 10.1.1.2:
/data *

On the server side, I see

May  6 14:07:53 jr4 mountd[5673]: authenticated mount request from 10.1.1.1:940 for /data (/data)

On server for rping
[root at jr4 ~]# rping -s
cq completion failed status 4
wait for RDMA_READ_COMPLETE state 10

on the client side for rping

root at dv3:~# rping -S 100 -d -v -c -a 10.1.1.2
verbose
client
created cm_id 0x606690
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x606690 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x606690 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x608be0
created channel 0x6068c0
created cq 0x608c30
created qp 0x608d50
rping_setup_buffers called on cb 0x605010
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x606690 (parent)
ESTABLISHED
rmda_connect successful
RDMA addr 60a8d0 rkey 116003d len 100
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x606690 (parent)
client DISCONNECT EVENT...
wait for RDMA_WRITE_ADV state 6
cq completion failed status 5
rping_free_buffers called on cb 0x605010
destroy cm_id 0x606690

Any hints on the 103 error?  I have 2.6.000 firmware on the ConnectX.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From jgunthorpe at obsidianresearch.com  Wed May  6 15:26:38 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 16:26:38 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adatz3xsxo6.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
Message-ID: <20090506222638.GA16280@obsidianresearch.com>

On Wed, May 06, 2009 at 02:56:25PM -0700, Roland Dreier wrote:
>  > Yuk, doesn't this problem pretty much doom this method entirely? You
>  > can't tear down the entire registration of 0x1000 ... 0x3fff if the app
>  > does something to change 0x2000 .. 0x2fff because it may have active
>  > RDMAs going on in 0x1000 ... 0x1fff.
> 
> Yes, I guess if we try to reuse registrations like this then we run into
> trouble.  I think your example points to a problem if an app registers
> 0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
> and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.
> 
> But we can get around this just by not ever reusing registrations that
> way -- only treat something as a cache hit if it matches the start and
> length exactly.

I can't comment on that, but it feels to me like a reasonable MPI use
model would be to do small IOs randomly from the same allocation, and
pre-hint to the library it wants that whole area cached in one shot.

>  > What about a slightly different twist.. Instead of trying to make
>  > everything synchronous in the mmu_notifier, just have a counter mapped
>  > to user space. Increment the counter whenever the mms change from the
>  > notifier. Pin the user page that contains the single counter upon
>  > starting the process so access is lockless.
>  > 
>  > In user space, check the counter before every cache lookup and if it
>  > has changed call back into the kernel to resynchronize the MR tables in
>  > the HCA to the current VM.
>  > 
>  > Avoids the locking and racing problems, keeps the fast path in the
>  > user space and avoids the above question about how to deal with
>  > arbitrary actions?
> 
> I like the simplicity of the fast path.  But it seems the slow path
> would be hard -- how exactly did you envision resynchronizing the MR
> tables?  (Considering that RDMAs might be in flight for MRs that weren't
> changed by the MM operations)

Well, this conceptually doesn't seem hard. Go through all the pages in
the MR, if any have changed then pin the new page and replace the
page's physical address in the HCA's page table. Once done, synchronize
with the hardware, then run through again and un-pin and release all
the replaced pages.

Every HCA must have the necessary primitives for this to support
register and unregister...

An RDMA that is in progress to any page that is replaced is a
'use after free' type programming error. (And this means certain wacky
uses, like using MAP_FIXED on memory that is under active RDMA,
would be unsupported without an additional call)

Doing this on a page by page basis rather than on a registration by
registration basis is granular enough to avoid the problem you
noticed.

The mmu notifiers can optionally make note of the affected pages
during the callback to reduce the workload of the syscall.

Only part I don't immediately see is how to trap creation of new VM
(ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..

Jason


From rdreier at cisco.com  Wed May  6 15:39:54 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 15:39:54 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090506222638.GA16280@obsidianresearch.com> (Jason Gunthorpe's
	message of "Wed, 6 May 2009 16:26:38 -0600")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
Message-ID: <adaprelsvnp.fsf@cisco.com>

 > Well, this conceptually doesn't seem hard. Go through all the pages in
 > the MR, if any have changed then pin the new page and replace the
 > pages physical address in the HCA's page table. Once done, synchronize
 > with the hardware, then run through again and un-pin and release all
 > the replaced pages.
 > 
 > Every HCA must have the necessary primitives for this to support
 > register and unregister...

No... every HCA just needs to support register and unregister.  It
doesn't have to support changing the mapping without full unregister and
reregister.

Also this requires potentially walking the page tables of the entire
process, checking to see if any mappings have changed.  We really want
to keep the information that the MMU notifiers give us, namely which
virtual address range is changing.

 > The mmu notifiers can optionally make note of the affected pages
 > during the callback to reduce the workload of the syscall.

This requires an unbounded amount of events to be queued up in the
kernel, naively.  (If we lose some events then we have to go back to the
full page table scan, which I don't think is feasible)

 > Only part I don't immediately see is how to trap creation of new VM
 > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..

Why do we care?  The initial faulting in of mappings occurs when an MR
is created.

 - R.


From jgunthorpe at obsidianresearch.com  Wed May  6 17:02:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 18:02:31 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adaprelsvnp.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
Message-ID: <20090507000231.GB16280@obsidianresearch.com>

On Wed, May 06, 2009 at 03:39:54PM -0700, Roland Dreier wrote:
>  > Well, this conceptually doesn't seem hard. Go through all the pages in
>  > the MR, if any have changed then pin the new page and replace the
>  > pages physical address in the HCA's page table. Once done, synchronize
>  > with the hardware, then run through again and un-pin and release all
>  > the replaced pages.
>  > 
>  > Every HCA must have the necessary primitives for this to support
>  > register and unregister...
> 
> No... every HCA just needs to support register and unregister.  It
> doesn't have to support changing the mapping without full unregister and
> reregister.

Well, I would imagine this entire process to be a HCA specific
operation, so HW that supports a better method can use it, otherwise
it has to register/unregister. Is this a concern today with existing
HCAs?

Using register/unregister exposes a race for the original case you
brought up - but that race is completely unfixable without hardware
support. At least it now becomes a hw specific race that can be
printk'd and someday fixed in new HW rather than an unsolvable API
problem..

> Also this requires potentially walking the page tables of the entire
> process, checking to see if any mappings have changed.  We really want
> to keep the information that the MMU notifiers give us, namely which
> virtual address range is changing.

Walking the page tables of every registration in the process, not the
entire process.

>  > The mmu notifiers can optionally make note of the affected pages
>  > during the callback to reduce the workload of the syscall.
 
> This requires an unbounded amount of events to be queued up in the
> kernel, naively.  (If we lose some events then we have to go back to the
> full page table scan, which I don't think is feasible)

I was thinking more along the lines of having the mmu notifiers put
affected registrations on a per-process (or PD?) dirty linked list,
with the link pointers as part of the registration structure. Set a
dirty flag in the registration too. An extra pointer per registration
and a minor incremental cost to the existing work the mmu notifier
would have to do.
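
A minimal sketch of that bookkeeping, in C. The names (struct registration, struct dirty_list, note_invalidate) are hypothetical, and the locking a real MMU notifier callback would need is omitted. The point it illustrates is that the dirty flag keeps each registration on the list at most once, so the list is bounded by the number of registrations rather than by the number of invalidate events.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration record with the link pointer embedded. */
struct registration {
    uintptr_t start;
    size_t length;
    int dirty;                        /* already on the dirty list? */
    struct registration *next_dirty;  /* link used only while dirty */
};

/* Hypothetical per-process (or per-PD) list head. */
struct dirty_list {
    struct registration *head;
    size_t count;
};

/* What the notifier callback would do for an invalidated address range:
 * queue every overlapping registration that is not queued yet. */
static void note_invalidate(struct dirty_list *dl,
                            struct registration *regs, size_t nregs,
                            uintptr_t start, size_t length)
{
    assert(length > 0);
    for (size_t i = 0; i < nregs; i++) {
        struct registration *r = &regs[i];
        int overlaps = r->start < start + length &&
                       start < r->start + r->length;

        if (!overlaps || r->dirty)
            continue;
        r->dirty = 1;
        r->next_dirty = dl->head;
        dl->head = r;
        dl->count++;
    }
}
```

Two invalidations that hit the same registration leave one entry on the list, which is what avoids the unbounded event queue.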

>  > Only part I don't immediately see is how to trap creation of new VM
>  > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
> 
> Why do we care?  The initial faulting in of mappings occurs when an MR
> is created.

Well, exactly, that's the problem. If you can't trap mmap you cannot
do the initial faulting and mapping for a new object that is being
mapped into an existing MR.

Consider:

  void *a = mmap(0,PAGE_SIZE..);
  ibv_register();
  // [..]
  munmap(a);
  ibv_synchronize();

  // At this point we want the HCA mapping to point to oblivion

  mmap(a,PAGE_SIZE,MAP_FIXED);
  ibv_synchronize();

  // And now we want it to point to the new allocation

I use MAP_FIXED to illustrate the point, but Jeff has said the same
address re-use happens randomly in real apps.

This is the main deviation from your original idea, instead of having
a granular notification to userspace to unregister a region, the
kernel just goes and fixes it up so the existing registration still
works.

This method avoids the problem you noticed, but there is extra work to
fixup a registration that may never be used again. I strongly suspect
that in the majority of cases this extra work should be about on the
same order as userspace calling unregister on the MR.

Or, ignore the overlapping problem, and use your original technique,
slightly modified:
 - Userspace registers a counter with the kernel. Kernel pins the
   page, sets up mmu notifiers and increments the counter when
   invalidates intersect with registrations
 - Kernel maintains a linked list of registrations that have been
   invalidated via mmu notifiers using the registration structure
   and a dirty bit
 - Userspace checks the counter at every cache hit, if different it
   calls into the kernel:
       MR_Cookie *mrs[100];
       int rc = ibv_get_invalid_mrs(mrs,100);
       invalidate_cache(mrs,rc);
       // Repeat until drained

   get_invalid_mrs traverses the linked list and returns an
   identifying value to userspace, which looks it up in the cache,
   calls unregister and removes it from the cache.
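
A minimal sketch of the "repeat until drained" loop, in C. The kernel side is simulated: sim_get_invalid_mrs() plays the role of the proposed ibv_get_invalid_mrs(), which is only an idea from this thread and not an existing libibverbs call, and MR_Cookie here is a made-up structure with a cached flag standing in for a cache entry.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up cookie type; 'cached' stands in for presence in the MR cache. */
typedef struct MR_Cookie {
    int id;
    int cached;
} MR_Cookie;

/* Simulated kernel-side list of invalidated registrations. */
#define MAX_DIRTY 256
static MR_Cookie *dirty[MAX_DIRTY];
static int dirty_count;

/* Simulates the MMU notifier queuing an invalidated registration. */
static void sim_mark_dirty(MR_Cookie *mr)
{
    assert(dirty_count < MAX_DIRTY);
    dirty[dirty_count++] = mr;
}

/* Simulated stand-in for the proposed call: hands back up to 'max'
 * invalidated cookies and removes them from the kernel-side list. */
static int sim_get_invalid_mrs(MR_Cookie **out, int max)
{
    int n = 0;

    while (n < max && dirty_count > 0)
        out[n++] = dirty[--dirty_count];
    return n;
}

/* A real version would also unregister each MR before dropping it. */
static void invalidate_cache(MR_Cookie **mrs, int n)
{
    for (int i = 0; i < n; i++)
        mrs[i]->cached = 0;
}

/* Fetch fixed-size batches until a short batch shows the list is empty. */
static int drain_invalid_mrs(void)
{
    MR_Cookie *batch[100];
    int total = 0;
    int rc;

    do {
        rc = sim_get_invalid_mrs(batch, 100);
        invalidate_cache(batch, rc);
        total += rc;
    } while (rc == 100);
    return total;
}
```

A full batch is taken as a hint that more entries may be waiting, so the loop only stops once a call returns fewer than 100 cookies.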

Jason


From weiny2 at llnl.gov  Wed May  6 18:01:40 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 18:01:40 -0700
Subject: [ofa-general] [RFC][PATCH] ibnetdiscover: remove report of max hops
	discovered.
In-Reply-To: <f0e08f230905051125k4ca6ab45q58ec46e9385df9ba@mail.gmail.com>
References: <20090504151005.9a565bc5.weiny2@llnl.gov>
	<f0e08f230905050338m4d11c0e9j205c514468e856ef@mail.gmail.com>
	<1241543312.18144.18.camel@auk31.llnl.gov>
	<f0e08f230905051125k4ca6ab45q58ec46e9385df9ba@mail.gmail.com>
Message-ID: <20090506180140.6213971e.weiny2@llnl.gov>

The number reported as "max hops" from ibnetdiscover can change depending on
the algorithm used to discover the fabric.  As Hal says in the message below
using this number is therefore dangerous.

If no one is currently using this number I propose the patch below which
removes the "max hops discovered" from the output.

Ira

On Tue, 5 May 2009 14:25:32 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> Hi Al,
> 
> On Tue, May 5, 2009 at 1:08 PM, Al Chu <chu11 at llnl.gov> wrote:

[snip]

> >
> > Ira says that the output of the hops is actually "max hops used to get
> > from my port to another port during my search of the network".  So the
> > number could change if (hypothetical example) depth-first-search were
> > used instead of BFS.
> 
> Sure; it can depend on how the search is done but isn't it the max
> from the initiated node (which could be different depending on the
> algo used) ? Using that number seems dangerous for that very reason. I
> always thought that number was "nice" to have but nothing more. It
> predated my work on ibnetdiscover.
> 
> -- Hal
> 


From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 6 May 2009 17:56:23 -0700
Subject: [PATCH] ibnetdiscover: remove report of max hops discovered.


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibnetdiscover.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 1799618..89e4f0f 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -448,7 +448,6 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 	struct iter_user_data iter_user_data;
 
 	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
-	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
 	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
 		fabric->from_node->guid,
 		mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F));
-- 
1.5.4.5


From IMCEAEX-_O=QLOGIC_OU=SPG_CN=RECIPIENTS_CN=KSHARMA at qlogic.com  Wed May  6 22:03:51 2009
From: IMCEAEX-_O=QLOGIC_OU=SPG_CN=RECIPIENTS_CN=KSHARMA at qlogic.com (Karun Sharma (Contractor - GS Labs))
Date: Thu, 7 May 2009 00:03:51 -0500
Subject: [ofa-general] SDP error
In-Reply-To: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
References: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E8736567@MNEXMB1.qlogic.org>

1. Make sure the ib_sdp kernel module is loaded.
2. Run "export LD_PRELOAD=<path>/libsdp.so", where <path> is /usr/lib64 on 64-bit systems.
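
One way to check step 1 independently of libsdp is to open an SDP socket
directly. The sketch below assumes that SDP uses address family 27, the
value commonly defined as AF_INET_SDP in OFED; please verify it against the
headers on your system. The function name sdp_probe is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: OFED's SDP uses address family 27; check your SDP headers. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/*
 * Try to open an SDP stream socket directly.
 * Returns 0 on success, otherwise the errno value set by socket().
 */
static int sdp_probe(void)
{
	int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

A return value of EAFNOSUPPORT (errno 97 on Linux, "Address family not
supported by protocol") matches the errors quoted below and would suggest
that the kernel has no handler registered for that address family.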

Thanks
Karun
________________________________
From: general-bounces at lists.openfabrics.org [general-bounces at lists.openfabrics.org] On Behalf Of anthony garnier [sokar6012 at hotmail.com]
Sent: Tuesday, May 05, 2009 6:23 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] SDP error

Hello,

I'm running Debian 5.0 with OFED 1.4. RDMA works very well, but when I try to use the SDP protocol with ssh, Netperf, or a simple client-server program in C, I get socket errors like these:

NetPIPE: can't open stream socket! errno=97   (for NetPIPE)

Address family not supported by protocol ssh      (for ssh)

Address family not supported by protocol   (for client-server)

Does anyone recognize these errors?


From dorons at voltaire.com  Thu May  7 00:39:36 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 07 May 2009 10:39:36 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
In-Reply-To: <f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
Message-ID: <4A029038.2040603@voltaire.com>

When max_op_vls is set to 0, do not force it to 1.
0 is a valid value which means "No change".
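
As a rough illustration of the resulting order of checks, here is a
standalone sketch. The function name clamp_op_vls and its parameters are
hypothetical; the real logic lives in osm_physp_calc_link_op_vls() and
takes its limit from p_subn->opt.max_op_vls.

```c
#include <stdint.h>

/*
 * Sketch of the clamping order after this patch (hypothetical helper).
 * link_op_vls: value computed for the link; max_op_vls: user limit.
 */
static uint8_t clamp_op_vls(uint8_t link_op_vls, uint8_t max_op_vls)
{
	if (link_op_vls > max_op_vls)
		return max_op_vls;	/* user limit wins; 0 means "No change" */
	if (link_op_vls == 0)
		return 1;		/* non-compliant value: fall back to VL0 */
	return link_op_vls;
}
```

With the limit set to 0, any nonzero link value is reduced to 0 and left
alone, whereas previously the second check would have turned it back
into 1.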

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_port.c   |    4 ++--
 opensm/opensm/osm_subnet.c |    8 ++++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..3679f29 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -379,8 +379,8 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 	/* support user limitation of max_op_vls */
 	if (op_vls > p_subn->opt.max_op_vls)
 		op_vls = p_subn->opt.max_op_vls;
-
-	if (op_vls == 0) {
+	else if (op_vls == 0) {
+		/* for non compliant implementations */
 		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
 		op_vls = 1;
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..71fc7a0 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# switch port connected to a CA or router port\n"
 		"leaf_head_of_queue_lifetime 0x%02x\n\n"
 		"# Limit the maximal operational VLs\n"
+		"# Virtual Lanes operational on this port\n"
+		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
+		"#    0: No change; valid only on Set()\n"
+		"#    1: VL0\n"
+		"#    2: VL0, VL1\n"
+		"#    3: VL0 - VL3\n"
+		"#    4: VL0 - VL7\n"
+		"#    5: VL0 - VL14\n"
 		"max_op_vls %u\n\n"
 		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
 		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
-- 
1.5.4


From ogerlitz at voltaire.com  Thu May  7 01:06:56 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 07 May 2009 11:06:56 +0300
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to
	monitor rdma_cm connections
In-Reply-To: <20090501213652.GO32114@obsidianresearch.com>
References: <49F05AAE.4020606@Voltaire.COM>	<90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com>	<49F42D40.5000200@Voltaire.COM>
	<49F5A2EC.3050807@Voltaire.com>	<49F5AED6.4070208@Voltaire.COM>
	<49F5AFEA.5090003@voltaire.com>	<20090427162349.GI4431@obsidianresearch.com>
	<49F9A729.3090904@voltaire.com>
	<20090501213652.GO32114@obsidianresearch.com>
Message-ID: <4A0296A0.3090308@voltaire.com>

Jason Gunthorpe wrote:
> Well, we have to live with these interfaces literally forever, shortcuts ultimately just cause more problems down the road.. Really the thinking should be 'I want to make lsof work usefully' not
> 'I want some random and different hack to let me see something'. And yes, that is harder. But the IB stack is now at the point where these small hard things are the sort of work that is needed to get parity with the other stuff in linux...
Jason,

As Moni stated, this isn't a shortcut; it's one solution to the problem
of the user being unable to see their rdma-cm based sessions. For the
time being, I believe that debugfs can do the job of raising the visibility
of rdma connections from zero to something one can work with, which isn't
random and isn't a hack.  A more sophisticated, netlink based solution is
possible; we'd love to see patches from other people doing that.

Or.


From vlad at lists.openfabrics.org  Thu May  7 03:23:53 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu,  7 May 2009 03:23:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090507-0200 daily build status
Message-ID: <20090507102353.258F0E614EB@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From sashak at voltaire.com  Thu May  7 04:33:45 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 7 May 2009 14:33:45 +0300
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to
	libibnetdisc
In-Reply-To: <20090506093347.bb1b56be.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090506100744.GB10145@sk>
	<20090506093347.bb1b56be.weiny2@llnl.gov>
Message-ID: <20090507113345.GA19236@sk>

On 09:33 Wed 06 May     , Ira Weiny wrote:
> 
> This will only happen 1 time for each fabric being scanned because the path is
> reused...
> 
> Oh wait a minute, I just reviewed the code...  For the current use case the
> path is reused since I am only scanning 1 node.  However, in the general case
> this is not true.  Sorry about that.  A new patch is below.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 29 Apr 2009 10:15:55 -0700
> Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

(Please send a v2 patch as a separate email; then I will not need to
edit/merge commit messages, potentially misinterpreting them :)).

Sasha


From jackm at dev.mellanox.co.il  Thu May  7 05:01:16 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 7 May 2009 15:01:16 +0300
Subject: [ofa-general] [PATCH] mlx4: fix fast registration implementation
Message-ID: <200905071501.17670.jackm@dev.mellanox.co.il>

The low-level driver modified the page-list addresses for FRWR post send
to big-endian, and set a "present" bit. This caused problems later when the
ULP attempted to unmap the pages in the page-list (using the list addresses which
were assumed to be still in CPU-endian order).

The cause of the crash was found by Vu Pham of Mellanox.
The fix is along the lines suggested by Steve Wise in comment #21 in Bugzilla 1571.

This patch fixes Bugzilla 1571.
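
The idea of the fix can be sketched in user-space C as follows: the
caller's list stays in CPU byte order and the converted entries go into a
separate buffer. This is only an illustration under stated assumptions:
fill_device_list and PAGE_PRESENT_FLAG are hypothetical names, and
htobe64() from glibc's <endian.h> stands in for the kernel's cpu_to_be64().

```c
#define _DEFAULT_SOURCE		/* for htobe64()/be64toh() in glibc */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's "present" flag. */
#define PAGE_PRESENT_FLAG 0x1ULL

/*
 * Write big-endian, flagged copies of cpu_list[] into dev_list[].
 * cpu_list[] itself is never modified, so the caller can still use
 * its addresses later (for example to unmap the pages).
 */
static void fill_device_list(const uint64_t *cpu_list, uint64_t *dev_list,
			     size_t n)
{
	assert(cpu_list != dev_list);
	for (size_t i = 0; i < n; i++)
		dev_list[i] = htobe64(cpu_list[i] | PAGE_PRESENT_FLAG);
}
```

Keeping two buffers costs one extra allocation per page list, but it means
the conversion for the hardware no longer changes data owned by the ULP.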

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

---
Roland, please take this for kernel 2.6.30.

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 9974e88..a8c0bc4 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -86,6 +86,7 @@ struct mlx4_ib_mr {
 
 struct mlx4_ib_fast_reg_page_list {
 	struct ib_fast_reg_page_list	ibfrpl;
+	u64				*mapped_page_list;
 	dma_addr_t			map;
 };
 
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 8e4d26d..fddf583 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -231,16 +231,22 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
 	if (!mfrpl)
 		return ERR_PTR(-ENOMEM);
 
-	mfrpl->ibfrpl.page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
+	mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL);
+	if (!mfrpl->ibfrpl.page_list)
+		goto err_free;
+
+	mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
 						     size, &mfrpl->map,
 						     GFP_KERNEL);
-	if (!mfrpl->ibfrpl.page_list)
+	if (!mfrpl->mapped_page_list)
-		goto err_free;
+		goto err_free_mfrpl;
 
 	WARN_ON(mfrpl->map & 0x3f);
 
 	return &mfrpl->ibfrpl;
 
+err_free_mfrpl:
+	kfree(mfrpl->ibfrpl.page_list);
 err_free:
 	kfree(mfrpl);
 	return ERR_PTR(-ENOMEM);
@@ -252,8 +258,9 @@ void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
 	struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list);
 	int size = page_list->max_page_list_len * sizeof (u64);
 
-	dma_free_coherent(&dev->dev->pdev->dev, size, page_list->page_list,
+	dma_free_coherent(&dev->dev->pdev->dev, size, mfrpl->mapped_page_list,
 			  mfrpl->map);
+	kfree(mfrpl->ibfrpl.page_list);
 	kfree(mfrpl);
 }
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index f385a24..20724ae 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1365,7 +1365,8 @@ static void set_fmr_seg(struct mlx4_wqe_fmr_seg *fseg, struct ib_send_wr *wr)
 	int i;
+	struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(wr->wr.fast_reg.page_list);
 
 	for (i = 0; i < wr->wr.fast_reg.page_list_len; ++i)
-		wr->wr.fast_reg.page_list->page_list[i] =
+		mfrpl->mapped_page_list[i] =
 			cpu_to_be64(wr->wr.fast_reg.page_list->page_list[i] |
 				    MLX4_MTT_FLAG_PRESENT);
 

From hnrose at comcast.net  Thu May  7 06:00:53 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 7 May 2009 09:00:53 -0400
Subject: [ofa-general] [PATCH 2/2] opensm/osm_console.c: Add dump and clear
	redir perfmgr command support
Message-ID: <20090507130053.GB1093@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
Changes since v1:
Changes based on changes to PerfMgr redir support in v3 patch
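
Note on the <nodename|nodeguid> argument: the console code in this patch
decides between a name and a GUID by calling strtoull() and then testing
"guid == 0 && errno". Since errno is not cleared by a successful call and
is not guaranteed to be set when no digits are parsed, a stricter check may
be worth considering. The following is a minimal sketch with a hypothetical
helper name (parse_guid); it is not part of the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Return 1 and store the value in *guid if the whole token is a number
 * (decimal, octal or 0x-prefixed hex); return 0 if it should be treated
 * as a node name instead.
 */
static int parse_guid(const char *token, uint64_t *guid)
{
	char *end;
	unsigned long long val;

	errno = 0;			/* strtoull() never clears errno itself */
	val = strtoull(token, &end, 0);
	if (errno != 0 || end == token || *end != '\0')
		return 0;
	*guid = (uint64_t)val;
	return 1;
}
```

One consequence of this approach is that a node whose name consists only of
digits would still be looked up as a GUID, so callers may want to fall back
to a name lookup when the GUID is not found.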

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index d351261..30ddd53 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -211,7 +211,7 @@ static void help_dump_conf(FILE *out, int detail)
 static void help_perfmgr(FILE * out, int detail)
 {
 	fprintf(out,
-		"perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n");
+		"perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n");
 	if (detail) {
 		fprintf(out,
 			"perfmgr -- print the performance manager state\n");
@@ -225,6 +225,10 @@ static void help_perfmgr(FILE * out, int detail)
 			"   [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n");
 		fprintf(out,
 			"   [print_counters <nodename|nodeguid>] -- print the counters for the specified node\n");
+		fprintf(out,
+			"   [dump_redir [<nodename|nodeguid>]] -- dump the redirection table\n");
+		fprintf(out,
+			"   [clear_redir [<nodename|nodeguid>]] -- clear the redirection table\n");
 	}
 }
 #endif				/* ENABLE_OSM_PERF_MGR */
@@ -1135,6 +1139,152 @@ static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 }
 
 #ifdef ENABLE_OSM_PERF_MGR
+static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm,
+					   char *nodename)
+{
+	cl_map_item_t *item;
+	monitored_node_t *node;
+
+	item = cl_qmap_head(&p_osm->perfmgr.monitored_map);
+	while (item != cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+		node = (monitored_node_t *)item;
+		if (strcmp(node->name, nodename) == 0)
+			return node;
+		item = cl_qmap_next(item);
+	}
+
+	return NULL;
+}
+
+static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm,
+					   uint64_t guid)
+{
+	cl_map_item_t *node;
+
+	node = cl_qmap_get(&p_osm->perfmgr.monitored_map, guid);
+	if (node != cl_qmap_end(&p_osm->perfmgr.monitored_map))
+		return (monitored_node_t *)node;
+
+	return NULL;
+}
+
+static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out)
+{
+	int port, redir;
+
+	/* only display monitored nodes with redirection info */
+	redir = 0;
+	for (port = (p_mon_node->esp0) ? 0 : 1;
+	     port < p_mon_node->num_ports; port++) {
+		if (p_mon_node->port[port].redirection) {
+			if (!redir) {
+				fprintf(out, "   Node GUID       ESP0   Name\n");
+				fprintf(out, "   ---------       ----   ----\n");
+				fprintf(out, "   0x%" PRIx64 " %d      %s\n",
+					p_mon_node->guid, p_mon_node->esp0,
+					p_mon_node->name);
+				fprintf(out, "\n   Port Valid  LIDs     PKey  QP    PKey Index\n");
+				fprintf(out, "   ---- -----  ----     ----  --    ----------\n");
+				redir = 1;
+			}
+			fprintf(out, "   %d    %d      %u->%u  0x%x 0x%x   %d\n",
+				port, p_mon_node->port[port].valid,
+				cl_ntoh16(p_mon_node->port[port].orig_lid),
+				cl_ntoh16(p_mon_node->port[port].lid),
+				cl_ntoh16(p_mon_node->port[port].pkey),
+				cl_ntoh32(p_mon_node->port[port].qp),
+				p_mon_node->port[port].pkey_ix);
+		}
+	}
+	if (redir)
+		fprintf(out, "\n");
+}
+
+static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out)
+{
+	monitored_node_t *p_mon_node;
+	uint64_t guid;
+
+	if (!p_osm->subn.opt.perfmgr_redir)
+		fprintf(out, "Perfmgr redirection not enabled\n");
+
+	fprintf(out, "\nRedirection Table\n");
+	fprintf(out, "-----------------\n");
+	cl_plock_acquire(p_osm->perfmgr.lock);
+	if (nodename) {
+		guid = strtoull(nodename, NULL, 0);
+		if (guid == 0 && errno)
+			p_mon_node = find_node_by_name(p_osm, nodename);
+		else
+			p_mon_node = find_node_by_guid(p_osm, guid);
+		if (p_mon_node)
+			dump_redir_entry(p_mon_node, out);
+		else {
+			if (guid == 0 && errno)
+				fprintf(out, "Node %s not found...\n", nodename);
+			else
+				fprintf(out, "Node 0x%" PRIx64 " not found...\n", guid);
+		}
+	} else {
+		p_mon_node = (monitored_node_t *) cl_qmap_head(&p_osm->perfmgr.monitored_map);
+		while (p_mon_node != (monitored_node_t *) cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+			dump_redir_entry(p_mon_node, out);
+			p_mon_node = (monitored_node_t *) cl_qmap_next((const cl_map_item_t *)p_mon_node);
+		}
+	}
+	cl_plock_release(p_osm->perfmgr.lock);
+}
+
+static void clear_redir_entry(monitored_node_t *p_mon_node)
+{
+	int port;
+	ib_net16_t orig_lid;
+
+	for (port = (p_mon_node->esp0) ? 0 : 1;
+	     port < p_mon_node->num_ports; port++) {
+		if (p_mon_node->port[port].redirection) {
+			orig_lid = p_mon_node->port[port].orig_lid;
+			memset(&p_mon_node->port[port], 0,
+			       sizeof(monitored_port_t));
+			p_mon_node->port[port].valid = TRUE;
+			p_mon_node->port[port].orig_lid = orig_lid;
+		}
+	}
+}
+
+static void clear_redir(osm_opensm_t * p_osm, char *nodename, FILE * out)
+{
+	monitored_node_t *p_mon_node;
+	uint64_t guid;
+
+	if (!p_osm->subn.opt.perfmgr_redir)
+		fprintf(out, "Perfmgr redirection not enabled\n");
+
+	cl_plock_acquire(p_osm->perfmgr.lock);
+	if (nodename) {
+		guid = strtoull(nodename, NULL, 0);
+		if (guid == 0 && errno)
+			p_mon_node = find_node_by_name(p_osm, nodename);
+		else
+			p_mon_node = find_node_by_guid(p_osm, guid);
+		if (p_mon_node)
+			clear_redir_entry(p_mon_node);
+		else {
+			if (guid == 0 && errno)
+				fprintf(out, "Node %s not found...\n", nodename);
+			else
+				fprintf(out, "Node 0x%" PRIx64 " not found...\n", guid);
+		}
+	} else {
+		p_mon_node = (monitored_node_t *) cl_qmap_head(&p_osm->perfmgr.monitored_map);
+		while (p_mon_node != (monitored_node_t *) cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+			clear_redir_entry(p_mon_node);
+			p_mon_node = (monitored_node_t *) cl_qmap_next((const cl_map_item_t *)p_mon_node);
+		}
+	}
+	cl_plock_release(p_osm->perfmgr.lock);
+}
+
 static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 {
 	char *p_cmd;
@@ -1167,6 +1317,12 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
 				fprintf(out,
 					"print_counters requires a node name or node GUID to be specified\n");
 			}
+		} else if (strcmp(p_cmd, "dump_redir") == 0) {
+			p_cmd = name_token(p_last);
+			dump_redir(p_osm, p_cmd, out);
+		} else if (strcmp(p_cmd, "clear_redir") == 0) {
+			p_cmd = name_token(p_last);
+			clear_redir(p_osm, p_cmd, out);
 		} else if (strcmp(p_cmd, "sweep_time") == 0) {
 			p_cmd = next_token(p_last);
 			if (p_cmd) {


From hnrose at comcast.net  Thu May  7 05:59:18 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 7 May 2009 08:59:18 -0400
Subject: [ofa-general] [PATCH] opensm/PerfMgr: Better redirection support
Message-ID: <20090507125918.GA1093@comcast.net>


Handle PKey and QPN redirection information
GID redirection handling remains

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

---
Changes since v2:
Use OpenSM DB rather than vendor layer for local port number and PKeys
Change most log levels from ERROR to VERBOSE
Redirection info validity now determined by single flag
validate_redir_pkey returns pkey index or -1 rather than boolean
Removed redir_ prefixes

Changes since v1:
Added include of osm_helper.h to osm_perfmgr.c
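
To illustrate the "pkey index or -1" convention mentioned in the v2
changes, here is a simplified sketch. It assumes a flat table of P_Keys in
host byte order, whereas the patch looks the key up in the OpenSM PKey
table and computes block * IB_NUM_PKEY_ELEMENTS_IN_BLOCK + index. The names
find_pkey_index and PKEY_BASE_MASK are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Low 15 bits are the P_Key base; the top bit is the membership bit. */
#define PKEY_BASE_MASK 0x7fff

/*
 * Return the index of the first table entry whose base matches the base
 * of pkey, or -1 if there is none. Values are in host byte order here.
 */
static int16_t find_pkey_index(const uint16_t *table, size_t len,
			       uint16_t pkey)
{
	for (size_t i = 0; i < len && i <= INT16_MAX; i++)
		if ((table[i] & PKEY_BASE_MASK) == (pkey & PKEY_BASE_MASK))
			return (int16_t)i;
	return -1;
}
```

Returning the index rather than a boolean lets the caller pass it straight
on as the pkey_ix of the outgoing MAD, with -1 reserved for "not usable".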

diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h
index 855a2ff..70d68f0 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -90,11 +90,17 @@ typedef enum {
 	PERFMGR_SWEEP_SUSPENDED
 } osm_perfmgr_sweep_state_t;
 
-/* Redirection information */
-typedef struct redir {
-	ib_net16_t redir_lid;
-	ib_net32_t redir_qp;
-} redir_t;
+typedef struct monitored_port {
+	uint16_t pkey_ix;
+	ib_net16_t orig_lid;
+	boolean_t redirection;
+	boolean_t valid;
+	/* Redirection fields from ClassPortInfo */
+	ib_gid_t gid;
+	ib_net16_t lid;
+	ib_net16_t pkey;
+	ib_net32_t qp;
+} monitored_port_t;
 
 /* Node to store information about nodes being monitored */
 typedef struct monitored_node {
@@ -104,7 +110,7 @@ typedef struct monitored_node {
 	boolean_t esp0;
 	char *name;
 	uint32_t num_ports;
-	redir_t redir_port[1];	/* redirection on a per port basis */
+	monitored_port_t port[1];
 } monitored_node_t;
 
 struct osm_opensm;
@@ -135,6 +141,8 @@ typedef struct osm_perfmgr {
 	uint32_t max_outstanding_queries;
 	cl_qmap_t monitored_map;	/* map the nodes being tracked */
 	monitored_node_t *remove_list;
+	ib_net64_t port_guid;
+	int16_t local_port;
 } osm_perfmgr_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index ecfdbda..9c47a8f 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -64,6 +64,7 @@
 #include <opensm/osm_log.h>
 #include <opensm/osm_node.h>
 #include <opensm/osm_opensm.h>
+#include <opensm/osm_helper.h>
 
 #define PERFMGR_INITIAL_TID_VALUE 0xcafe
 
@@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 	uint8_t port = context->perfmgr_context.port;
 	cl_map_item_t *p_node;
 	monitored_node_t *p_mon_node;
+	ib_net16_t orig_lid;
 
 	OSM_LOG_ENTER(pm->log);
 
@@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 				p_mon_node->num_ports);
 			goto Exit;
 		}
-		/* Clear redirection info */
-		p_mon_node->redir_port[port].redir_lid = 0;
-		p_mon_node->redir_port[port].redir_qp = 0;
+		/* Clear redirection info for this port except orig_lid */
+		orig_lid = p_mon_node->port[port].orig_lid;
+		memset(&p_mon_node->port[port], 0, sizeof(monitored_port_t));
+		p_mon_node->port[port].orig_lid = orig_lid;
+		p_mon_node->port[port].valid = TRUE;
 		cl_plock_release(pm->lock);
 	}
 
@@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, const ib_net64_t port_guid)
 		goto Exit;
 	}
 
-	bind_info.port_guid = port_guid;
+	bind_info.port_guid = pm->port_guid = port_guid;
 	bind_info.mad_class = IB_MCLASS_PERF;
 	bind_info.class_version = 1;
 	bind_info.is_responder = FALSE;
@@ -277,7 +281,6 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, const ib_net64_t port_guid)
 		OSM_LOG(pm->log, OSM_LOG_ERROR,
 			"ERR 4C04: Vendor specific bind failed (%s)\n",
 			ib_get_err_str(status));
-		goto Exit;
 	}
 
 Exit:
@@ -308,24 +311,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port)
 	ib_net32_t qp = IB_QP1;
 
 	if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
-	    mon_node->redir_port[port].redir_lid &&
-	    mon_node->redir_port[port].redir_qp)
-		qp = mon_node->redir_port[port].redir_qp;
+	    mon_node->port[port].redirection && mon_node->port[port].qp)
+		qp = mon_node->port[port].qp;
 
 	return qp;
 }
 
-/**********************************************************************
- * Given a node, a port, and an optional monitored node,
- * return the appropriate lid to query that port
- **********************************************************************/
-static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port,
-			  monitored_node_t * mon_node)
+static ib_net16_t get_base_lid(osm_node_t * p_node, uint8_t port)
 {
-	if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
-	    mon_node->redir_port[port].redir_lid)
-		return mon_node->redir_port[port].redir_lid;
-
 	switch (p_node->node_info.node_type) {
 	case IB_NODE_TYPE_CA:
 	case IB_NODE_TYPE_ROUTER:
@@ -338,12 +331,26 @@ static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port,
 }
 
 /**********************************************************************
+ * Given a node, a port, and an optional monitored node,
+ * return the lid appropriate to query that port
+ **********************************************************************/
+static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port,
+			  monitored_node_t * mon_node)
+{
+	if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
+	    mon_node->port[port].lid)
+		return mon_node->port[port].lid;
+
+	return get_base_lid(p_node, port);
+}
+
+/**********************************************************************
  * Form and send the Port Counters MAD for a single port.
  **********************************************************************/
 static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr,
 					   ib_net16_t dest_lid,
-					   ib_net32_t dest_qp, uint8_t port,
-					   uint8_t mad_method,
+					   ib_net32_t dest_qp, uint16_t pkey_ix,
+					   uint8_t port, uint8_t mad_method,
 					   osm_madw_context_t * p_context)
 {
 	ib_api_status_t status = IB_SUCCESS;
@@ -382,8 +389,7 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr,
 	p_madw->mad_addr.addr_type.gsi.remote_qp = dest_qp;
 	p_madw->mad_addr.addr_type.gsi.remote_qkey =
 	    cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY);
-	/* FIXME what about other partitions */
-	p_madw->mad_addr.addr_type.gsi.pkey_ix = 0;
+	p_madw->mad_addr.addr_type.gsi.pkey_ix = pkey_ix;
 	p_madw->mad_addr.addr_type.gsi.service_level = 0;
 	p_madw->mad_addr.addr_type.gsi.global_route = FALSE;
 	p_madw->resp_expected = TRUE;
@@ -419,6 +425,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context)
 	osm_perfmgr_t *pm = (osm_perfmgr_t *) context;
 	monitored_node_t *mon_node = NULL;
 	uint32_t num_ports;
+	int port;
 
 	OSM_LOG_ENTER(pm->log);
 
@@ -427,7 +434,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context)
 		/* if not already in map add it */
 		num_ports = osm_node_get_num_physp(node);
 		mon_node = malloc(sizeof(*mon_node) +
-				  sizeof(redir_t) * num_ports);
+				  sizeof(monitored_port_t) * num_ports);
 		if (!mon_node) {
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: "
 				"malloc failed: not handling node %s"
@@ -436,7 +443,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context)
 			goto Exit;
 		}
 		memset(mon_node, 0,
-		       sizeof(*mon_node) + sizeof(redir_t) * num_ports);
+		       sizeof(*mon_node) + sizeof(monitored_port_t) * num_ports);
 		mon_node->guid = node_guid;
 		mon_node->name = strdup(node->print_desc);
 		mon_node->num_ports = num_ports;
@@ -444,6 +451,11 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context)
 		mon_node->esp0 = (node->sw &&
 				  ib_switch_info_is_enhanced_port0(&node->sw->
 								   switch_info));
+		for (port = mon_node->esp0 ? 0 : 1; port < num_ports; port++) {
+			mon_node->port[port].orig_lid = get_base_lid(node, port);
+			mon_node->port[port].valid = TRUE;
+		}
+
 		cl_qmap_insert(&pm->monitored_map, node_guid,
 			       (cl_map_item_t *) mon_node);
 	}
@@ -500,6 +512,9 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context)
 		if (!osm_node_get_physp_ptr(node, port))
 			continue;
 
+		if (!mon_node->port[port].valid)
+			continue;
+
 		lid = get_lid(node, port, mon_node);
 		if (lid == 0) {
 			OSM_LOG(pm->log, OSM_LOG_DEBUG, "WARN: node 0x%" PRIx64
@@ -520,8 +535,10 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context)
 		OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Getting stats for node 0x%"
 			PRIx64 " port %d (lid %u) (%s)\n", node_guid, port,
 			cl_ntoh16(lid), node->print_desc);
-		status = perfmgr_send_pc_mad(pm, lid, remote_qp, port,
-					     IB_MAD_METHOD_GET, &mad_context);
+		status = perfmgr_send_pc_mad(pm, lid, remote_qp,
+					     mon_node->port[port].pkey_ix,
+					     port, IB_MAD_METHOD_GET,
+					     &mad_context);
 		if (status != IB_SUCCESS)
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C09: "
 				"Failed to issue port counter query for node 0x%"
@@ -768,6 +785,24 @@ void osm_perfmgr_process(osm_perfmgr_t * pm)
 	    pm->subn->sm_state == IB_SMINFO_STATE_NOTACTIVE)
 		perfmgr_discovery(pm->subn->p_osm);
 
+	/* if redirection enabled, determine local port */
+	if (pm->subn->opt.perfmgr_redir && pm->local_port == -1) {
+		osm_node_t *p_node;
+		osm_port_t *p_port;
+
+		CL_PLOCK_ACQUIRE(pm->sm->p_lock);
+		p_port = osm_get_port_by_guid(pm->subn, pm->port_guid);
+		if (p_port) {
+			p_node = p_port->p_node;
+			CL_ASSERT(p_node);
+			pm->local_port =
+			    ib_node_info_get_local_port_num(&p_node->node_info);
+		} else
+			OSM_LOG(pm->log, OSM_LOG_ERROR,
+				"ERR 4C87: No PerfMgr port object\n");
+		CL_PLOCK_RELEASE(pm->sm->p_lock);
+	}
+
 #if ENABLE_OSM_PERF_MGR_PROFILE
 	gettimeofday(&before, NULL);
 #endif
@@ -935,8 +970,8 @@ static int counter_overflow_32(ib_net32_t val)
  * MAD to the port.
  **********************************************************************/
 static void perfmgr_check_overflow(osm_perfmgr_t * pm,
-				   monitored_node_t * mon_node, uint8_t port,
-				   ib_port_counters_t * pc)
+				   monitored_node_t * mon_node, int16_t pkey_ix,
+				   uint8_t port, ib_port_counters_t * pc)
 {
 	osm_madw_context_t mad_context;
 	ib_api_status_t status;
@@ -963,6 +998,9 @@ static void perfmgr_check_overflow(osm_perfmgr_t * pm,
 		osm_node_t *p_node = NULL;
 		ib_net16_t lid = 0;
 
+		if (!mon_node->port[port].valid)
+			goto Exit;
+
 		osm_log(pm->log, OSM_LOG_VERBOSE,
 			"PerfMgr: Counter overflow: %s (0x%" PRIx64
 			") port %d; clearing counters\n",
@@ -987,8 +1025,9 @@ static void perfmgr_check_overflow(osm_perfmgr_t * pm,
 		mad_context.perfmgr_context.port = port;
 		mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET;
 		/* clear port counters */
-		status = perfmgr_send_pc_mad(pm, lid, remote_qp, port,
-					     IB_MAD_METHOD_SET, &mad_context);
+		status = perfmgr_send_pc_mad(pm, lid, remote_qp, pkey_ix,
+					     port, IB_MAD_METHOD_SET,
+					     &mad_context);
 		if (status != IB_SUCCESS)
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C11: "
 				"Failed to send clear counters MAD for %s (0x%"
@@ -1046,6 +1085,64 @@ static void perfmgr_log_events(osm_perfmgr_t * pm,
 			time_diff, mon_node->name, mon_node->guid, port);
 }
 
+static int16_t validate_redir_pkey(osm_perfmgr_t *pm, ib_net16_t pkey)
+{
+	int16_t pkey_ix = -1;
+	osm_port_t *p_port;
+	osm_pkey_tbl_t *p_pkey_tbl;
+	ib_net16_t *p_orig_pkey;
+	uint16_t block;
+	uint8_t index;
+
+	OSM_LOG_ENTER(pm->log);
+
+	CL_PLOCK_ACQUIRE(pm->sm->p_lock);
+	p_port = osm_get_port_by_guid(pm->subn, pm->port_guid);
+	if (!p_port) {
+		CL_PLOCK_RELEASE(pm->sm->p_lock);
+		OSM_LOG(pm->log, OSM_LOG_ERROR,
+			"ERR 4C1E: No PerfMgr port object\n");
+		goto Exit;
+	}
+	if (p_port->p_physp && osm_physp_is_valid(p_port->p_physp)) {
+		p_pkey_tbl = &p_port->p_physp->pkeys;
+		if (!p_pkey_tbl) {
+			CL_PLOCK_RELEASE(pm->sm->p_lock);
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE,
+				"No PKey table found for PerfMgr port\n");
+			goto Exit;
+		}
+		p_orig_pkey = cl_map_get(&p_pkey_tbl->keys,
+					 ib_pkey_get_base(pkey));
+		if (!p_orig_pkey) {
+			CL_PLOCK_RELEASE(pm->sm->p_lock);
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE,
+				"PKey 0x%x not found for PerfMgr port\n",
+				cl_ntoh16(pkey));
+			goto Exit;
+		}
+		if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey,
+						   &block, &index) == IB_SUCCESS) {
+			CL_PLOCK_RELEASE(pm->sm->p_lock);
+			pkey_ix = block * IB_NUM_PKEY_ELEMENTS_IN_BLOCK + index;
+		} else {
+			CL_PLOCK_RELEASE(pm->sm->p_lock);
+			OSM_LOG(pm->log, OSM_LOG_ERROR, 
+				"ERR 0x4C1F: Failed to obtain P_Key 0x%04x "
+				"block and index for PerfMgr port\n",
+				cl_ntoh16(pkey));
+		}
+	} else {
+		CL_PLOCK_RELEASE(pm->sm->p_lock);
+		OSM_LOG(pm->log, OSM_LOG_ERROR,
+			"ERR 4C20: Local PerfMgt port physp invalid\n");
+	}
+
+Exit:
+	OSM_LOG_EXIT(pm->log);
+	return pkey_ix;
+}
+
 /**********************************************************************
  * The dispatcher uses a thread pool which will call this function when
  * there is a thread available to process the mad received on the wire.
@@ -1064,6 +1161,8 @@ static void pc_recv_process(void *context, void *data)
 	perfmgr_db_data_cnt_reading_t data_reading;
 	cl_map_item_t *p_node;
 	monitored_node_t *p_mon_node;
+	int16_t pkey_ix = 0;
+	boolean_t valid = TRUE;
 
 	OSM_LOG_ENTER(pm->log);
 
@@ -1087,7 +1186,8 @@ static void pc_recv_process(void *context, void *data)
 		  p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO);
 
 	/* Response could also be redirection (IBM eHCA PMA does this) */
-	if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) {
+	if (p_mad->status & IB_MAD_STATUS_REDIRECT &&
+	    p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) {
 		char gid_str[INET6_ADDRSTRLEN];
 		ib_class_port_info_t *cpi =
 		    (ib_class_port_info_t *) &
@@ -1100,17 +1200,46 @@ static void pc_recv_process(void *context, void *data)
 			inet_ntop(AF_INET6, cpi->redir_gid.raw, gid_str,
 				  sizeof gid_str), cl_ntoh32(cpi->redir_qp));
 
-		/* LID or GID redirection ? */
-		/* For GID redirection, need to get PathRecord from SA */
+		if (!pm->subn->opt.perfmgr_redir) {
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE,
+				"Redirection requested but disabled\n");
+			valid = FALSE;
+		}
+
+		/* valid redirection ? */
 		if (cpi->redir_lid == 0) {
+			if (!ib_gid_is_notzero(&cpi->redir_gid)) {
+				OSM_LOG(pm->log, OSM_LOG_VERBOSE,
+					"Invalid redirection "
+					"(both redirect LID and GID are zero)\n");
+				valid = FALSE;
+			}
+		}
+		if (cpi->redir_qp == 0) {
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectQP\n");
+			valid = FALSE;
+		}
+		if (cpi->redir_pkey == 0) {
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectP_Key\n");
+			valid = FALSE;
+		}
+		if (cpi->redir_qkey != IB_QP1_WELL_KNOWN_Q_KEY) {
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectQ_Key\n");
+			valid = FALSE;
+		}
+
+		pkey_ix = validate_redir_pkey(pm, cpi->redir_pkey);
+		if (pkey_ix == -1) {
 			OSM_LOG(pm->log, OSM_LOG_VERBOSE,
-				"GID redirection not currently implemented!\n");
-			goto Exit;
+				"Index for Pkey 0x%x not found\n",
+				cl_ntoh16(cpi->redir_pkey));
+			valid = FALSE;
 		}
 
-		if (!pm->subn->opt.perfmgr_redir) {
-			OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: "
-				"redirection requested but disabled\n");
+		if (cpi->redir_lid == 0) {
+			/* GID redirection: get PathRecord information */
+			OSM_LOG(pm->log, OSM_LOG_VERBOSE,
+				"GID redirection not currently supported\n");
 			goto Exit;
 		}
 
@@ -1125,13 +1254,24 @@ static void pc_recv_process(void *context, void *data)
 				p_mon_node->num_ports);
 			goto Exit;
 		}
-		p_mon_node->redir_port[port].redir_lid = cpi->redir_lid;
-		p_mon_node->redir_port[port].redir_qp = cpi->redir_qp;
+		p_mon_node->port[port].redirection = TRUE;
+		p_mon_node->port[port].valid = valid;
+		memcpy(&p_mon_node->port[port].gid, &cpi->redir_gid,
+		       sizeof(ib_gid_t));
+		p_mon_node->port[port].lid = cpi->redir_lid;
+		p_mon_node->port[port].qp = cpi->redir_qp;
+		p_mon_node->port[port].pkey = cpi->redir_pkey;
+		if (pkey_ix != -1)
+			p_mon_node->port[port].pkey_ix = pkey_ix;
 		cl_plock_release(pm->lock);
 
+		if (!valid)
+			goto Exit;
+
 		/* Finally, reissue the query to the redirected location */
 		status = perfmgr_send_pc_mad(pm, cpi->redir_lid, cpi->redir_qp,
-					     port, mad_context->perfmgr_context.
+					     pkey_ix, port,
+					     mad_context->perfmgr_context.
 					     mad_method, mad_context);
 		if (status != IB_SUCCESS)
 			OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C14: "
@@ -1166,7 +1306,7 @@ static void pc_recv_process(void *context, void *data)
 		perfmgr_db_clear_prev_dc(pm->db, node_guid, port);
 	}
 
-	perfmgr_check_overflow(pm, p_mon_node, port, wire_read);
+	perfmgr_check_overflow(pm, p_mon_node, pkey_ix, port, wire_read);
 
 #if ENABLE_OSM_PERF_MGR_PROFILE
 	do {
@@ -1212,6 +1352,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm,
 	pm->sweep_time_s = p_opt->perfmgr_sweep_time_s;
 	pm->max_outstanding_queries = p_opt->perfmgr_max_outstanding_queries;
 	pm->osm = osm;
+	pm->local_port = -1;
 
 	status = cl_timer_init(&pm->sweep_timer, perfmgr_sweep, pm);
 	if (status != IB_SUCCESS)


From hal.rosenstock at gmail.com  Thu May  7 06:53:05 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 7 May 2009 09:53:05 -0400
Subject: [ofa-general] Re: [RFC][PATCH] ibnetdiscover: remove report of max
	hops discovered.
In-Reply-To: <20090506180140.6213971e.weiny2@llnl.gov>
References: <20090504151005.9a565bc5.weiny2@llnl.gov>
	<f0e08f230905050338m4d11c0e9j205c514468e856ef@mail.gmail.com>
	<1241543312.18144.18.camel@auk31.llnl.gov>
	<f0e08f230905051125k4ca6ab45q58ec46e9385df9ba@mail.gmail.com>
	<20090506180140.6213971e.weiny2@llnl.gov>
Message-ID: <f0e08f230905070653t3060a09dxc58630d683109ea6@mail.gmail.com>

On Wed, May 6, 2009 at 9:01 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> The number reported as "max hops" from ibnetdiscover can change depending on
> the algorithm used to discover the fabric.  As Hal says in the message below
> using this number is therefore dangerous.
>
> If no one is currently using this number I propose the patch below which
> removes the "max hops discovered" from the output.

If it's removed from the topology output, should there be an option
which displays this number ? It does provide some idea of the levels
in the hierarchy which can be useful when someone provides a topology
file for their subnet.

-- Hal

> Ira
>
> On Tue, 5 May 2009 14:25:32 -0400
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>> Hi Al,
>>
>> On Tue, May 5, 2009 at 1:08 PM, Al Chu <chu11 at llnl.gov> wrote:
>
> [snip]
>
>> >
>> > Ira says that the output of the hops is actually "max hops used to get
>> > from my port to another port during my search of the network".  So the
>> > number could change if (hypothetical example) depth-first-search were
>> > used instead of BFS.
>>
>> Sure; it can depend on how the search is done but isn't it the max
>> from the initiated node (which could be different depending on the
>> algo used) ? Using that number seems dangerous for that very reason. I
>> always thought that number was "nice" to have but nothing more. It
>> predated my work on ibnetdiscover.
>>
>> -- Hal
>>
>
>
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 6 May 2009 17:56:23 -0700
> Subject: [PATCH] ibnetdiscover: remove report of max hops discovered.
>
>
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/src/ibnetdiscover.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
> index 1799618..89e4f0f 100644
> --- a/infiniband-diags/src/ibnetdiscover.c
> +++ b/infiniband-diags/src/ibnetdiscover.c
> @@ -448,7 +448,6 @@ dump_topology(int group, ibnd_fabric_t *fabric)
>        struct iter_user_data iter_user_data;
>
>        fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
> -       fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
>        fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
>                fabric->from_node->guid,
>                mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F));
> --
> 1.5.4.5
>
>


From jsquyres at cisco.com  Thu May  7 06:54:26 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 7 May 2009 09:54:26 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adabpq6t2k8.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
Message-ID: <D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>

On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:

> By the way, what's the desired behavior of the cache if a process
> registers, say, address range 0x1000 ... 0x3fff, and then the same
> process registers address range 0x2000 ... 0x2fff (with all the same
> permissions, etc)?
>
> The initial registration creates an MR that is still valid for the
> smaller virtual address range, so the second registration is much
> cheaper if we used the cached registration; but if we use the cache for
> the second registration, and then deregister the first one, we're stuck
> with a too-big range pinned in the cache because of the second
> registration.
>


I don't know what the other MPI's do in this scenario, but here's what
OMPI will do:

1. lookup 0x1000-0x3fff in the cache; not find any of it, and
   therefore register
    - add each page to our cache with a refcount of 1
2. lookup 0x2000-0x2fff in the cache, find that all the pages are
   already registered
    - refcount++ on each page in the cache
3. when we go to dereg 0x1000-0x3fff
    - refcount-- on each page in the cache
    - since some pages in the range still have refcount>0, don't do
      anything further

Specifically: the actual dereg of 0x1000-0x3fff is blocked on also  
releasing 0x2000-0x2fff.

Note that OMPI will only register a max of X bytes at a time (where X
defaults to 2MB).  So even if a user calls MPI_SEND(...) with an
enormous buffer, we'll register it X/page_size pages at a time, not
the entire buffer at once.  Hence, the "buffer A is blocked from
dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less
wasteful than if we had registered/cached the entire huge buffer at
once.

Finally, note that if 0x2000-0x2fff had not been registered, the  
0x1000-0x3fff pages are not actually deregistered when all the pages'  
refcounts go to 0 -- they are just moved to the "able to be dereg'ed  
list".  We don't actually dereg it until we later try to reg new  
memory and fail due to lack of resources.  Then we take entries off  
the "able to be dereg'ed list" and dereg them, then try reg'ing the  
new memory again.
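
The scheme above can be sketched roughly as follows.  This is only a toy
model under stated assumptions, not OMPI's actual code: the names
(cache_reg, cache_dereg, cache_evict), the 4 KiB page size, the fixed-size
tables and the "pinned" flag standing in for a real HCA registration are
all made up for illustration, and partially overlapping ranges are not
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define NPAGES     16	/* toy address space: 0x0000-0xffff */
#define MAX_MRS    8	/* toy limit on cached registrations */

struct page_entry {
	int refcount;	/* active users of this page */
	int pinned;	/* nonzero while a cached registration covers it */
};

struct mr_entry {
	size_t first, last;	/* inclusive page range of one registration */
	int live;
};

static struct page_entry pages[NPAGES];
static struct mr_entry mrs[MAX_MRS];

/* Returns 1 if a new (simulated) registration was made, 0 on a cache
 * hit, -1 if the toy MR table is full. */
static int cache_reg(uintptr_t start, uintptr_t end)
{
	size_t first = start >> PAGE_SHIFT, last = end >> PAGE_SHIFT, p;
	int hit = 1, made = 0, i;

	assert(first <= last && last < NPAGES);
	for (p = first; p <= last; p++)
		if (!pages[p].pinned)
			hit = 0;
	if (!hit) {
		for (i = 0; i < MAX_MRS && mrs[i].live; i++)
			;
		if (i == MAX_MRS)
			return -1;
		mrs[i].first = first;
		mrs[i].last = last;
		mrs[i].live = 1;
		for (p = first; p <= last; p++)
			pages[p].pinned = 1;
		made = 1;
	}
	for (p = first; p <= last; p++)
		pages[p].refcount++;
	return made;
}

/* Only drops refcounts; nothing is unpinned here (lazy dereg). */
static void cache_dereg(uintptr_t start, uintptr_t end)
{
	size_t p;

	for (p = start >> PAGE_SHIFT; p <= (end >> PAGE_SHIFT); p++) {
		assert(pages[p].refcount > 0);
		pages[p].refcount--;
	}
}

/* Meant to be called when a new registration fails for lack of
 * resources: releases every registration whose pages are all unused.
 * Returns the number of registrations released. */
static int cache_evict(void)
{
	int i, freed = 0;
	size_t p;

	for (i = 0; i < MAX_MRS; i++) {
		int idle = mrs[i].live;

		for (p = mrs[i].first; idle && p <= mrs[i].last; p++)
			if (pages[p].refcount > 0)
				idle = 0;
		if (!idle)
			continue;
		for (p = mrs[i].first; p <= mrs[i].last; p++)
			pages[p].pinned = 0;
		mrs[i].live = 0;
		freed++;
	}
	return freed;
}
```

With this model, registering 0x1000-0x3fff and then 0x2000-0x2fff makes
one registration and one cache hit; after releasing the first range,
cache_evict() frees nothing until the second range is released as well.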

MVAPICH: do you guys do similar things?

(I don't know if HP/Scali/Intel will comment on their registration  
cache schemes)

-- 
Jeff Squyres
Cisco Systems


From hal.rosenstock at gmail.com  Thu May  7 06:56:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 7 May 2009 09:56:38 -0400
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to 
	libibnetdisc
In-Reply-To: <20090506093347.bb1b56be.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090506100744.GB10145@sk> <20090506093347.bb1b56be.weiny2@llnl.gov>
Message-ID: <f0e08f230905070656w384fde21ydbd20a8092b21834@mail.gmail.com>

Ira,

On Wed, May 6, 2009 at 12:33 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Wed, 6 May 2009 13:07:44 +0300
> Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
>> On 14:29 Thu 30 Apr     , Ira Weiny wrote:
>> > From: Ira Weiny <weiny2 at llnl.gov>
>> > Date: Wed, 29 Apr 2009 10:15:55 -0700
>> > Subject: [PATCH] Add combined routing support to libibnetdisc
>> >
>> >    Also allow a scan to start at a switch.
>> >
>> > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
>> > ---
>> >  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   28 ++++++++++++++++++------
>> >  1 files changed, 21 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> > index 0ff5134..fc19633 100644
>> > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> > @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport)
>> >  }
>> >
>> >  static int
>> > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
>> > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
>> >  {
>> > -   int rc = add_port_to_dpath(path, nextport);
>> > -   if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
>> > -           f->fabric.maxhops_discovered = path->cnt;
>> > +   int rc = 0;
>> > +
>> > +   if (portid->lid && !portid->drpath.drslid) {
>> > +           /* If we were LID routed
>> > +            * AND have not done so already
>> > +            * we need to set up the drslid
>> > +            */
>> > +           ib_portid_t selfportid = { 0 };
>> > +           if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0)
>> > +                   return -1;
>>
>> And wouldn't it be better instead of resolving selfport on each
>> extend_path() call to keep it already resolved somewhere in fabric
>> structure?
>
> This will only happen 1 time for each fabric being scan'ed because the path is
> reused...
>
> Oh wait a minute, I just reviewed the code...  For the current use case the
> path is reused since I am only scanning 1 node.  However, in the general case
> this is not true.  Sorry about that.  A new patch is below.

Does combined routing always fall back on failure to using directed routing ?

Also, would you summarize the use cases for combined routing in ibnetdiscover ?

-- Hal

> Ira
>
>
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Wed, 29 Apr 2009 10:15:55 -0700
> Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed.
>
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   27 ++++++++++++++++++------
>  infiniband-diags/libibnetdisc/src/internal.h  |    1 +
>  2 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index 0ff5134..1e93ff8 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -177,11 +177,25 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport)
>  }
>
>  static int
> -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport)
> +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport)
>  {
> -       int rc = add_port_to_dpath(path, nextport);
> -       if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered))
> -               f->fabric.maxhops_discovered = path->cnt;
> +       int rc = 0;
> +
> +       if (portid->lid) {
> +               /* If we were LID routed we need to set up the drslid */
> +               if (!f->selfportid.lid)
> +                       if (ib_resolve_self_via(&f->selfportid, NULL, NULL,
> +                                       f->fabric.ibmad_port) < 0)
> +                               return -1;
> +
> +               portid->drpath.drslid = f->selfportid.lid;
> +               portid->drpath.drdlid = 0xFFFF;
> +       }
> +
> +       rc = add_port_to_dpath(&portid->drpath, nextport);
> +
> +       if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered))
> +               f->fabric.maxhops_discovered = portid->drpath.cnt;
>        return (rc);
>  }
>
> @@ -447,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
>                        != IB_PORT_PHYS_STATE_LINKUP)
>                return -1;
>
> -       if (extend_dpath(fabric, &path->drpath, portnum) < 0)
> +       if (extend_dpath(fabric, path, portnum) < 0)
>                return -1;
>
>        if (query_node(fabric, &node_buf, &port_buf, path)) {
> @@ -546,8 +560,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms,
>        if (!port)
>                IBPANIC("out of memory");
>
> -       if (node->node.type != IB_NODE_SWITCH &&
> -           get_remote_node(fabric, node, port, from,
> +       if(get_remote_node(fabric, node, port, from,
>                                mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F),
>                                0) < 0)
>                return ((ibnd_fabric_t *)fabric);
> diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h
> index 4e6bb18..5785e33 100644
> --- a/infiniband-diags/libibnetdisc/src/internal.h
> +++ b/infiniband-diags/libibnetdisc/src/internal.h
> @@ -88,6 +88,7 @@ struct ibnd_fabric {
>        struct ibnd_node *switches;
>        struct ibnd_node *ch_adapters;
>        struct ibnd_node *routers;
> +       ib_portid_t selfportid;
>  };
>  #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)
>
> --
> 1.5.4.5
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From chocapiiic.tiery at gmail.com  Thu May  7 06:58:54 2009
From: chocapiiic.tiery at gmail.com (Thierry)
Date: Thu, 7 May 2009 15:58:54 +0200
Subject: [ofa-general] struct ib_sge, how to dump buffer from kernel
Message-ID: <8d9c773c0905070658q4844714fh213f383bacfa99b3@mail.gmail.com>

Hi,

I am trying to write a kernel module to monitor communication through
an InfiniBand device: my goal is to monitor as many things as I can
from kernel space.
I have implemented a simple module which can send data from kernel
space to user space using a netlink socket.
I think the best place to extract information is inside the
ib_post_send function in drivers/infiniband/hw/mlx4/qp.c, where there is
an ib_send_wr structure.

But I have some trouble reading the data from the kernel:
I have already put a trap in libibverbs (the cmd_post_send function), and
I have been able to read struct ibv_send_wr and also
ibv_send_wr->ibv_sge->addr.
But in kernel space, I can't read the data at ib_send_wr->ib_sge->addr and
I don't understand why:
I did a memcpy from addr, using the length in ib_sge->length, and then
printed it with printk %s.

Is the ib_send_wr structure a copy of ibv_send_wr, but in kernel space?
What does a memory address look like?
Do you have any references I can read in order to understand memory addressing?

Regards,

Thierry

---
char foo[wr->sg_list->length];

memcpy(foo, wr->sg_list->addr, wr->sg_list->length);

printk(KERN_INFO "buffer: %s\n", foo);


From sashak at voltaire.com  Thu May  7 04:34:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 7 May 2009 14:34:04 +0300
Subject: [ofa-general] Re: [PATCH 3/3] Modify '-S' option of iblinkinfo and
	ibqueryerrors
	to do a limited scan of the fabric first and then fall back to a
	full scan which searches for the GUID.
In-Reply-To: <20090430143002.89262384.weiny2@llnl.gov>
References: <20090430143002.89262384.weiny2@llnl.gov>
Message-ID: <20090507113404.GB19236@sk>

On 14:30 Thu 30 Apr     , Ira Weiny wrote:
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Tue, 28 Apr 2009 16:38:38 -0700
> Subject: [PATCH] Modify '-S' option of iblinkinfo and ibqueryerrors to do a limited scan of the
>  fabric first and then fall back to a full scan which searches for the GUID.
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From dorfman.eli at gmail.com  Thu May  7 07:09:39 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Thu, 07 May 2009 17:09:39 +0300
Subject: [ofa-general] Bug in opensm LID assignement
Message-ID: <4A02EBA3.70205@gmail.com>

opensm assigns conflicting LIDs to node after lmc change (e.g. 0 to 1)
when node guid is in the guid2lid cache.

In the following example CA port 1 lid 24 lmc 1
and switch lid is 25 which overlaps with CA port's lid.
This happens because switch port guid is in the guid2lid cache (0x0008f104003f2aa2 0x0019 0x0019)
 
vendid=0x2c9
devid=0x634a
sysimgguid=0x2c902002576a3
caguid=0x2c902002576a0
Ca      2 "H-0002c902002576a0"          # "FIG3 HCA-1"
[2](2c902002576a2)      "S-0008f104003f29d2"[23]                # lid 26 lmc 1 "ISR2012/ISR2004 Voltaire sLB-2024" lid 34 4xDDR
[1](2c902002576a1)      "S-0008f104003f2aa2"[23]                # lid 24 lmc 1 "ISR2012/ISR2004 Voltaire sLB-2024" lid 25 4xDDR


From sashak at voltaire.com  Thu May  7 04:52:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 7 May 2009 14:52:12 +0300
Subject: [ofa-general] Re: [PATCH] osm_port.c: do not force max_op_vls = 0 to
	1
In-Reply-To: <4A029038.2040603@voltaire.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
	<4A029038.2040603@voltaire.com>
Message-ID: <20090507115212.GC19236@sk>

Hi Doron,

On 10:39 Thu 07 May     , Doron Shoham wrote:
> when setting max_op_vls = 0
> do not force it to 1.
> 0 is valid value which means "No change"
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c   |    4 ++--
>  opensm/opensm/osm_subnet.c |    8 ++++++++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..3679f29 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -379,8 +379,8 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  	/* support user limitation of max_op_vls */
>  	if (op_vls > p_subn->opt.max_op_vls)
>  		op_vls = p_subn->opt.max_op_vls;
> -
> -	if (op_vls == 0) {
> +	else if (op_vls == 0) {
> +		/* for non compliant implementations */
>  		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>  			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>  		op_vls = 1;

I would suggest to not mix zero OperVLs workaround and max_op_vls=0
processing. Just move 'op_vls == 0' check to be above max_op_vls
comparison.

Also (need to repeat my original comment) using max_op_vls = 0 will
enforce PortInfo update attempt, which actually may not be needed if the
only "changed" field is OperVLs changed to 0 ("No change") - see the
code in osm_link_mgr.c.
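
To illustrate the ordering suggested above, here is a minimal standalone
sketch.  It is not the actual osm_physp_calc_link_op_vls(); the function
name calc_op_vls_sketch() and its plain uint8_t parameters are assumptions
made for illustration only.

```c
#include <stdint.h>

/* port_op_vls: OperVLs value as reported by the port.
 * max_op_vls:  user-configured limit, where 0 means "No change". */
static uint8_t calc_op_vls_sketch(uint8_t port_op_vls, uint8_t max_op_vls)
{
	uint8_t op_vls = port_op_vls;

	/* Workaround first: a port reporting OperVLs = 0 is treated as VL0. */
	if (op_vls == 0)
		op_vls = 1;

	/* User limitation second: with max_op_vls = 0 the result is 0,
	 * which is no longer overwritten by the workaround above. */
	if (op_vls > max_op_vls)
		op_vls = max_op_vls;

	return op_vls;
}
```

With this ordering max_op_vls = 0 passes through as 0, while a zero
OperVLs read from the port is still corrected before the comparison.
The sketch does not address the PortInfo update concern.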

Sasha


From hnrose at comcast.net  Thu May  7 07:33:46 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Thu, 7 May 2009 10:33:46 -0400
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from
	debug level log message
Message-ID: <20090507143346.GA1713@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..17bac73 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 		op_vls = p_subn->opt.max_op_vls;
 
 	if (op_vls == 0) {
-		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
+		OSM_LOG(p_log, OSM_LOG_DEBUG,
 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
 		op_vls = 1;
 	}


From dorons at voltaire.com  Thu May  7 08:51:19 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 07 May 2009 18:51:19 +0300
Subject: [ofa-general] [PATCH 0/2] osm_port.c: fix op_vls processing
In-Reply-To: <20090507115212.GC19236@sk>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>	<4A029038.2040603@voltaire.com>
	<20090507115212.GC19236@sk>
Message-ID: <4A030377.6050202@voltaire.com>


From dorons at voltaire.com  Thu May  7 08:54:36 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 07 May 2009 18:54:36 +0300
Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before
	max_op_vls comparison
In-Reply-To: <4A030377.6050202@voltaire.com>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>	<4A029038.2040603@voltaire.com>
	<20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com>
Message-ID: <4A03043C.4010709@voltaire.com>

check if op_vls = 0 before max_op_vls comparison

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_port.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..4d1bbf2 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 	} else
 		op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
 
-	/* support user limitation of max_op_vls */
-	if (op_vls > p_subn->opt.max_op_vls)
-		op_vls = p_subn->opt.max_op_vls;
-
 	if (op_vls == 0) {
+		/* for non compliant implementations */	
 		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
 		op_vls = 1;
 	}
+	/* support user limitation of max_op_vls */
+	if (op_vls > p_subn->opt.max_op_vls)
+		op_vls = p_subn->opt.max_op_vls;
+
 
 	OSM_LOG_EXIT(p_log);
 	return op_vls;
-- 
1.5.4


From dorons at voltaire.com  Thu May  7 08:55:17 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 07 May 2009 18:55:17 +0300
Subject: [ofa-general] [PATCH 0/2] osm_port.c: do not enforce PortInfo update
 if max_op_vls = 0
In-Reply-To: <4A030377.6050202@voltaire.com>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>	<4A029038.2040603@voltaire.com>
	<20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com>
Message-ID: <4A030465.90009@voltaire.com>

do not enforce PortInfo update if max_op_vls = 0

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_port.c   |    2 +-
 opensm/opensm/osm_subnet.c |    8 ++++++++
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 4d1bbf2..8bf1767 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -383,7 +383,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 		op_vls = 1;
 	}
 	/* support user limitation of max_op_vls */
-	if (op_vls > p_subn->opt.max_op_vls)
+	if (p_subn->opt.max_op_vls && op_vls > p_subn->opt.max_op_vls)
 		op_vls = p_subn->opt.max_op_vls;
 
 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..71fc7a0 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# switch port connected to a CA or router port\n"
 		"leaf_head_of_queue_lifetime 0x%02x\n\n"
 		"# Limit the maximal operational VLs\n"
+		"# Virtual Lanes operational on this port\n"
+		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
+		"#    0: No change; valid only on Set()\n"
+		"#    1: VL0\n"
+		"#    2: VL0, VL1\n"
+		"#    3: VL0 - VL3\n"
+		"#    4: VL0 - VL7\n"
+		"#    5: VL0 - VL14\n"
 		"max_op_vls %u\n\n"
 		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
 		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
-- 
1.5.4


From dorons at voltaire.com  Thu May  7 08:58:29 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 07 May 2009 18:58:29 +0300
Subject: [ofa-general] [PATCH] saquery: fix -c arguement
Message-ID: <4A030525.7090209@voltaire.com>

set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 infiniband-diags/src/saquery.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 4dcd712..2ec32cf 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1470,7 +1470,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		node_print_desc = ALL_DESC;
 		break;
 	case 'c':
-		command = CLASS_PORT_INFO;
+		command = SAQUERY_CMD_CLASS_PORT_INFO;
 		break;
 	case 'S':
 		query_type = IB_SA_ATTR_SERVICERECORD;
-- 
1.5.4


From weiny2 at llnl.gov  Thu May  7 08:58:07 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 7 May 2009 08:58:07 -0700
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to 
	libibnetdisc
In-Reply-To: <f0e08f230905070656w384fde21ydbd20a8092b21834@mail.gmail.com>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090506100744.GB10145@sk>
	<20090506093347.bb1b56be.weiny2@llnl.gov>
	<f0e08f230905070656w384fde21ydbd20a8092b21834@mail.gmail.com>
Message-ID: <20090507085807.f1e743bb.weiny2@llnl.gov>

On Thu, 7 May 2009 09:56:38 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> Ira,
> 
> On Wed, May 6, 2009 at 12:33 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > On Wed, 6 May 2009 13:07:44 +0300
> > Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >
[snip]

> >>
> >> And wouldn't it be better instead of resolving selfport on each
> >> extend_path() call to keep it already resolved somewhere in fabric
> >> structure?
> >
> > This will only happen 1 time for each fabric being scan'ed because the path is
> > reused...
> >
> > Oh wait a minute, I just reviewed the code...  For the current use case the
> > path is reused since I am only scanning 1 node.  However, in the general case
> > this is not true.  Sorry about that.  A new patch is below.
> 
> Does combined routing always fall back on failure to using directed routing ?

No, not automatically in the library.

> 
> Also, would you summarize the use cases for combined routing in ibnetdiscover ?
> 

ibnetdiscover does not use this feature.  It does a "full scan" which results
in only DR routing.

iblinkinfo and ibqueryerrors have the ability to request output for a single
node.  The library was written to be able to scan from a given portid and a
number of hops around that node.  However, at first this only supported a DR
path in the portid.  If the user specified something like a GUID, iblinkinfo
would scan the entire fabric and search the data which came back for that
node.  Of course the problem with that is that on a large fabric it could take
8 seconds to come back with a single node of data.  If the SM/SA is up and
running I decided it would be better to query for the LID of that node and
start the scan from there.  That is what this patch adds.  iblinkinfo and
ibqueryerrors will call ibnd_discover_fabric with the "from" == to the portid
resolved from the SA and "hops" == 1.  If resolving the GUID or the limited
scan fails ibqueryerrors and iblinkinfo then call the library again for a full
fabric scan ("from" == NULL) and then search for the node in the fabric data
returned.
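
The flow described above could be sketched like this.  It is a
self-contained toy, not the iblinkinfo/ibqueryerrors code: toy_resolve_guid()
stands in for the SA query, toy_discover() stands in for
ibnd_discover_fabric() with "from" set (limited scan) or NULL (full scan),
and the node table, struct layouts and lookup_node() are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct node { uint64_t guid; int lid; };
struct fabric { const struct node *nodes; size_t count; };

/* Hypothetical three-node fabric. */
static const struct node toy_nodes[] = { {0x10, 1}, {0x20, 2}, {0x30, 3} };
#define TOY_COUNT (sizeof(toy_nodes) / sizeof(toy_nodes[0]))

static int sa_available = 1;	/* pretend SM/SA state */
static int full_scans;		/* counts expensive full scans */

/* SA stand-in: GUID -> LID; fails when the SM/SA is not reachable. */
static int toy_resolve_guid(uint64_t guid, int *lid)
{
	size_t i;

	if (!sa_available)
		return -1;
	for (i = 0; i < TOY_COUNT; i++)
		if (toy_nodes[i].guid == guid) {
			*lid = toy_nodes[i].lid;
			return 0;
		}
	return -1;
}

/* Discovery stand-in: from_lid != NULL is the limited scan around one
 * LID, from_lid == NULL is the full fabric scan. */
static int toy_discover(const int *from_lid, struct fabric *out)
{
	size_t i;

	if (!from_lid) {
		full_scans++;
		out->nodes = toy_nodes;
		out->count = TOY_COUNT;
		return 0;
	}
	for (i = 0; i < TOY_COUNT; i++)
		if (toy_nodes[i].lid == *from_lid) {
			out->nodes = &toy_nodes[i];
			out->count = 1;
			return 0;
		}
	return -1;
}

static const struct node *find_node(const struct fabric *f, uint64_t guid)
{
	size_t i;

	for (i = 0; i < f->count; i++)
		if (f->nodes[i].guid == guid)
			return &f->nodes[i];
	return NULL;
}

/* Try the cheap LID-routed limited scan first; fall back to a full
 * scan plus search if resolving or the limited scan fails. */
static const struct node *lookup_node(uint64_t guid)
{
	struct fabric f;
	const struct node *n;
	int lid;

	if (toy_resolve_guid(guid, &lid) == 0 &&
	    toy_discover(&lid, &f) == 0) {
		n = find_node(&f, guid);
		if (n)
			return n;
	}
	if (toy_discover(NULL, &f) < 0)
		return NULL;
	return find_node(&f, guid);
}
```

In this toy, a lookup with the SA available never touches the full scan,
while a lookup with sa_available set to 0 takes the fallback path.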

So that is the use case for doing this in the library.  But once again
ibnetdiscover does not use this.  The other use case I could think of is doing
a more extensive scan of multiple hops around a single node.  I have not
implemented this yet but in my early testing it worked just fine starting with
a DR path.  I believe this will still work with combined routing.

Make sense?
Ira


From changquing.tang at hp.com  Thu May  7 09:07:05 2009
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 7 May 2009 16:07:05 +0000
Subject: [ofa-general] Memory registration redux
In-Reply-To: <D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
Message-ID: <58C6777539C300489D145B0F8E29C3281679DC115F@GVW0673EXC.americas.hpqcorp.net>


HP-MPI does pretty much the same thing.  --CQ
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Jeff Squyres
> Sent: Thursday, May 07, 2009 8:54 AM
> To: Roland Dreier
> Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny 
> Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; 
> Alexander Supalov
> Subject: Re: [ofa-general] Memory registration redux
> 
> On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> 
> > By the way, what's the desired behavior of the cache if a process 
> > registers, say, address range 0x1000 ... 0x3fff, and then the same 
> > process registers address range 0x2000 ... 0x2fff (with all 
> the same 
> > permissions, etc)?
> >
> > The initial registration creates an MR that is still valid for the 
> > smaller virtual address range, so the second registration is much 
> > cheaper if we used the cached registration; but if we use the cache 
> > for the second registration, and then deregister the first 
> one, we're 
> > stuck with a too-big range pinned in the cache because of 
> the second 
> > registration.
> >
> 
> 
> I don't know what the other MPI's do in this scenario, but 
> here's what OMPI will do:
> 
> 1. lookup 0x1000-0x3fff in the cache; not find any of it,
> and therefore register
>     - add each page to our cache with a refcount of 1
> 2. lookup 0x2000-0x2fff in the cache, find that all the pages
> are already registered
>     - refcount++ on each page in the cache
> 3. when we go to dereg 0x1000-0x3fff
>     - refcount-- on each page in the cache
>     - since some pages in the range still have refcount>0,
> don't do anything further
> 
> Specifically: the actual dereg of 0x1000-0x3fff is blocked on 
> also releasing 0x2000-0x2fff.
> 
> Note that OMPI will only register a max of X bytes at a time 
> (where X defaults to 2MB).  So even if a user calls 
> MPI_SEND(...) with an enormous buffer, we'll register it 
> X/page_size pages at a time, not the entire buffer at once.  
> Hence, the "buffer A is blocked from dereg'ing by buffer B" 
> scenario is *somewhat* mitigated -- it's less wasteful than 
> if we had registered/cached the entire huge buffer at once.
> 
> Finally, note that if 0x2000-0x2fff had not been registered, 
> the 0x1000-0x3fff pages are not actually deregistered when 
> all the pages'  
> refcounts go to 0 -- they are just moved to the "able to be 
> dereg'ed list".  We don't actually dereg it until we later 
> try to reg new memory and fail due to lack of resources.  
> Then we take entries off the "able to be dereg'ed list" and 
> dereg them, then try reg'ing the new memory again.
> 
> MVAPICH: do you guys do similar things?
> 
> (I don't know if HP/Scali/Intel will comment on their 
> registration cache schemes)
> 
> --
> Jeff Squyres
> Cisco Systems
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 

From koop at cse.ohio-state.edu  Thu May  7 09:55:13 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Thu, 7 May 2009 12:55:13 -0400 (EDT)
Subject: [ofa-general] Memory registration redux
In-Reply-To: <58C6777539C300489D145B0F8E29C3281679DC115F@GVW0673EXC.americas.hpqcorp.net>
Message-ID: <Pine.GSO.4.40.0905071254550.15104-100000@omicron.cse.ohio-state.edu>


MVAPICH is doing pretty much the same thing as well.

Matt

On Thu, 7 May 2009, Tang, Changqing wrote:

>
> HP-MPI is doing pretty much the same thing.  --CQ
>
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> > Jeff Squyres
> > Sent: Thursday, May 07, 2009 8:54 AM
> > To: Roland Dreier
> > Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny
> > Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General;
> > Alexander Supalov
> > Subject: Re: [ofa-general] Memory registration redux
> >
> > On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> >
> > > By the way, what's the desired behavior of the cache if a process
> > > registers, say, address range 0x1000 ... 0x3fff, and then the same
> > > process registers address range 0x2000 ... 0x2fff (with all
> > the same
> > > permissions, etc)?
> > >
> > > The initial registration creates an MR that is still valid for the
> > > smaller virtual address range, so the second registration is much
> > > cheaper if we used the cached registration; but if we use the cache
> > > for the second registration, and then deregister the first
> > one, we're
> > > stuck with a too-big range pinned in the cache because of
> > the second
> > > registration.
> > >
> >
> >
> > I don't know what the other MPI's do in this scenario, but
> > here's what OMPI will do:
> >
> > 1. lookup 0x1000-0x3fff in the cache; not find any of it,
> > and therefore register
> >     - add each page to our cache with a refcount of 1
> > 2. lookup 0x2000-0x2fff in the cache, find that all the pages
> > are already registered
> >     - refcount++ on each page in the cache
> > 3. when we go to dereg 0x1000-0x3fff
> >     - refcount-- on each page in the cache
> >     - since some pages in the range still have refcount>0,
> > don't do anything further
> >
> > Specifically: the actual dereg of 0x1000-0x3fff is blocked on
> > also releasing 0x2000-0x2fff.
> >
> > Note that OMPI will only register a max of X bytes at a time
> > (where X defaults to 2MB).  So even if a user calls
> > MPI_SEND(...) with an enormous buffer, we'll register it
> > X/page_size pages at a time, not the entire buffer at once.
> > Hence, the "buffer A is blocked from dereg'ing by buffer B"
> > scenario is *somewhat* mitigated -- it's less wasteful than
> > if we had registered/cached the entire huge buffer at once.
> >
> > Finally, note that if 0x2000-0x2fff had not been registered,
> > the 0x1000-0x3fff pages are not actually deregistered when
> > all the pages'
> > refcounts go to 0 -- they are just moved to the "able to be
> > dereg'ed list".  We don't actually dereg it until we later
> > try to reg new memory and fail due to lack of resources.
> > Then we take entries off the "able to be dereg'ed list" and
> > dereg them, then try reg'ing the new memory again.
> >
> > MVAPICH: do you guys do similar things?
> >
> > (I don't know if HP/Scali/Intel will comment on their
> > registration cache schemes)
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >


From rdreier at cisco.com  Thu May  7 14:46:55 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 May 2009 14:46:55 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090507000231.GB16280@obsidianresearch.com> (Jason Gunthorpe's
	message of "Wed, 6 May 2009 18:02:31 -0600")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
Message-ID: <adak54ssi0g.fsf@cisco.com>

 > > No... every HCA just needs to support register and unregister.  It
 > > doesn't have to support changing the mapping without full unregister and
 > > reregister.
 > 
 > Well, I would imagine this entire process to be a HCA specific
 > operation, so HW that supports a better method can use it, otherwise
 > it has to register/unregister. Is this a concern today with existing
 > HCAs?
 > 
 > Using register/unregister exposes a race for the original case you
 > brought up - but that race is completely unfixable without hardware
 > support. At least it now becomes a hw specific race that can be
 > printk'd and someday fixed in new HW rather than an unsolvable API
 > problem..

We definitely don't want to duplicate all this logic in every hardware
device driver, so most of it needs to be generic.  If we're adding new
low-level driver methods to handle this, that definitely raises the cost
of implementing all this.  But I guess if we start with a generic
register/unregister fallback that drivers can override for better
performance, then I think we're in good shape.

 > > Also this requires potentially walking the page tables of the entire
 > > process, checking to see if any mappings have changed.  We really want
 > > to keep the information that the MMU notifiers give us, namely which
 > > virtual address range is changing.
 > 
 > Walking the page tables of every registration in the process, not the
 > entire process.

Yes... but there are bugs in the bugzilla about mthca being limited to
only 8 GB of registration by default or something like that, and having
that break Intel MPI in some cases.  So some MPI jobs want to have 10s
of GBs of registered memory -- walking millions of page table entries
for every resync operation seems like a big problem to me.

Which means that the MMU notifier has to walk the list of memory
registrations and mark any affected ones as dirty (possibly with a hint
about which pages were invalidated) as you suggest below.  Falling back
to the "check every registration" ultra-slow-path I think should never
ever happen.

 > I was thinking more along the lines of having the mmu notifiers put
 > affected registrations on a per-process (or PD?) dirty linked list,
 > with the link pointers as part of the registration structure. Set a
 > dirty flag in the registration too. An extra pointer per registration
 > and a minor incremental cost to the existing work the mmu notifier
 > would have to do.

Yes, makes sense.
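
A minimal sketch of that dirty-list idea, written as plain userspace C rather
than kernel code.  struct reg, struct reg_set and mark_dirty() are
hypothetical names used only for illustration; a real implementation would
hang off the MMU notifier callbacks and need locking, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory registration; the list link lives in the structure itself. */
struct reg {
    uintptr_t start, end;      /* inclusive virtual address range */
    int dirty;                 /* set once an invalidate has touched it */
    struct reg *next_dirty;    /* link on the per-process dirty list */
};

struct reg_set {
    struct reg *regs;          /* all registrations of the process */
    size_t n;
    struct reg *dirty_head;    /* registrations needing a fixup */
};

/*
 * Called for an invalidated range [start, end]: every registration that
 * intersects the range and is not already dirty is flagged and pushed on
 * the dirty list.  Returns how many registrations were newly marked.
 */
static int mark_dirty(struct reg_set *s, uintptr_t start, uintptr_t end)
{
    int marked = 0;

    assert(s && start <= end);
    for (size_t i = 0; i < s->n; i++) {
        struct reg *r = &s->regs[i];

        if (r->end < start || r->start > end)
            continue;              /* no overlap with the invalidated range */
        if (r->dirty)
            continue;              /* already queued, nothing more to do */
        r->dirty = 1;
        r->next_dirty = s->dirty_head;
        s->dirty_head = r;
        marked++;
    }
    return marked;
}
```

A later resync only has to walk dirty_head instead of every registration,
which is the point of keeping the range information from the notifier.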

 > >  > Only part I don't immediately see is how to trap creation of new VM
 > >  > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
 > > 
 > > Why do we care?  The initial faulting in of mappings occurs when an MR
 > > is created.
 > 
 > Well, exactly, that's the problem. If you can't trap mmap you cannot
 > do the initial faulting and mapping for a new object that is being
 > mapped into an existing MR.
 > 
 > Consider:
 > 
 >   void *a = mmap(0,PAGE_SIZE..);
 >   ibv_register();
 >   // [..]
 >   munmap(a, PAGE_SIZE);
 >   ibv_synchronize();
 > 
 >   // At this point we want the HCA mapping to point to oblivion
 > 
 >   mmap(a,PAGE_SIZE,MAP_FIXED);
 >   ibv_synchronize();
 > 
 >   // And now we want it to point to the new allocation
 > 
 > I use MAP_FIXED to illustrate the point, but Jeff has said the same
 > address re-use happens randomly in real apps.
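
The address re-use itself is easy to reproduce without any verbs calls.  The
sketch below (Linux/POSIX only) maps an anonymous page, dirties it, unmaps it
and maps a new page at the same virtual address with MAP_FIXED; the new
mapping is zero-filled, so it is backed by different memory than whatever an
earlier registration would have pinned.  remap_gives_fresh_page() is a
made-up helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS under strict -std= */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if remapping at the same virtual address yields a fresh
 * zero-filled page, 0 if not, and -1 if one of the mmap/munmap calls fails.
 */
static int remap_gives_fresh_page(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t len;
    unsigned char *a, *b;
    int fresh;

    assert(psz > 0);
    len = (size_t)psz;

    a = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED)
        return -1;
    memset(a, 0xAB, len);          /* old contents; a registration would pin these pages */

    if (munmap(a, len) != 0)
        return -1;

    /* Same virtual address, new backing memory. */
    b = mmap(a, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (b == MAP_FAILED)
        return -1;

    fresh = (b == a) && b[0] == 0 && b[len - 1] == 0;
    munmap(b, len);
    return fresh;
}
```

A registration cache keyed only on the virtual address cannot tell these two
mappings apart, which is why some notification from the kernel is needed.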

This can be handled I think, although at some cost.  Just have the
kernel keep track of which MMU sequence number actually invalidated each
MR, and return (via ibv_synchronize()) the MMU change sequence number
that userspace is in sync with.  So in the example above, the first
synchronize after munmap() will fail to fix up the first registration,
since it is pointing to an unmapped virtual address, and hence it will
leave that MR on the dirty list, and return that sequence number as not
being synced up yet.  And then the second synchronize will see that MR
still on the dirty list, and try again to find the pages.

Passing the sequence number back to userspace makes it possible for
userspace to know that it still has to call ibv_synchronize() again.

There is the possibility that a 1GB MR will have its last page unmapped,
and end up having 100s of thousands of pages walked again and again in
every synchronize operation.

 > This method avoids the problem you noticed, but there is extra work to
 > fixup a registration that may never be used again. I strongly suspect
 > that in the majority of cases this extra work should be about on the
 > same order as userspace calling unregister on the MR.

Yes, also it doesn't match the current MPI way of lazily unregistering
things, and only garbage collecting the refcnt 0 cache entries when a
registration fails.  With this method, if userspace unregisters
something, it really is gone, and if it doesn't unregister it, then it
really uses up space until userspace explicitly unregisters it.  Not
sure how MPI implementers feel about that.

 > Or, ignore the overlapping problem, and use your original technique,
 > slightly modified:
 >  - Userspace registers a counter with the kernel. Kernel pins the
 >    page, sets up mmu notifiers and increments the counter when
 >    invalidates intersect with registrations
 >  - Kernel maintains a linked list of registrations that have been
 >    invalidated via mmu notifiers using the registration structure
 >    and a dirty bit
 >  - Userspace checks the counter at every cache hit, if different it
 >    calls into the kernel:
 >        MR_Cookie *mrs[100];
 >        int rc = ibv_get_invalid_mrs(mrs,100);
 >        invalidate_cache(mrs,rc);
 >        // Repeat until drained
 > 
 >    get_invalid_mrs traverses the linked list and returns an
 >    identifying value to userspace, which looks it up in the cache,
 >    calls unregister and removes it from the cache.

What's the advantage of this?  I have to do the get_invalid_mrs() call a
bunch of times, rather than just reading which ones are invalid from the
cache directly?

 - R.


From rdreier at cisco.com  Thu May  7 14:58:50 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 May 2009 14:58:50 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com> (Jeff Squyres's
	message of "Thu, 7 May 2009 09:54:26 -0400")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
Message-ID: <adafxfgshgl.fsf@cisco.com>

 > I don't know what the other MPI's do in this scenario, but here's what
 > OMPI will do:
 > 
 > 1. lookup 0x1000-0x3fff in the cache; not find any of it, and
 > therefore register
 >    - add each page to our cache with a refcount of 1
 > 2. lookup 0x2000-0x2fff in the cache, find that all the pages are
 > already registered
 >    - refcount++ on each page in the cache
 > 3. when we go to dereg 0x1000-0x3fff
 >    - refcount-- on each page in the cache
 >    - since some pages in the range still have refcount>0, don't do
 > anything further
 > 
 > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
 > releasing 0x2000-0x2fff.
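
For reference, the quoted steps can be modelled with a toy per-page cache
like the one below.  This is not OMPI code: the fixed 16-page address space,
the mr field, and the names cache_reg(), cache_dereg() and cache_reclaim()
are assumptions made for the sketch, and how a real cache handles a partially
overlapping lookup is not covered by the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     16              /* toy address space: 0x0000 - 0xffff */

struct page_entry {
    int mr;                        /* id of the covering registration, 0 = none */
    int refcount;                  /* cached users of this page */
};

static struct page_entry cache[NPAGES];
static int next_mr = 1;

/* Take a reference on [start, end]; returns how many pages were newly registered. */
static int cache_reg(uintptr_t start, uintptr_t end)
{
    int newly = 0, mr = 0;

    for (size_t p = start >> PAGE_SHIFT; p <= (end >> PAGE_SHIFT); p++) {
        assert(p < NPAGES);
        if (cache[p].mr == 0) {    /* cache miss: stands in for a real registration */
            if (mr == 0)
                mr = next_mr++;
            cache[p].mr = mr;
            newly++;
        }
        cache[p].refcount++;
    }
    return newly;
}

/* Drop a reference on [start, end]; returns how many of its pages are still in use. */
static int cache_dereg(uintptr_t start, uintptr_t end)
{
    int in_use = 0;

    for (size_t p = start >> PAGE_SHIFT; p <= (end >> PAGE_SHIFT); p++) {
        assert(p < NPAGES && cache[p].refcount > 0);
        if (--cache[p].refcount > 0)
            in_use++;
    }
    return in_use;                 /* nothing is deregistered here */
}

/*
 * Lazy cleanup, meant to run when a new registration fails for lack of
 * resources: a registration is released only once ALL of its pages have
 * refcount 0.  Returns the number of pages released.
 */
static int cache_reclaim(void)
{
    int reclaimed = 0;

    for (int mr = 1; mr < next_mr; mr++) {
        int busy = 0;

        for (int i = 0; i < NPAGES; i++)
            if (cache[i].mr == mr && cache[i].refcount > 0)
                busy = 1;
        if (busy)
            continue;
        for (int i = 0; i < NPAGES; i++)
            if (cache[i].mr == mr) {
                cache[i].mr = 0;
                reclaimed++;
            }
    }
    return reclaimed;
}
```

Running the 0x1000-0x3fff / 0x2000-0x2fff sequence through this model shows
the bigger range staying pinned until the smaller one is released as well.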

If everyone is doing this, how do you handle the case that Jason pointed
out, namely:

 * you register 0x1000 ... 0x3fff
 * you want to register 0x2000 ... 0x2fff and have a cache hit
 * you finish up with 0x1000 ... 0x3fff
 * app does something (which is valid since you finished up with the
   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
   that leads to munmap() or whatever), and your hooks tell you so.
 * app reallocates a mapping in 0x3000 ... 0x3fff
 * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
   both invalid and in-use in the cache at this point !?

 - R.


From rdreier at cisco.com  Thu May  7 15:08:54 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 May 2009 15:08:54 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: fix fast registration implementation
In-Reply-To: <200905071501.17670.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 7 May 2009 15:01:16 +0300")
References: <200905071501.17670.jackm@dev.mellanox.co.il>
Message-ID: <adabpq4sgzt.fsf@cisco.com>

OK, I guess we want the work request to be treated as read-only by the low-level
driver.  (I wasn't sure whether the fix should be in mlx4 or the
NFS/RDMA code, but OK, this approach seems better overall)

Applied.

 > +	u64				*mapped_page_list;

fixed this to __be64 to avoid sparse endianness checking problems, and
say what the code means a little better.

 - R.


From jgunthorpe at obsidianresearch.com  Thu May  7 15:48:06 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 7 May 2009 16:48:06 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adak54ssi0g.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
Message-ID: <20090507224806.GF16280@obsidianresearch.com>

On Thu, May 07, 2009 at 02:46:55PM -0700, Roland Dreier wrote:

>  > Using register/unregister exposes a race for the original case you
>  > brought up - but that race is completely unfixable without hardware
>  > support. At least it now becomes a hw specific race that can be
>  > printk'd and someday fixed in new HW rather than an unsolvable API
>  > problem..
> 
> We definitely don't want to duplicate all this logic in every hardware
> device driver, so most of it needs to be generic.  If we're adding new
> low-level driver methods to handle this, that definitely raises the cost
> of implementing all this.  But I guess if we start with a generic
> register/unregister fallback that drivers can override for better
> performance, then I think we're in good shape.

Right, I was only thinking of a new driver call that was along the
lines of update_mr_pages() that just updates the HCA's mapping with
new page table entries atomically. It really would be device
specific. If there is no call available then unregister/register +
printk log is a fair generic implementation.

To be clear, what I'm thinking is that this would only be invoked if
the VM is being *replaced*. Simply unmapping VM should do nothing.

> Which means that the MMU notifier has to walk the list of memory
> registrations and mark any affected ones as dirty (possibly with a hint
> about which pages were invalidated) as you suggest below.  Falling back
> to the "check every registration" ultra-slow-path I think should never
> ever happen.

Yikes, yes, that makes sense. And hearing that at least openmpi caps
the registration size makes me think per-page granularity is probably
unnecessary to track.

>  > Well, exactly, that's the problem. If you can't trap mmap you cannot
>  > do the initial faulting and mapping for a new object that is being
>  > mapped into an existing MR.
>  > 
>  > Consider:
>  > 
>  >   void *a = mmap(0,PAGE_SIZE..);
>  >   ibv_register();
>  >   // [..]
>  >   munmap(a, PAGE_SIZE);
>  >   ibv_synchronize();
>  > 
>  >   // At this point we want the HCA mapping to point to oblivion
>  > 
>  >   mmap(a,PAGE_SIZE,MAP_FIXED);
>  >   ibv_synchronize();
>  > 
>  >   // And now we want it to point to the new allocation
>  > 
>  > I use MAP_FIXED to illustrate the point, but Jeff has said the same
>  > address re-use happens randomly in real apps.
> 
> This can be handled I think, although at some cost.  Just have the
> kernel keep track of which MMU sequence number actually invalidated each
> MR, and return (via ibv_synchronize()) the MMU change sequence number
> that userspace is in sync with.  So in the example above, the first
> synchronize after munmap() will fail to fix up the first registration,
> since it is pointing to an unmapped virtual address, and hence it will
> leave that MR on the dirty list, and return that sequence number as not
> being synced up yet.  And then the second synchronize will see that MR
> still on the dirty list, and try again to find the pages.

I agree some kind of kernel/userspace exchange of the sequence number
is necessary to make all the locking and race conditions work out.

But the problem I'm seeing is how does the sequence number get
incremented by the kernel after the mmap() call in the above sequence?
Which mmu_notifier/etc call back do you hook for that?

The *very best* hook would be one that is called when a mm has new
virtual address space allocated and the verbs layer would then take
the allocated address range and intersect it with the registration
list. Any registrations that have pages in the allocated region are
marked invalid.

Imagine every call to ibv_synchronize was prefixed with a check that
the sequence number has changed.

>  > This method avoids the problem you noticed, but there is extra work to
>  > fixup a registration that may never be used again. I strongly suspect
>  > that in the majority of cases this extra work should be about on the
>  > same order as userspace calling unregister on the MR.
> 
> Yes, also it doesn't match the current MPI way of lazily unregistering
> things, and only garbage collecting the refcnt 0 cache entries when a
> registration fails.  With this method, if userspace unregisters
> something, it really is gone, and if it doesn't unregister it, then it
> really uses up space until userspace explicitly unregisters it.  Not
> sure how MPI implementers feel about that.

Well, mixing the lazy unregister in is not a significant change, just
don't increment the sequence number on munmap and have the kernel do
nothing until pages are mapped into an existing registration. With a
flag both behaviors are possible.

All of this work is mainly to close the hole where mapping new memory
over already registered VM results in RDMA to the wrong pages. Fixing
this hole removes the need to trap memory management syscalls and
solves that data corruption problem.

Starting from there, various optimizations can be done, like lazy garbage
collecting registrations that no longer point to mapped memory.

>  > Or, ignore the overlapping problem, and use your original technique,
>  > slightly modified:
>  >  - Userspace registers a counter with the kernel. Kernel pins the
>  >    page, sets up mmu notifiers and increments the counter when
>  >    invalidates intersect with registrations
>  >  - Kernel maintains a linked list of registrations that have been
>  >    invalidated via mmu notifiers using the registration structure
>  >    and a dirty bit
>  >  - Userspace checks the counter at every cache hit, if different it
>  >    calls into the kernel:
>  >        MR_Cookie *mrs[100];
>  >        int rc = ibv_get_invalid_mrs(mrs,100);
>  >        invalidate_cache(mrs,rc);
>  >        // Repeat until drained
>  > 
>  >    get_invalid_mrs traverses the linked list and returns an
>  >    identifying value to userspace, which looks it up in the cache,
>  >    calls unregister and removes it from the cache.
> 
> What's the advantage of this?  I have to do the get_invalid_mrs() call a
> bunch of times, rather than just reading which ones are invalid from the
> cache directly?

This is a trade-off: the above is a more normal kernel API and lets
the app get a list of changes it can scan. Having the kernel update
flags means if the app wants a list of changes it has to scan all
registrations.

Knowing the registration is no good lets you remove it from the search
list and save time on the hot path.

I imagined a call that would return as much in one go as memory is
available (ie 100 entries above) so I doubt more than one call per
event would ever be needed.
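
To make the batching concrete, here is a small userspace simulation of the
drain loop.  No such verbs call exists today: get_invalid_mrs() is a stand-in
for the proposed ibv_get_invalid_mrs(), and mr_cookie, struct invalid_queue,
drain_invalid() and count_drop() are hypothetical names for the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long mr_cookie;   /* hypothetical identifying value */

/* Simulated kernel-side list of invalidated registrations. */
struct invalid_queue {
    const mr_cookie *items;
    size_t count;
    size_t next;                   /* first entry not yet handed out */
};

/* Stand-in for the proposed call: copies out up to max cookies, returns how many. */
static int get_invalid_mrs(struct invalid_queue *q, mr_cookie *out, int max)
{
    int n = 0;

    while (n < max && q->next < q->count)
        out[n++] = q->items[q->next++];
    return n;
}

/*
 * Userspace side: fetch invalid registrations in batches of 100 and hand
 * each one to drop(), which would unregister it and remove it from the
 * cache.  Returns the total number of entries processed.
 */
static size_t drain_invalid(struct invalid_queue *q,
                            void (*drop)(mr_cookie, void *), void *arg)
{
    mr_cookie batch[100];
    size_t total = 0;
    int rc;

    assert(q && drop);
    do {
        rc = get_invalid_mrs(q, batch, 100);
        for (int i = 0; i < rc; i++)
            drop(batch[i], arg);
        total += (size_t)rc;
    } while (rc == 100);           /* a short batch means the list is drained */
    return total;
}

/* Demo callback: just counts how many entries were dropped. */
static void count_drop(mr_cookie c, void *arg)
{
    (void)c;
    ++*(size_t *)arg;
}
```

With fewer than 100 pending entries the loop makes exactly one call, which
matches the expectation that one call per event is normally enough.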

Jason


From sfr at canb.auug.org.au  Thu May  7 18:53:56 2009
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Fri, 8 May 2009 11:53:56 +1000
Subject: [ofa-general] linux-next: infiniband tree build failure
Message-ID: <20090508115356.b97b8981.sfr@canb.auug.org.au>

Hi Roland,

Today's linux-next build (x86_64 allmodconfig) failed like this:

drivers/infiniband/hw/mlx4/mr.c: In function 'mlx4_ib_alloc_fast_reg_page_list':
drivers/infiniband/hw/mlx4/mr.c:242: error: label 'err_free_mfrpl' used but not defined

Caused by commit 88029ff3c862812b81745ae3d6557ede96e2d051 ("IB/mlx4:
Don't overwrite fast registration page list when posting work request").
Clearly not build tested :-(

I have used the version of the tree from next-20090507.
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From rdreier at cisco.com  Thu May  7 21:36:55 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 May 2009 21:36:55 -0700
Subject: [ofa-general] linux-next: infiniband tree build failure
In-Reply-To: <20090508115356.b97b8981.sfr@canb.auug.org.au> (Stephen
	Rothwell's message of "Fri, 8 May 2009 11:53:56 +1000")
References: <20090508115356.b97b8981.sfr@canb.auug.org.au>
Message-ID: <adatz3wqkgo.fsf@cisco.com>

 > Today's linux-next build (x86_64 allmodconfig) failed like this:
 > 
 > drivers/infiniband/hw/mlx4/mr.c: In function 'mlx4_ib_alloc_fast_reg_page_list':
 > drivers/infiniband/hw/mlx4/mr.c:242: error: label 'err_free_mfrpl' used but not defined
 > 
 > Caused by commit 88029ff3c862812b81745ae3d6557ede96e2d051 ("IB/mlx4:
 > Don't overwrite fast registration page list when posting work request").
 > Clearly not build tested :-(

My fault for editing after applying the patch.  Fixed now.

Thanks,
  Roland


From vlad at lists.openfabrics.org  Fri May  8 03:25:01 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri,  8 May 2009 03:25:01 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090508-0200 daily build status
Message-ID: <20090508102501.10072E61327@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Fri May  8 06:35:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 8 May 2009 09:35:38 -0400
Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to 
	libibnetdisc
In-Reply-To: <20090507085807.f1e743bb.weiny2@llnl.gov>
References: <20090430142958.5811218f.weiny2@llnl.gov>
	<20090506100744.GB10145@sk> <20090506093347.bb1b56be.weiny2@llnl.gov>
	<f0e08f230905070656w384fde21ydbd20a8092b21834@mail.gmail.com>
	<20090507085807.f1e743bb.weiny2@llnl.gov>
Message-ID: <f0e08f230905080635g6fd854c0od9a1b6a6f38bdc0d@mail.gmail.com>

On Thu, May 7, 2009 at 11:58 AM, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Thu, 7 May 2009 09:56:38 -0400
> Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>> Ira,
>>
>> On Wed, May 6, 2009 at 12:33 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>> > On Wed, 6 May 2009 13:07:44 +0300
>> > Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> >
> [snip]
>
>> >>
>> >> And wouldn't it be better instead of resolving selfport on each
>> >> extend_path() call to keep it already resolved somewhere in fabric
>> >> structure?
>> >
>> > This will only happen 1 time for each fabric being scan'ed because the path is
>> > reused...
>> >
>> > Oh wait a minute, I just reviewed the code...  For the current use case the
>> > path is reused since I am only scanning 1 node.  However, in the general case
>> > this is not true.  Sorry about that.  A new patch is below.
>>
>> Does combined routing always fall back on failure to using directed routing ?
>
> No, not automatically in the library.
>
>>
>> Also, would you summarize the use cases for combined routing in ibnetdiscover ?
>>
>
> ibnetdiscover does not use this feature.  It does a "full scan" which results
> in only DR routing.
>
> iblinkinfo and ibqueryerrors have the ability to request output for a single
> node.  The library was written to be able to scan from a given portid and a
> number of hops around that node.  However, at first this only supported a DR
> path in the portid.  If the user specified something like a GUID, iblinkinfo
> would scan the entire fabric and search the data which came back for that
> node.  Of course the problem with that is that on a large fabric it could take
> 8 seconds to come back with a single node of data.  If the SM/SA is up and
> running I decided it would be better to query for the LID of that node and
> start the scan from there.  That is what this patch adds.  iblinkinfo and
> ibqueryerrors will call ibnd_discover_fabric with the "from" == to the portid
> resolved from the SA and "hops" == 1.  If resolving the GUID or the limited
> scan fails ibqueryerrors and iblinkinfo then call the library again for a full
> fabric scan ("from" == NULL) and then search for the node in the fabric data
> returned.
>
> So that is the use case for doing this in the library.  But once again
> ibnetdiscover does not use this.  The other use case I could think of is doing
> a more extensive scan of multiple hops around a single node.  I have not
> implemented this yet but in my early testing it worked just fine starting with
> a DR path.  I believe this will still work with combined routing.
>
> Make sense?

Yes, this makes sense. Thanks for clarifying.

-- Hal

> Ira
>
>


From viral.mehta at einfochips.com  Fri May  8 06:45:06 2009
From: viral.mehta at einfochips.com (Viral Mehta)
Date: Fri, 08 May 2009 19:15:06 +0530
Subject: [ofa-general] ib_rdma_bw - bandwidth calculation
Message-ID: <4A043762.4090003@einfochips.com>

Hi,

While running the ib_rdma_bw commands below on a 32-bit platform, I am getting
unexpectedly low throughput.
Server: ib_rdma_bw -p 5019 -s 1048576 -t 500 -n 5000 -b -c  
Client: ib_rdma_bw -p 5019 -s 1048576 -t 500 -n 5000 -b -c 100.168.54.49

(If the iteration count is changed to 500, I get the expected throughput.)

Looking at the code, I found that ib_rdma_bw.c in the perftest package has
the following code:
>{
>        double cycles_to_units;
>        unsigned long tsize;    /* Transferred size, in megabytes */
>        ....
>        ....
>        cycles_to_units = get_cpu_mhz(0) * 1000000;
>
>        printf("%d: Bandwidth average: %g MB/sec\n", pid,
>                         tsize * iters * cycles_to_units /
>                         (tcompleted[iters - 1] - tposted[0]) / 0x100000);
>}
>

Here, tsize is "unsigned long", which is 4 bytes on 32-bit
platforms and 8 bytes on 64-bit platforms.
When I run the test with a 1M data size and 5000 iterations as
above, the product (tsize * iters) is at least
1048576 * 5000 = 5242880000, which overflows the 32-bit "unsigned long"
limit of 4294967295 and thus gives an unexpectedly low throughput.

The fix should be applied in the ib_rdma_bw application: either change the
calculation from (tsize * iters * cycles_to_units) to
(cycles_to_units * tsize * iters), so that the multiplication is carried
out in double, or change tsize to double.

Should I go ahead and submit a patch ?

Viral Mehta, Embedded Software Engineer, www.einfochips.com

P.S. - 
However, I do understand that we can run past the limits of double as well if we run the test with a larger data size and more iterations.
A better way would be to calculate the bandwidth after every fixed number of iterations (say 100).
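
The overflow described above can be reproduced on any platform with a small
sketch. It is not the perftest code: the function names are made up for the
illustration, and uint32_t is used as a stand-in for "unsigned long" on a
32-bit platform. With tsize = 1048576 and iters = 5000 the wrapped product
corresponds to 904 MB instead of 5000 MB.

```c
#include <assert.h>
#include <stdint.h>

/* Original operand order: tsize * iters is evaluated in 32-bit unsigned
 * arithmetic and wraps modulo 2^32 before it is converted to double. */
static double bw_wrapping(uint32_t tsize, int iters,
			  double cycles_to_units, double elapsed_cycles)
{
	assert(elapsed_cycles > 0.0);
	return tsize * (uint32_t)iters * cycles_to_units /
	       elapsed_cycles / 0x100000;
}

/* Proposed operand order: cycles_to_units comes first, so every
 * multiplication is carried out in double and nothing wraps. */
static double bw_double(uint32_t tsize, int iters,
			double cycles_to_units, double elapsed_cycles)
{
	assert(elapsed_cycles > 0.0);
	return cycles_to_units * tsize * iters /
	       elapsed_cycles / 0x100000;
}
```

With 500 iterations the product still fits in 32 bits, so both functions
agree, which matches the observation that the shorter run reports the
expected throughput.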


From hal.rosenstock at gmail.com  Fri May  8 06:57:58 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 8 May 2009 09:57:58 -0400
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
References: <AcnOHro1Z+E/64vgQaKNH19I6q2LoQAAQzxgAAUYQIA=>
	<39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
Message-ID: <f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>

On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>
> In addition to the original proposal we suggest allocating special MLID
> for the following MGIDs:
>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>  2. FF12401bxxxx00000000000000000001 - All hosts
>  3. FF12401bffff0000000000000000004d  - all Gateways
>  4. FF12401bxxxx00000000000000000002 - all routers
>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM

It turns out that collapsing multicast groups across PKeys on a single
MLID may not be such a good idea unless partition
enforcement by switches is disabled. There should be different modes
of collapsing depending on whether this is enabled or not.

> For all other cases we suggest that same MLID will be assigned to
> different MGIDs if:
>  1. They share the same P Key
>  2. Same signature - for IPoIB only
>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>        for example, the following are the same:
>        MGID1:  FF12401bABCD000000000000xxxxx755
>        MGID2:  FF12401bABCD000000000000yyyyyB55

Jason's approach to this was in a thread entitled "IPv6 and IPoIB
scalability issue":
http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
in which he proposed an MGID range (MGID/prefix syntax) for collapsing
IPv6 SNM groups. Additionally, there was the potential to distribute
the matched groups across some number of MLIDs. See also thread "[RFC]
OpenSM and IPv6 Scalability Proposal":
http://lists.openfabrics.org/pipermail/general/2008-June/051226.html

>  Implementation.
>  Since there will be many mgroups shared same mlid, mlid-array entry
> will contain
>  fleximap holding mgroups.
>  Searching of mgroup will be performed by mlid (index in the array) and
> mgid -
>  key in the fleximap.

Sasha proposed using an array rather than fleximap for this:
http://lists.openfabrics.org/pipermail/general/2008-June/051525.html

-- Hal

>
>
>  Slava Strebkov
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From vlad at lists.openfabrics.org  Sat May  9 03:21:53 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  9 May 2009 03:21:53 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090509-0200 daily build status
Message-ID: <20090509102153.70076E615A7@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From dorfman.eli at gmail.com  Sat May  9 03:32:06 2009
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sat, 9 May 2009 13:32:06 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
Message-ID: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>

On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>>
>> In addition to the original proposal we suggest allocating special MLID
>> for the following MGIDs:
>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>  4. FF12401bxxxx00000000000000000002 - all routers
>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>
> It turns out that collapsing multicast groups across PKeys on a single
> MLID may not be such a good idea unless partition
> enforcement by switches is disabled. There should be different modes
> of collapsing depending on whether this is enabled or not.

The idea is to allocate a different MLID for each of the above special MGIDs.

>> For all other cases we suggest that same MLID will be assigned to
>> different MGIDs if:
>>  1. They share the same P Key
>>  2. Same signature - for IPoIB only
>>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>>        for example, the following are the same:
>>        MGID1:  FF12401bABCD000000000000xxxxx755
>>        MGID2:  FF12401bABCD000000000000yyyyyB55
>
> Jason's approach to this was in a thread entitled "IPv6 and IPoIB
> scalability issue":
> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
> in which he proposed an MGID range (MGID/prefix syntax) for collapsing
> IPv6 SNM groups. Additionally, there was the potential to distribute
> the matched groups across some number of MLIDs. See also thread "[RFC]
> OpenSM and IPv6 Scalability Proposal":
> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html
>
>>  Implementation.
>>  Since there will be many mgroups shared same mlid, mlid-array entry
>> will contain
>>  fleximap holding mgroups.
>>  Searching of mgroup will be performed by mlid (index in the array) and
>> mgid -
>>  key in the fleximap.
>
> Sasha proposed using an array rather than fleximap for this:
> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html
>
> -- Hal
>
>>
>>
>>  Slava Strebkov


From hal.rosenstock at gmail.com  Sat May  9 03:41:27 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 9 May 2009 06:41:27 -0400
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
	<694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
Message-ID: <f0e08f230905090341m37697ad9n7b176238e705bc16@mail.gmail.com>

On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>>>
>>> In addition to the original proposal we suggest allocating special MLID
>>> for the following MGIDs:
>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>>  4. FF12401bxxxx00000000000000000002 - all routers
>>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>>
>> It turns out that collapsing multicast groups across PKeys on a single
>> MLID may not be such a good idea unless partition
>> enforcement by switches is disabled. There should be different modes
>> of collapsing depending on whether this is enabled or not.
>
> The idea is to allocate a different MLID per each of the above special MGIDs.

So one MLID per PKey in the MGID ?

What's the difference between xxxx's and ABCD in the syntax above ?
IPv6 is being collapsed per PKey too, right ?

>>> For all other cases we suggest that same MLID will be assigned to
>>> different MGIDs if:
>>>  1. They share the same P Key
>>>  2. Same signature - for IPoIB only
>>>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>>>        for example, the following are the same:
>>>        MGID1:  FF12401bABCD000000000000xxxxx755
>>>        MGID2:  FF12401bABCD000000000000yyyyyB55
>>
>> Jason's approach to this was in a thread entitled "IPv6 and IPoIB
>> scalability issue":
>> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
>> in which he proposed an MGID range (MGID/prefix syntax) for collapsing
>> IPv6 SNM groups. Additionally, there was the potential to distribute
>> the matched groups across some number of MLIDs. See also thread "[RFC]
>> OpenSM and IPv6 Scalability Proposal":
>> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html
>>
>>>  Implementation.
>>>  Since there will be many mgroups shared same mlid, mlid-array entry
>>> will contain
>>>  fleximap holding mgroups.
>>>  Searching of mgroup will be performed by mlid (index in the array) and
>>> mgid -
>>>  key in the fleximap.
>>
>> Sasha proposed using an array rather than fleximap for this:
>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html
>>
>> -- Hal
>>
>>>
>>>
>>>  Slava Strebkov


From dorfman.eli at gmail.com  Sat May  9 04:29:23 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sat, 09 May 2009 14:29:23 +0300
Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr.c bug in opensm LID
	assignment
Message-ID: <4A056913.7010700@gmail.com>

 Fix wrong check of lid persistent range:
 used lids were not properly checked, which
 caused duplicate lid assignment in some cases.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_lid_mgr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 14601e1..e1d5106 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -595,7 +595,7 @@ static boolean_t lid_mgr_is_range_not_persistent(IN osm_lid_mgr_t * p_mgr,
 		return FALSE;
 
 	for (i = lid; i < lid + num_lids; i++)
-		if (p_mgr->used_lids[lid])
+		if (p_mgr->used_lids[i])
 			return FALSE;
 
 	return TRUE;
-- 
1.5.3.6
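
The effect of the one-character fix can be seen in a standalone sketch. It
only mirrors the loop touched by the patch: the function names, the flat
uint8_t table and the explicit table length are assumptions made for the
illustration and are not the OpenSM data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed loop: every LID in [lid, lid + num_lids) is tested. */
static bool range_is_free(const uint8_t *used, size_t table_len,
			  uint16_t lid, uint16_t num_lids)
{
	size_t i;

	assert(used != NULL);
	for (i = lid; i < (size_t)lid + num_lids; i++)
		if (i >= table_len || used[i])
			return false;
	return true;
}

/* Pre-patch loop: the index is always "lid", so only the first LID of the
 * range is tested, num_lids times over. */
static bool range_is_free_buggy(const uint8_t *used, size_t table_len,
				uint16_t lid, uint16_t num_lids)
{
	size_t i;

	assert(used != NULL);
	for (i = lid; i < (size_t)lid + num_lids; i++)
		if (lid >= table_len || used[lid])
			return false;
	return true;
}
```

If LID 5 is in use and the range 4..7 is requested, the pre-patch loop
reports the range as free because it only ever looks at LID 4, which is how a
duplicate LID could be handed out.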


From dorfman.eli at gmail.com  Sat May  9 04:31:30 2009
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sat, 9 May 2009 14:31:30 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <f0e08f230905090341m37697ad9n7b176238e705bc16@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
	<694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
	<f0e08f230905090341m37697ad9n7b176238e705bc16@mail.gmail.com>
Message-ID: <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com>

On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>>>>
>>>> In addition to the original proposal we suggest allocating special MLID
>>>> for the following MGIDs:
>>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>>>  4. FF12401bxxxx00000000000000000002 - all routers
>>>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>>>
>>> It turns out that collapsing multicast groups across PKeys on a single
>>> MLID may not be such a good idea unless partition
>>> enforcement by switches is disabled. There should be different modes
>>> of collapsing depending on whether this is enabled or not.
>>
>> The idea is to allocate a different MLID per each of the above special MGIDs.
>
> So one MLID per PKey in the MGID ?
yes

> What's the difference between xxxx's and ABCD in the syntax above ?
none. should be the same.

> IPv6 is being collapsed per PKey too, right ?
yes

>>>> For all other cases we suggest that same MLID will be assigned to
>>>> different MGIDs if:
>>>>  1. They share the same P Key
>>>>  2. Same signature - for IPoIB only
>>>>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>>>>        for example, the following are the same:
>>>>        MGID1:  FF12401bABCD000000000000xxxxx755
>>>>        MGID2:  FF12401bABCD000000000000yyyyyB55
>>>
>>> Jason's approach to this was in a thread entitled "IPv6 and IPoIB
>>> scalability issue":
>>> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
>>> in which he proposed an MGID range (MGID/prefix syntax) for collapsing
>>> IPv6 SNM groups. Additionally, there was the potential to distribute
>>> the matched groups across some number of MLIDs. See also thread "[RFC]
>>> OpenSM and IPv6 Scalability Proposal":
>>> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html
>>>
>>>>  Implementation.
>>>>  Since there will be many mgroups shared same mlid, mlid-array entry
>>>> will contain
>>>>  fleximap holding mgroups.
>>>>  Searching of mgroup will be performed by mlid (index in the array) and
>>>> mgid -
>>>>  key in the fleximap.
>>>
>>> Sasha proposed using an array rather than fleximap for this:
>>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html
>>>
>>> -- Hal
>>>
>>>>
>>>>
>>>>  Slava Strebkov


From hal.rosenstock at gmail.com  Sat May  9 05:26:05 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 9 May 2009 08:26:05 -0400
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
	<694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
	<f0e08f230905090341m37697ad9n7b176238e705bc16@mail.gmail.com>
	<694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com>
Message-ID: <f0e08f230905090526m1ae28d21s64922ea53f04564d@mail.gmail.com>

On Sat, May 9, 2009 at 7:31 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
> On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
>>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>>>>>
>>>>> In addition to the original proposal we suggest allocating special MLID
>>>>> for the following MGIDs:
>>>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>>>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>>>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>>>>  4. FF12401bxxxx00000000000000000002 - all routers
>>>>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>>>>
>>>> It turns out that collapsing multicast groups across PKeys on a single
>>>> MLID may not be such a good idea unless partition
>>>> enforcement by switches is disabled. There should be different modes
>>>> of collapsing depending on whether this is enabled or not.
>>>
>>> The idea is to allocate a different MLID per each of the above special MGIDs.
>>
>> So one MLID per PKey in the MGID ?
> yes
>
>> What's the difference between xxxx's and ABCD in the syntax above ?
> none. should be the same.

Doesn't the xxxxxx for IPv6 mean mask these nibbles though ?

>
>> IPv6 is being collapsed per PKey too, right ?
> yes
>
>>>>> For all other cases we suggest that same MLID will be assigned to
>>>>> different MGIDs if:
>>>>>  1. They share the same P Key
>>>>>  2. Same signature - for IPoIB only
>>>>>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>>>>>        for example, the following are the same:
>>>>>        MGID1:  FF12401bABCD000000000000xxxxx755
>>>>>        MGID2:  FF12401bABCD000000000000yyyyyB55
>>>>
>>>> Jason's approach to this was in a thread entitled "IPv6 and IPoIB
>>>> scalability issue":
>>>> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
>>>> in which he proposed an MGID range (MGID/prefix syntax) for collapsing
>>>> IPv6 SNM groups. Additionally, there was the potential to distribute
>>>> the matched groups across some number of MLIDs. See also thread "[RFC]
>>>> OpenSM and IPv6 Scalability Proposal":
>>>> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html
>>>>
>>>>>  Implementation.
>>>>>  Since there will be many mgroups shared same mlid, mlid-array entry
>>>>> will contain
>>>>>  fleximap holding mgroups.
>>>>>  Searching of mgroup will be performed by mlid (index in the array) and
>>>>> mgid -
>>>>>  key in the fleximap.
>>>>
>>>> Sasha proposed using an array rather than fleximap for this:
>>>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html
>>>>
>>>> -- Hal
>>>>
>>>>>
>>>>>
>>>>>  Slava Strebkov


From dorfman.eli at gmail.com  Sat May  9 22:51:33 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 10 May 2009 08:51:33 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <f0e08f230905090526m1ae28d21s64922ea53f04564d@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>	
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>	
	<694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>	
	<f0e08f230905090341m37697ad9n7b176238e705bc16@mail.gmail.com>	
	<694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com>
	<f0e08f230905090526m1ae28d21s64922ea53f04564d@mail.gmail.com>
Message-ID: <4A066B65.8030704@gmail.com>

Hal Rosenstock wrote:
> On Sat, May 9, 2009 at 7:31 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
>> On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
>>>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>>>>>> In addition to the original proposal we suggest allocating special MLID
>>>>>> for the following MGIDs:
>>>>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>>>>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>>>>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>>>>>  4. FF12401bxxxx00000000000000000002 - all routers
>>>>>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>>>>> It turns out that collapsing multicast groups across PKeys on a single
>>>>> MLID may not be such a good idea unless partition
>>>>> enforcement by switches is disabled. There should be different modes
>>>>> of collapsing depending on whether this is enabled or not.
>>>> The idea is to allocate a different MLID per each of the above special MGIDs.
>>> So one MLID per PKey in the MGID ?
>> yes
>>
>>> What's the difference between xxxx's and ABCD in the syntax above ?
>> none. should be the same.
> 
> Doesn't the xxxxxx for IPv6 mean mask these nibbles though ?

For IPv6, ABCD is the pkey and xxxxxx is the mask.
To make it the same as for the IPv4 groups, we can use the following notation (mmmmmm = mask and xxxx = pkey):
FF12601bxxxx000000000001ffmmmmmm
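
To make the notation concrete, here is a rough sketch of how the three
matching criteria from the proposal (same signature, same P_Key, same low
bits) could be computed from a raw 16-byte MGID. The struct and function names
are hypothetical, the sketch is not OpenSM code, and it assumes the IPoIB MGID
layout used in the examples of this thread: signature in bytes 2-3 (0x401b or
0x601b), P_Key in bytes 4-5.

```c
#include <assert.h>
#include <stdint.h>

#define MGID_LEN 16

/* Hypothetical collapse key: MGIDs with equal keys would share an MLID. */
struct mgid_key {
	uint16_t signature; /* bytes 2-3 of the MGID */
	uint16_t pkey;      /* bytes 4-5 of the MGID */
	uint32_t low_bits;  /* lowest mask_bits bits of the MGID */
};

static struct mgid_key mgid_collapse_key(const uint8_t mgid[MGID_LEN],
					 unsigned mask_bits)
{
	struct mgid_key key;
	uint32_t tail, mask;

	assert(mask_bits <= 32); /* the sketch only looks at the last 4 bytes */

	tail = ((uint32_t)mgid[12] << 24) | ((uint32_t)mgid[13] << 16) |
	       ((uint32_t)mgid[14] << 8) | (uint32_t)mgid[15];
	mask = mask_bits == 32 ? 0xffffffffu : ((1u << mask_bits) - 1u);

	key.signature = (uint16_t)((mgid[2] << 8) | mgid[3]);
	key.pkey = (uint16_t)((mgid[4] << 8) | mgid[5]);
	key.low_bits = tail & mask;
	return key;
}

/* Returns 1 if both MGIDs would be collapsed onto the same MLID. */
static int mgid_same_mlid(const uint8_t a[MGID_LEN], const uint8_t b[MGID_LEN],
			  unsigned mask_bits)
{
	struct mgid_key ka = mgid_collapse_key(a, mask_bits);
	struct mgid_key kb = mgid_collapse_key(b, mask_bits);

	return ka.signature == kb.signature && ka.pkey == kb.pkey &&
	       ka.low_bits == kb.low_bits;
}
```

With the default of 10 bits, MGIDs ending in ...755 and ...B55 give the same
low bits (0x355), so they match as in the MGID1/MGID2 example; with a 12-bit
mask they would not.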


From slavas at voltaire.com  Sat May  9 22:54:55 2009
From: slavas at voltaire.com (Slava Strebkov)
Date: Sun, 10 May 2009 08:54:55 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA01F6F88C@exil.voltaire.com>


-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: Friday, May 08, 2009 4:58 PM
To: Slava Strebkov
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal

On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
>
> In addition to the original proposal we suggest allocating special MLID
> for the following MGIDs:
>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>  2. FF12401bxxxx00000000000000000001 - All hosts
>  3. FF12401bffff0000000000000000004d  - all Gateways
>  4. FF12401bxxxx00000000000000000002 - all routers
>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM

It turns out that collapsing multicast groups across PKeys on a single
MLID may not be such a good idea unless partition
enforcement by switches is disabled. There should be different modes
of collapsing depending on whether this is enabled or not.

> For all other cases we suggest that same MLID will be assigned to
> different MGIDs if:
>  1. They share the same P Key
>  2. Same signature - for IPoIB only
>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>        for example, the following are the same:
>        MGID1:  FF12401bABCD000000000000xxxxx755
>        MGID2:  FF12401bABCD000000000000yyyyyB55

Jason's approach to this was in a thread entitled "IPv6 and IPoIB
scalability issue":
http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
in which he proposed an MGID range (MGID/prefix syntax) for collapsing
IPv6 SNM groups. Additionally, there was the potential to distribute
the matched groups across some number of MLIDs. See also thread "[RFC]
OpenSM and IPv6 Scalability Proposal":
http://lists.openfabrics.org/pipermail/general/2008-June/051226.html

>  Implementation.
>  Since there will be many mgroups shared same mlid, mlid-array entry
> will contain
>  fleximap holding mgroups.
>  Searching of mgroup will be performed by mlid (index in the array) and
> mgid -
>  key in the fleximap.

Sasha proposed using an array rather than fleximap for this:
http://lists.openfabrics.org/pipermail/general/2008-June/051525.html

We propose an MLID-indexed array, but instead of a list of pointers to multicast groups, each entry will hold a fleximap sorted by MGID. Lookup in this is faster than in a simple list.
Slava

-- Hal

>
>
>  Slava Strebkov
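
The two-level lookup discussed in this message (array index by MLID, then
search by MGID) might look roughly like the sketch below. It deliberately
swaps the fleximap for a plain sorted array searched with bsearch(), so it
shows the shape of the lookup and not the complib API; all names are
hypothetical. The only outside fact used is that multicast LIDs start at
0xC000.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MGID_LEN  16
#define MLID_BASE 0xC000 /* first multicast LID */

struct mgroup {
	uint8_t mgid[MGID_LEN];
};

/* One slot per MLID; groups must be kept sorted by MGID (memcmp order). */
struct mlid_entry {
	struct mgroup *groups;
	size_t count;
};

static int mgroup_cmp(const void *a, const void *b)
{
	return memcmp(((const struct mgroup *)a)->mgid,
		      ((const struct mgroup *)b)->mgid, MGID_LEN);
}

/* Look up a group by MLID (array index) and MGID (sorted key). */
static struct mgroup *mgroup_find(struct mlid_entry *table, size_t table_len,
				  uint16_t mlid, const uint8_t mgid[MGID_LEN])
{
	struct mgroup key;
	size_t idx;

	assert(table != NULL);
	if (mlid < MLID_BASE)
		return NULL;
	idx = (size_t)mlid - MLID_BASE;
	if (idx >= table_len || table[idx].count == 0)
		return NULL;

	memcpy(key.mgid, mgid, MGID_LEN);
	return bsearch(&key, table[idx].groups, table[idx].count,
		       sizeof(key), mgroup_cmp);
}
```

A sorted container gives logarithmic search per MLID, which is the advantage
claimed over a simple list; whether an array or a fleximap is the better
container is exactly the open point in this thread.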


From dorfman.eli at gmail.com  Sat May  9 23:42:44 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 10 May 2009 09:42:44 +0300
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number
	from debug level log message
In-Reply-To: <20090507143346.GA1713@comcast.net>
References: <20090507143346.GA1713@comcast.net>
Message-ID: <4A067764.3040306@gmail.com>

Hal Rosenstock wrote:
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..17bac73 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  		op_vls = p_subn->opt.max_op_vls;
>  
>  	if (op_vls == 0) {
> -		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> +		OSM_LOG(p_log, OSM_LOG_DEBUG,
>  			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");

In this case I think that the log level should be changed to ERROR, since this is not normal behavior.

>  		op_vls = 1;
>  	}


From dorfman.eli at gmail.com  Sat May  9 23:49:41 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 10 May 2009 09:49:41 +0300
Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before
	max_op_vls comparison
In-Reply-To: <4A03043C.4010709@voltaire.com>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>	<4A029038.2040603@voltaire.com>	<20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com>
Message-ID: <4A067905.5060401@gmail.com>

Doron Shoham wrote:
> check if op_vls = 0 before max_op_vls comparison
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..4d1bbf2 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  	} else
>  		op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
>  
> -	/* support user limitation of max_op_vls */
> -	if (op_vls > p_subn->opt.max_op_vls)
> -		op_vls = p_subn->opt.max_op_vls;
> -
>  	if (op_vls == 0) {
> +		/* for non compliant implementations */	
>  		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "

I think that the level should be OSM_LOG_ERROR.

>  			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>  		op_vls = 1;
>  	}
> +	/* support user limitation of max_op_vls */
> +	if (op_vls > p_subn->opt.max_op_vls)
> +		op_vls = p_subn->opt.max_op_vls;
> +
>  
>  	OSM_LOG_EXIT(p_log);
>  	return op_vls;


From dorons at voltaire.com  Sun May 10 01:17:11 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Sun, 10 May 2009 11:17:11 +0300
Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before
	max_op_vls comparison
In-Reply-To: <4A067905.5060401@gmail.com>
References: <4A00386E.2050300@voltaire.com>	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>	<4A0043B0.3030400@gmail.com>	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>	<20090506112135.GG10145@sk>	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>	<4A029038.2040603@voltaire.com>	<20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com>
	<4A067905.5060401@gmail.com>
Message-ID: <4A068D87.6040801@voltaire.com>

check if op_vls = 0 before max_op_vls comparison

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_port.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..41b67ad 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 	} else
 		op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
 
-	/* support user limitation of max_op_vls */
-	if (op_vls > p_subn->opt.max_op_vls)
-		op_vls = p_subn->opt.max_op_vls;
-
 	if (op_vls == 0) {
-		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
+		/* for non compliant implementations */	
+		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: "
 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
 		op_vls = 1;
 	}
+	/* support user limitation of max_op_vls */
+	if (op_vls > p_subn->opt.max_op_vls)
+		op_vls = p_subn->opt.max_op_vls;
+
 
 	OSM_LOG_EXIT(p_log);
 	return op_vls;
-- 
1.5.4


From vlad at lists.openfabrics.org  Sun May 10 03:24:50 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 10 May 2009 03:24:50 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090510-0200 daily build status
Message-ID: <20090510102450.B8AB8E61434@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hal.rosenstock at gmail.com  Sun May 10 03:46:10 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 10 May 2009 06:46:10 -0400
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from
	debug level log message
In-Reply-To: <4A067764.3040306@gmail.com>
References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com>
Message-ID: <f0e08f230905100346y14352b1eq5033493aadc43f0d@mail.gmail.com>

On Sun, May 10, 2009 at 2:42 AM, Eli Dorfman (Voltaire)
<dorfman.eli at gmail.com> wrote:
> Hal Rosenstock wrote:
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>> ---
>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
>> index 2e6c642..17bac73 100644
>> --- a/opensm/opensm/osm_port.c
>> +++ b/opensm/opensm/osm_port.c
>> @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>>               op_vls = p_subn->opt.max_op_vls;
>>
>>       if (op_vls == 0) {
>> -             OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
>> +             OSM_LOG(p_log, OSM_LOG_DEBUG,
>>                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>
> In this case I think that level should be changed to ERROR since this is not the normal behavior.

Sasha has been adamant that any device supplied data errors use
something other than ERROR log level.

-- Hal

>
>>               op_vls = 1;
>>       }
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From hal.rosenstock at gmail.com  Sun May 10 03:47:29 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sun, 10 May 2009 06:47:29 -0400
Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before 
	max_op_vls comparison
In-Reply-To: <4A068D87.6040801@voltaire.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
	<4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com>
	<4A067905.5060401@gmail.com> <4A068D87.6040801@voltaire.com>
Message-ID: <f0e08f230905100347t57c381bex6f8dd4cc3d8bff12@mail.gmail.com>

On Sun, May 10, 2009 at 4:17 AM, Doron Shoham <dorons at voltaire.com> wrote:
> check if op_vls = 0 before max_op_vls comparison
>
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..41b67ad 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>        } else
>                op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
>
> -       /* support user limitation of max_op_vls */
> -       if (op_vls > p_subn->opt.max_op_vls)
> -               op_vls = p_subn->opt.max_op_vls;
> -
>        if (op_vls == 0) {
> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> +               /* for non compliant implementations */
> +               OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: "


Sasha has been adamant that any device supplied data errors use
something other than ERROR log level.

-- Hal

>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
>                op_vls = 1;
>        }
> +       /* support user limitation of max_op_vls */
> +       if (op_vls > p_subn->opt.max_op_vls)
> +               op_vls = p_subn->opt.max_op_vls;
> +
>
>        OSM_LOG_EXIT(p_log);
>        return op_vls;
> --
> 1.5.4


From dorons at voltaire.com  Sun May 10 04:30:13 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Sun, 10 May 2009 14:30:13 +0300
Subject: [ofa-general] [PATCH] saquery: fix -c argument
In-Reply-To: <4A030525.7090209@voltaire.com>
References: <4A030525.7090209@voltaire.com>
Message-ID: <4A06BAC5.40405@voltaire.com>

set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 infiniband-diags/src/saquery.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 4dcd712..5920eda 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -1470,7 +1470,7 @@ static int process_opt(void *context, int ch, char *optarg)
 		node_print_desc = ALL_DESC;
 		break;
 	case 'c':
-		command = CLASS_PORT_INFO;
+		command = SAQUERY_CMD_CLASS_PORT_INFO;
 		break;
 	case 'S':
 		query_type = IB_SA_ATTR_SERVICERECORD;
-- 
1.5.4


Sorry,
forgot ';'

Thanks,
Doron


From amirv at mellanox.co.il  Sun May 10 23:52:42 2009
From: amirv at mellanox.co.il (Amir Vadai)
Date: Mon, 11 May 2009 09:52:42 +0300
Subject: [ofa-general] SDP error
In-Reply-To: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
References: <BAY139-W844589AC3C296DB8DC131AE690@phx.gbl>
Message-ID: <4A07CB3A.2030507@mellanox.co.il>

Hi,

Which kernel is it?
What command line do you execute, for example for ssh (on both the
client and the server)?
Please make sure that the ib_sdp module is loaded when you run the
programs. If it wasn't loaded automatically, please let me know.

- Amir


On 05/05/2009 03:53 PM, anthony garnier wrote:
> Hello,
>
> I'm running Debian 5.0 with OFED 1.4. RDMA works very well, but when
> I try to use the SDP protocol with ssh, Netperf, or a simple
> client-server program in C, I get socket errors like these:
>
> NetPIPE: can't open stream socket! errno=97 (for Netpipe)
>
> Address family not supported by protocol ssh (for ssh)
>
> Address family not supported by protocol (for client-server)
>
> Does anyone recognize these errors?

-- 
Amir Vadai
Software Eng.
Mellanox Technologies
mailto: amirv at mellanox.co.il
Tel +972-3-6259539
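
On Linux, errno 97 is EAFNOSUPPORT, whose message is "Address family not supported by protocol"; the kernel returns it from socket() when no protocol family is registered for the requested address family. The following is a minimal sketch of a probe for that condition. It assumes that SDP uses address family 27 (AF_INET_SDP in OFED, not defined by the libc headers); the function names are hypothetical. Whether a missing module is the actual cause of the errors reported in this thread is not established here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: OFED's SDP address family number; not provided by libc. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Returns 0 if an SDP stream socket can be created, else the errno value. */
int probe_sdp_socket(void)
{
	int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}

/* Maps a probe result to a short hint for the user. */
const char *sdp_probe_hint(int err)
{
	assert(err >= 0);	/* errno values are never negative */
	if (err == 0)
		return "SDP sockets are available";
	if (err == EAFNOSUPPORT)
		return "address family not registered; check that the SDP kernel module is loaded";
	return strerror(err);
}
```

Running the probe once before starting the application and once after loading the module (for example with modprobe) shows whether the address family has become available.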


From vlad at lists.openfabrics.org  Mon May 11 03:22:07 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 11 May 2009 03:22:07 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090511-0200 daily build status
Message-ID: <20090511102207.BC028E61364@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From jsquyres at cisco.com  Mon May 11 05:11:50 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 11 May 2009 08:11:50 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com><C61F6F4D.4AB3%bwbarre@sandia.gov><382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com><8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com><adaljpgyckk.fsf@cisco.com>
	<11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com>
Message-ID: <871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com>

On May 4, 2009, at 8:25 PM, Jeff Squyres (jsquyres) wrote:

> It was suggested today that a teleconference to discuss these issues
> might be much more useful (an hour-long teleconference can save a
> week's worth of emails!).  This will be a technical call to discuss
> memory registration issues; it will not be an EWG call.  I've setup a
> WebEx call for next Monday at the "normal" time: noon US Eastern, 9am
> US Pacific, 7pm Israel.  The invite will be coming to the ewg and
> general lists shortly.
>


Productive discussion about this issue is still occurring on the list  
-- I don't think we need this teleconf today.

-- 
Jeff Squyres
Cisco Systems


From perkinjo at cse.ohio-state.edu  Mon May 11 05:13:18 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon, 11 May 2009 08:13:18 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
Message-ID: <20090511121318.GD3045@cse.ohio-state.edu>

On Tue, May 05, 2009 at 04:57:09PM -0400, Jeff Squyres wrote:
> Roland and I chatted on the phone today; I think I now understand  
> Roland's counter-proposal (I clearly didn't before).  Let me try to  
> summarize:
>
> 1. Add a new verb for "set this userspace flag to 1 if mr X ever becomes 
> invalid"
> 2. Add a new verb for "no longer tell me if mr X ever becomes invalid" 
> (i.e., remove the effects of #1)
> 3. Add run-time query indicating whether #1 works
> 4. Add [optional] memory registration caching to libibverbs
>
> Prior to talking to Roland, I had envisioned *one* flag in userspace  
> that indicated whether any memory registrations had become invalid.   
> Roland's idea is that there is one flag *per registration* -- you can  
> instantly tell whether a specific registration is valid.
>
> Given this, let's keep the discussion going here in email -- perhaps the 
> teleconference next Monday may become moot.

It looks like there has been more discussion on how to implement this
idea.  Are we still planning on having this teleconference today?
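
The per-registration flag described in the quoted summary can be pictured with a small userspace sketch. Everything below is hypothetical: none of these types or functions exist in libibverbs, the proposed verbs are not called, and the flag is an ordinary variable that the test code sets by hand where the kernel would set it. The sketch only shows how a registration cache could consult one flag per entry before reusing a registration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MR_CACHE_SLOTS 4

/* Hypothetical cache entry; a real one would also hold the struct ibv_mr. */
struct mr_cache_entry {
	void *addr;
	size_t len;
	volatile uint32_t invalid;	/* would be set to 1 by the kernel (item 1) */
	int in_use;
};

struct mr_cache {
	struct mr_cache_entry slot[MR_CACHE_SLOTS];
};

/* Record a registration; a real cache would register the memory here. */
struct mr_cache_entry *mr_cache_insert(struct mr_cache *c, void *addr, size_t len)
{
	int i;

	for (i = 0; i < MR_CACHE_SLOTS; i++) {
		struct mr_cache_entry *e = &c->slot[i];

		if (!e->in_use) {
			e->addr = addr;
			e->len = len;
			e->invalid = 0;
			e->in_use = 1;
			return e;
		}
	}
	return NULL;
}

/* Return a still-valid entry covering [addr, addr + len), or NULL. */
struct mr_cache_entry *mr_cache_lookup(struct mr_cache *c, void *addr, size_t len)
{
	uintptr_t lo = (uintptr_t)addr;
	int i;

	for (i = 0; i < MR_CACHE_SLOTS; i++) {
		struct mr_cache_entry *e = &c->slot[i];
		uintptr_t base = (uintptr_t)e->addr;

		if (!e->in_use)
			continue;
		if (e->invalid) {	/* per-registration check: drop stale entry */
			e->in_use = 0;
			continue;
		}
		if (lo >= base && len <= e->len && lo - base <= e->len - len)
			return e;
	}
	return NULL;
}

/* Returns 1 if a hit turns into a miss once the entry's flag is set. */
int mr_flag_demo(void)
{
	static char buf[4096];
	struct mr_cache cache;
	struct mr_cache_entry *e;

	memset(&cache, 0, sizeof(cache));
	e = mr_cache_insert(&cache, buf, sizeof(buf));
	assert(e != NULL);
	if (mr_cache_lookup(&cache, buf + 16, 64) != e)
		return 0;
	e->invalid = 1;		/* stands in for the kernel setting the flag */
	return mr_cache_lookup(&cache, buf + 16, 64) == NULL;
}
```

With a single global flag, the cache would have to revalidate or flush every entry when the flag changed; with one flag per entry, only the affected registration is dropped.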

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu  Mon May 11 05:14:05 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon, 11 May 2009 08:14:05 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com>
References: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com>
	<871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com>
Message-ID: <20090511121405.GE3045@cse.ohio-state.edu>

On Mon, May 11, 2009 at 08:11:50AM -0400, Jeff Squyres wrote:
> On May 4, 2009, at 8:25 PM, Jeff Squyres (jsquyres) wrote:
>
>> It was suggested today that a teleconference to discuss these issues
>> might be much more useful (an hour-long teleconference can save a
>> week's worth of emails!).  This will be a technical call to discuss
>> memory registration issues; it will not be an EWG call.  I've setup a
>> WebEx call for next Monday at the "normal" time: noon US Eastern, 9am
>> US Pacific, 7pm Israel.  The invite will be coming to the ewg and
>> general lists shortly.
>>
>
>
> Productive discussion about this issue is still occurring on the list -- 
> I don't think we need this teleconf today.

OK thanks.  You can ignore my just sent email.

>
> -- 
> Jeff Squyres
> Cisco Systems

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From sebastien.dugue at bull.net  Mon May 11 05:38:21 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 11 May 2009 14:38:21 +0200
Subject: [ofa-general] [PATCH] mstflint - Fix redirection to /dev/null in
 hca_self_test.ofed
Message-ID: <20090511143821.4b0746ef@frecb007965>


  Redirect 'rpm -qa' stderr to /dev/null instead of null.

Signed-Off-By: Sebastien Dugue <sebastien.dugue at bull.net>
---
 hca_self_test.ofed |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hca_self_test.ofed b/hca_self_test.ofed
index c7c5492..4f29080 100755
--- a/hca_self_test.ofed
+++ b/hca_self_test.ofed
@@ -168,7 +168,7 @@ else
     fi
 
     if [ $RPM_KER_VER -ne 0 ]; then
-        RPM_CUR_BOOTED_KER=`rpm -qa 2> null| grep kernel-ib | grep $(echo $BOOTED_KER | sed s/-/_/) | wc -l`
+        RPM_CUR_BOOTED_KER=`rpm -qa 2> /dev/null| grep kernel-ib | grep $(echo $BOOTED_KER | sed s/-/_/) | wc -l`
         if [ $RPM_CUR_BOOTED_KER -eq 0 ]; then
             echo -e "Host Driver RPM Check .................. ${red}FAIL"
             tput sgr0
-- 
1.6.3.rc3.12.gb7937


From sokar6012 at hotmail.com  Mon May 11 07:06:29 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Mon, 11 May 2009 14:06:29 +0000
Subject: [ofa-general] Install ofed 1.4 on XEN
Message-ID: <BAY139-W3637F7328E8D02BD6EC753AE630@phx.gbl>


Hi,
I installed OFED 1.4 (from http://alioth.debian.org/projects/pkg-ofed/) on Xen. All the packages installed correctly and my HCAs are recognized, but when I try to build the kernel module with

module-assistant prepare
module-assistant build ofa-kernel

I get the following error in the log file:
Failed executing /usr/bin/quiltmake[1]: *** [kdist_config] Error 1
make[1]: Leaving directory `/usr/src/modules/ofa-kernel`
make: *** [kdist_build] error 2

Does anyone recognize this error?

Regards
Anthony


From gmpc at sanger.ac.uk  Mon May 11 07:16:05 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon, 11 May 2009 15:16:05 +0100
Subject: [ofa-general] Install ofed 1.4 on XEN
In-Reply-To: <BAY139-W3637F7328E8D02BD6EC753AE630@phx.gbl>
References: <BAY139-W3637F7328E8D02BD6EC753AE630@phx.gbl>
Message-ID: <4A083325.3090008@sanger.ac.uk>

anthony garnier wrote:
> Hi,
> I installed OFED 1.4 (from http://alioth.debian.org/projects/pkg-ofed/)
> on Xen. All the packages installed correctly and my HCAs are recognized,
> but when I try to build the kernel module with
> 
> module-assistant prepare
> module-assistant build ofa-kernel
> 
> I get the following error in the log file:
> Failed executing /usr/bin/quiltmake[1]: *** [kdist_config] Error 1
> make[1]: Leaving directory `/usr/src/modules/ofa-kernel`
> make: *** [kdist_build] error 2
> 
> Does anyone recognize this error?

Do you have the quilt package installed? It is required for the build.

Cheers,

Guy


-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From tziporet at mellanox.co.il  Mon May 11 08:32:05 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 11 May 2009 18:32:05 +0300
Subject: [ofa-general] OFED 1.4.1-rc5  is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>

 
Hi,

OFED-1.4.1-rc5  release is available on

http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc5.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/  for
OFED 1.4.1

Vladimir & Tziporet

========================================================================


Release information:
------------------------------
Linux Operating Systems:
      - RedHat EL4 up4:  2.6.9-42.ELsmp      *
      - RedHat EL4 up5:  2.6.9-55.ELsmp
      - RedHat EL4 up6:  2.6.9-67.ELsmp
      - RedHat EL4 up7:  2.6.9-78.ELsmp
      - RedHat EL5:        2.6.18-8.el5
      - RedHat EL5 up1:  2.6.18-53.el5
      - RedHat EL5 up2:  2.6.18-92.el5
      - RedHat EL5 up3:  2.6.18-128.el5
      - OEL 4.5:              2.6.9-55.ELsmp
      - OEL 5.2:              2.6.18-92.el5
      - CentOS 5.2:         2.6.18-92.el5
      - Fedora C9:           2.6.25-14.fc9          *
      - SLES10:              2.6.16.21-0.8-smp
      - SLES10 SP1:       2.6.16.46-0.12-smp
      - SLES10 SP1 up1: 2.6.16.53-0.16-smp
      - SLES10 SP2:       2.6.16.60-0.21-smp
      - SLES11 GA:         2.6.27.13-1-default
      - OpenSuSE 10.3:   2.6.22.5-31             *
      - kernel.org:             2.6.26 and 2.6.27

    * Minimal QA for these versions

Systems:
      * x86_64
      * x86
      * ia64
      * ppc64


Main Changes from OFED-1.4.1-rc4
==========================
- mlx4_en: Updated driver to version 1.4.1 that was released by Mellanox
- Added an error in case of mlx4 library mismatch with kernel (due to
XRC support)
- 3 bugs fixed (see attachment)
- Updated bonding package: ib-bonding-0.9.0-40
- Attached kernel git tree changes for details
- Updated documentation


Tasks that should be completed for GA (May 14):
====================================
1. High priority bug fixes - see list below
2. Complete documentation update

Open bugs:
========
bug_id  bug_severity  op_sys  assigned_to                   description
1596    cri           Othe    Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
1616    cri           RHEL    jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
1571    cri           RHEL    vu at mellanox.com            nfsrdma server crash @test5 connectathon basic test,

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc4_rc5.log
Type: application/octet-stream
Size: 8057 bytes
Desc: ofed-1.4.1-rc4_rc5.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090511/9d198441/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc5-fixed-bugs.csv
Type: application/octet-stream
Size: 469 bytes
Desc: ofed-1.4.1-rc5-fixed-bugs.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090511/9d198441/attachment-0001.obj>

From swise at opengridcomputing.com  Mon May 11 08:35:09 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 May 2009 10:35:09 -0500
Subject: [ofa-general] [PATCH ofed-1.4.1 relnotes] Update cxgb3 release notes
	for 1.4.1
Message-ID: <20090511153509.17504.51102.stgit@build.ogc.int>

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 cxgb3_release_notes.txt |   38 ++++++++------------------------------
 1 files changed, 8 insertions(+), 30 deletions(-)

diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt
index 5f2edaa..4df6779 100644
--- a/cxgb3_release_notes.txt
+++ b/cxgb3_release_notes.txt
@@ -1,42 +1,20 @@
             Open Fabrics Enterprise Distribution (OFED)
                 CHELSIO T3 RNIC RELEASE NOTES
-			December 2008
+			May 2009
 
 
 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
-Chelsio S310/320 and R310/320 series adapters.  Make sure you choose the
-'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms.
+Chelsio S series adapters.  Make sure you choose the 'cxgb3' and
+'libcxgb3' options when generating your ofed-1.4.1 rpms.
 
 ============================================
-New for ofed-1.4
+New for ofed-1.4.1
 ============================================
 
-- 7.0 Firmware support.  See below for more information on updating
-your RNIC to the latest firmware.
-
-- Memory Managment Extensions including:
-	- Fast register memory regions
-	- Invalidate local memory region work request
-	- Zero stag support via the local DMA lkey field
-	- Read with invalidate local stag work request
-
-- RDS bcopy mode enabled for iWARP devices
-
-============================================
-Recent Enhancements
-============================================
+- NFSRDMA support.
 
-- Various MPI libraries are enabled via a new iw_cxgb3 module option
-called peer2peer.  When loading iw_cxgb3, set peer2peer=1 to enable Intel
-MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released
-with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7).
-This option must be set on all systems in your cluster.  See more info
-below on running these MPIs.  NOTE: None of these MPIs are included in
-the ofed-1.4 release.  Contact the specific vendors for obtaining the
-MPI code.  Open MPI can be pulled from www.open-mpi.org.
-
-- Large memory registration.  User applications can now register > 30MB 
-memory regions.
+- 7.4 Firmware support.  See below for more information on updating
+your RNIC to the latest firmware.
 
 ============================================
 Enabling Various MPIs
@@ -64,7 +42,7 @@ chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
 Intel MPI:
 =============
 
-The following env vars enable Intel MPI version 3.1.038.  Place these
+The following env vars enable Intel MPI version >= 3.1.038.  Place these
 in your user env after installing and setting up Intel MPI:
 
 export RSH=ssh


From swise at opengridcomputing.com  Mon May 11 08:36:19 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 May 2009 10:36:19 -0500
Subject: [ofa-general] [PATCH ofed-1.4.1 relnotes] Update cxgb3 release notes
	for 1.4.1
Message-ID: <20090511153619.17559.19237.stgit@build.ogc.int>

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 cxgb3_release_notes.txt |   36 +++++++-----------------------------
 1 files changed, 7 insertions(+), 29 deletions(-)

diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt
index 5f2edaa..d1fdafc 100644
--- a/cxgb3_release_notes.txt
+++ b/cxgb3_release_notes.txt
@@ -1,42 +1,20 @@
             Open Fabrics Enterprise Distribution (OFED)
                 CHELSIO T3 RNIC RELEASE NOTES
-			December 2008
+			May 2009
 
 
 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
-Chelsio S310/320 and R310/320 series adapters.  Make sure you choose the
-'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms.
+Chelsio S series adapters.  Make sure you choose the 'cxgb3' and
+'libcxgb3' options when generating your ofed-1.4.1 rpms.
 
 ============================================
-New for ofed-1.4
+New for ofed-1.4.1
 ============================================
 
-- 7.0 Firmware support.  See below for more information on updating
-your RNIC to the latest firmware.
-
-- Memory Managment Extensions including:
-	- Fast register memory regions
-	- Invalidate local memory region work request
-	- Zero stag support via the local DMA lkey field
-	- Read with invalidate local stag work request
+- NFSRDMA support.
 
-- RDS bcopy mode enabled for iWARP devices
-
-============================================
-Recent Enhancements
-============================================
-
-- Various MPI libraries are enabled via a new iw_cxgb3 module option
-called peer2peer.  When loading iw_cxgb3, set peer2peer=1 to enable Intel
-MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released
-with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7).
-This option must be set on all systems in your cluster.  See more info
-below on running these MPIs.  NOTE: None of these MPIs are included in
-the ofed-1.4 release.  Contact the specific vendors for obtaining the
-MPI code.  Open MPI can be pulled from www.open-mpi.org.
-
-- Large memory registration.  User applications can now register > 30MB 
-memory regions.
+- 7.4 Firmware support.  See below for more information on updating
+your RNIC to the latest firmware.
 
 ============================================
 Enabling Various MPIs


From swise at opengridcomputing.com  Mon May 11 08:37:21 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 May 2009 10:37:21 -0500
Subject: [ofa-general] [PATCH ofed-1.4.1 cxgb3 relnotes] Update cxgb3 release
	notes for 1.4.1
Message-ID: <20090511153721.17587.46386.stgit@build.ogc.int>

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 cxgb3_release_notes.txt |   36 +++++++-----------------------------
 1 files changed, 7 insertions(+), 29 deletions(-)

diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt
index 5f2edaa..d1fdafc 100644
--- a/cxgb3_release_notes.txt
+++ b/cxgb3_release_notes.txt
@@ -1,42 +1,20 @@
             Open Fabrics Enterprise Distribution (OFED)
                 CHELSIO T3 RNIC RELEASE NOTES
-			December 2008
+			May 2009
 
 
 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
-Chelsio S310/320 and R310/320 series adapters.  Make sure you choose the
-'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms.
+Chelsio S series adapters.  Make sure you choose the 'cxgb3' and
+'libcxgb3' options when generating your ofed-1.4.1 rpms.
 
 ============================================
-New for ofed-1.4
+New for ofed-1.4.1
 ============================================
 
-- 7.0 Firmware support.  See below for more information on updating
-your RNIC to the latest firmware.
-
-- Memory Managment Extensions including:
-	- Fast register memory regions
-	- Invalidate local memory region work request
-	- Zero stag support via the local DMA lkey field
-	- Read with invalidate local stag work request
+- NFSRDMA support.
 
-- RDS bcopy mode enabled for iWARP devices
-
-============================================
-Recent Enhancements
-============================================
-
-- Various MPI libraries are enabled via a new iw_cxgb3 module option
-called peer2peer.  When loading iw_cxgb3, set peer2peer=1 to enable Intel
-MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released
-with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7).
-This option must be set on all systems in your cluster.  See more info
-below on running these MPIs.  NOTE: None of these MPIs are included in
-the ofed-1.4 release.  Contact the specific vendors for obtaining the
-MPI code.  Open MPI can be pulled from www.open-mpi.org.
-
-- Large memory registration.  User applications can now register > 30MB 
-memory regions.
+- 7.4 Firmware support.  See below for more information on updating
+your RNIC to the latest firmware.
 
 ============================================
 Enabling Various MPIs


From bmr at opengridcomputing.com  Mon May 11 11:03:16 2009
From: bmr at opengridcomputing.com (Brian M. Rzycki)
Date: Mon, 11 May 2009 13:03:16 -0500
Subject: [ofa-general] OFED 1.4.1-rc5 symbol disagreements on SLES 11 SP0
Message-ID: <8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com>

Greetings,

I have the following SLES 11 SP0 machine:

# cat /proc/cpuinfo | grep '^model name'
model name      : AMD Opteron(tm) Processor 244
model name      : AMD Opteron(tm) Processor 244

# cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 0

# uname -a
Linux demo1 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux

# free -m
             total       used       free     shared    buffers     cached
Mem:          1877        333       1544          0          7        148
-/+ buffers/cache:        177       1700
Swap:         2055          0       2055

I downloaded and installed OFED-1.4.1-rc5.tgz on the machine.  I
configured one of the Mellanox InfiniBand interfaces and then rebooted
the system.  When installing I chose 2,3:

2) Install OFED Software
3) All packages (all of Basic, HPC)

Upon reboot I see the following messages in the dmesg log:

ib_iser: disagrees about version of symbol ib_fmr_pool_unmap
ib_iser: Unknown symbol ib_fmr_pool_unmap
ib_iser: disagrees about version of symbol ib_create_cq
ib_iser: Unknown symbol ib_create_cq
ib_iser: disagrees about version of symbol rdma_resolve_addr
ib_iser: Unknown symbol rdma_resolve_addr
ib_iser: disagrees about version of symbol ib_create_fmr_pool
ib_iser: Unknown symbol ib_create_fmr_pool
ib_iser: disagrees about version of symbol ib_dereg_mr
ib_iser: Unknown symbol ib_dereg_mr
ib_iser: disagrees about version of symbol rdma_disconnect
ib_iser: Unknown symbol rdma_disconnect
ib_iser: disagrees about version of symbol rdma_resolve_route
ib_iser: Unknown symbol rdma_resolve_route
ib_iser: disagrees about version of symbol rdma_create_qp
ib_iser: Unknown symbol rdma_create_qp
ib_iser: disagrees about version of symbol ib_destroy_cq
ib_iser: Unknown symbol ib_destroy_cq
ib_iser: disagrees about version of symbol rdma_create_id
ib_iser: Unknown symbol rdma_create_id
ib_iser: disagrees about version of symbol rdma_destroy_qp
ib_iser: Unknown symbol rdma_destroy_qp
ib_iser: disagrees about version of symbol ib_get_dma_mr
ib_iser: Unknown symbol ib_get_dma_mr
ib_iser: disagrees about version of symbol ib_alloc_pd
ib_iser: Unknown symbol ib_alloc_pd
ib_iser: disagrees about version of symbol rdma_connect
ib_iser: Unknown symbol rdma_connect
ib_iser: disagrees about version of symbol rdma_destroy_id
ib_iser: Unknown symbol rdma_destroy_id
ib_iser: disagrees about version of symbol ib_dealloc_pd
ib_iser: Unknown symbol ib_dealloc_pd
ib_iser: disagrees about version of symbol ib_fmr_pool_map_phys
ib_iser: Unknown symbol ib_fmr_pool_map_phys

I don't see ib_iser.ko in the kernel updates directory:
# find  /lib/modules/$(uname -r)/updates -name '*.ko' | grep iser
#

It looks like the OFED installer isn't building ib_iser.ko even when I  
choose 2,3.

Thanks,
-Brian Rzycki


From roel.kluin at gmail.com  Mon May 11 13:25:07 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Mon, 11 May 2009 22:25:07 +0200
Subject: [ofa-general] [PATCH] ehca: remove driver_data direct access of
	struct device
Message-ID: <4A0889A3.8020803@gmail.com>

To avoid direct access to the driver_data pointer in struct device, the
functions dev_get_drvdata() and dev_set_drvdata() should be used.

Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
---
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 368311c..5acfb4c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct of_device *dev,
 
 	shca->ofdev = dev;
 	shca->ipz_hca_handle.handle = *handle;
-	dev->dev.driver_data = shca;
+	dev_set_drvdata(&dev->dev, shca);
 
 	ret = ehca_sense_attributes(shca);
 	if (ret < 0) {
@@ -878,7 +878,7 @@ probe1:
 
 static int __devexit ehca_remove(struct of_device *dev)
 {
-	struct ehca_shca *shca = dev->dev.driver_data;
+	struct ehca_shca *shca = dev_get_drvdata(&dev->dev);
 	unsigned long flags;
 	int ret;
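
For readers unfamiliar with the accessors, here is a userspace mock of the
pattern. It is only a sketch: struct device below is a one-field stand-in,
not the kernel's definition, and struct toy_shca, toy_probe() and
toy_remove() are invented names. Only dev_get_drvdata() and
dev_set_drvdata() are names taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct device. The real structure has many
 * more fields and its layout can change, which is why callers should go
 * through the accessors instead of touching driver_data directly. */
struct device {
	void *driver_data;
};

static inline void dev_set_drvdata(struct device *dev, void *data)
{
	dev->driver_data = data;
}

static inline void *dev_get_drvdata(const struct device *dev)
{
	return dev->driver_data;
}

/* Hypothetical driver-private state, standing in for struct ehca_shca. */
struct toy_shca {
	int id;
};

/* probe(): remember the private state on the device. */
static void toy_probe(struct device *dev, struct toy_shca *shca)
{
	assert(dev != NULL);
	dev_set_drvdata(dev, shca);
}

/* remove(): fetch the private state back and clear the pointer. */
static struct toy_shca *toy_remove(struct device *dev)
{
	struct toy_shca *shca = dev_get_drvdata(dev);

	dev_set_drvdata(dev, NULL);
	return shca;
}
```

The driver code never names the driver_data field, so the field can be
moved or renamed without touching the driver.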
 

From jgunthorpe at obsidianresearch.com  Mon May 11 14:14:46 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 11 May 2009 15:14:46 -0600
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
	<f0e08f230905080657j36733c0fj2a85f3c3073de806@mail.gmail.com>
	<694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
Message-ID: <20090511211446.GC16395@obsidianresearch.com>

On Sat, May 09, 2009 at 01:32:06PM +0300, Eli Dorfman wrote:
> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
> > On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov <slavas at voltaire.com> wrote:
> >>
> >> In addition to the original proposal we suggest allocating special MLID
> >> for the following MGIDs:
> >>   1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
> >>   2. FF12401bxxxx00000000000000000001 - All hosts
> >>   3. FF12401bffff0000000000000000004d - all Gateways
> >>   4. FF12401bxxxx00000000000000000002 - all routers
> >>   5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
> >
> > It turns out that collapsing multicast groups across PKeys on a single
> > MLID may not be such a good idea unless partition enforcement by
> > switches is disabled. There should be different modes of collapsing
> > depending on whether this is enabled or not.
> 
> The idea is to allocate a different MLID per each of the above special MGIDs.

In practice I think you'd be better off combining the All Nodes, All
hosts, All Gateways, All Routers and IPv4 broadcast groups onto a
single MLID and then distributing the SNM groups over some number of
additional MLIDs in an intelligent manner. The specialty groups are
not really used very much, while the purpose of the SNM group is
ND scalability.

If your network is large enough to care about this then it is probably
also large enough to benefit from multiple SNM groups.

Otherwise, you may as well lump them all together into the broadcast
MLID.
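
A rough sketch of the kind of mapping I mean. The MLID values, the
bucket count and the function names are made up for illustration; the SNM
test only looks at the FF..601b prefix and the ...0001ffxxxxxx tail from
the list quoted above, and the plain modulo is the simplest possible
stand-in for an "intelligent" distribution. This is not OpenSM code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MLID_BROADCAST 0xC000u /* hypothetical: shared by all specialty groups */
#define MLID_SNM_BASE  0xC001u /* hypothetical: first MLID set aside for SNM */

/* Nonzero if the 16-byte MGID looks like an IPv6 solicited-node multicast
 * group: FF.. 601b <pkey> 00 00 00 00 00 01 ff xx xx xx. */
static int mgid_is_ipv6_snm(const uint8_t mgid[16])
{
	static const uint8_t mid[7] = { 0, 0, 0, 0, 0, 0x01, 0xff };

	return mgid[0] == 0xff && mgid[2] == 0x60 && mgid[3] == 0x1b &&
	       memcmp(&mgid[6], mid, sizeof(mid)) == 0;
}

/* Map an MGID to an MLID: SNM groups are spread over n_snm_mlids buckets
 * keyed on the low 24 bits of the address; everything else shares the
 * broadcast MLID. */
static uint16_t mgid_to_mlid(const uint8_t mgid[16], unsigned n_snm_mlids)
{
	uint32_t low24;

	assert(n_snm_mlids > 0);
	if (!mgid_is_ipv6_snm(mgid))
		return MLID_BROADCAST;

	low24 = ((uint32_t)mgid[13] << 16) | ((uint32_t)mgid[14] << 8) | mgid[15];
	return (uint16_t)(MLID_SNM_BASE + low24 % n_snm_mlids);
}
```

With n_snm_mlids set to 1 this degenerates to two MLIDs in total, one for
the specialty groups and one for all SNM traffic.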

Jason


From caitlin.bestler at gmail.com  Mon May 11 14:23:58 2009
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 11 May 2009 14:23:58 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090507224806.GF16280@obsidianresearch.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
Message-ID: <469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com>

On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe
<jgunthorpe at obsidianresearch.com> wrote:
>
> Right, I was only thinking of a new driver call that was along the
> lines of update_mr_pages() that just updates the HCA's mapping with
> new page table entries atomically. It really would be device
> specific. If there is no call available then unregister/register +
> printk log is a fair generic implementation.
>
> To be clear, what I'm thinking is that this would only be invoked if

Both the IBTA and RDMAC verbs were defined so that the meaning of
L-Key/R-Key/STag + Address could not instantly change from "X" to "Y",
only from "X" to NULL and then NULL to "Y".

There are a lot of good reasons for this, especially for R-Keys or
remotely accessible STags. It ensures that all operations that started
when the translation was "X" are completed before any that will use the
"Y" translation can commence. That is not something we want to
accidentally undermine.

There really isn't a reason why this rule needed to apply to entire
Memory Regions. So I don't see a problem with allowing an
update_mr_pages() verb that changes a portion of an MR map, perhaps by
optimal machine specific hooks when available, without requiring the
entire MR be specified. But it must preserve the guarantee that all
operations initiated with translation "X" are completed before any
operations for translation "Y" can be initiated.

Preserving this guarantee should not be a problem for the free() then
reallocate scenarios that have been discussed.
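
To make the guarantee concrete, here is a toy userspace model. It is not
verbs code: struct toy_mr and every function name below are invented for
illustration. A new translation can only be installed once the entry has
been invalidated and every operation started under the old translation
has finished, which is the "X" to NULL to "Y" rule in miniature.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one translation entry; not a real verbs structure. */
struct toy_mr {
	const void *translation; /* NULL means invalid, no access allowed */
	unsigned inflight;       /* operations started and not yet completed */
};

/* Start an operation; refused (-1) while the entry is invalid. */
static int toy_op_begin(struct toy_mr *mr)
{
	if (mr->translation == NULL)
		return -1;
	mr->inflight++;
	return 0;
}

/* Complete an operation that toy_op_begin() accepted earlier. */
static void toy_op_end(struct toy_mr *mr)
{
	assert(mr->inflight > 0);
	mr->inflight--;
}

/* "X" -> NULL: from here on, new operations are refused. */
static void toy_invalidate(struct toy_mr *mr)
{
	mr->translation = NULL;
}

/* NULL -> "Y": only legal once invalidated and fully drained. */
static int toy_install(struct toy_mr *mr, const void *new_translation)
{
	if (mr->translation != NULL || mr->inflight != 0)
		return -1;
	mr->translation = new_translation;
	return 0;
}
```

An update_mr_pages() style verb could apply the same rule per page range
instead of per MR; the model above only tracks a single entry.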


From jgunthorpe at obsidianresearch.com  Mon May 11 14:40:54 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 11 May 2009 15:40:54 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com>
Message-ID: <20090511214054.GE16395@obsidianresearch.com>

On Mon, May 11, 2009 at 02:23:58PM -0700, Caitlin Bestler wrote:
> On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe
> <jgunthorpe at obsidianresearch.com> wrote:
> >
> > Right, I was only thinking of a new driver call that was along the
> > lines of update_mr_pages() that just updates the HCA's mapping with
> > new page table entries atomically. It really would be device
> > specific. If there is no call available then unregister/register +
> > printk log is a fair generic implementation.
> >
> > To be clear, what I'm thinking is that this would only be invoked if
> 
> Both the IBTA and RDMAC verbs were defined so that the meaning of
> L-Key/R-Key/STag + Address could not instantly change from "X" to
> "Y", only from "X" to NULL and then NULL to "Y".

Well, this is sort of a grey area. In one sense the meaning isn't
changing; just the underlying physical memory is being moved around by
the OS.

The notion that the verbs refer to some sort of invisible underlying
VM object is nice for an implementation but pretty useless for
MPI.

> There are a lot of good reasons for this, especially for R-Keys or
> remotely accessible STags. It ensures that all operations that
> started when the translation was "X" are completed before any that
> will use the "Y" translation can commence. That is not something we
> want to accidentally undermine.

I'm not sure I see how this helps. Synchronizing all this is the
responsibility of the application: if it wants to change the mapping
then it should be able to, and if it does so with poor timing then it
will have races and lose data <shrug>. As it stands today there are
already races where apps can lose data transferred after an unmap() or
transfer the wrong data after a mmap(), so the current model is already
broken from that perspective.

Of course an update verb has to operate with similar ordering
guarantees to register/unregister relative to the local work request
queue - that is to say if the verb is done out-of-line with the WR
queue then it must wait for the queue to flush before issuing the
update to the HCA - just like unregister - and then wait for the verb
to complete before returning to the app - just like register.

And we all wish for userspace FRMRs...

Jason


From greg at kroah.com  Mon May 11 14:05:05 2009
From: greg at kroah.com (Greg KH)
Date: Mon, 11 May 2009 14:05:05 -0700
Subject: [ofa-general] Re: [PATCH] infiniband: ehca: remove driver_data
	direct access of struct device
In-Reply-To: <OFD6B3383E.4B7D18D2-ONC12575AD.001C9A7B-C12575AD.001CADBA@de.ibm.com>
References: <20090504200022.GA22746@kroah.com>
	<OFD6B3383E.4B7D18D2-ONC12575AD.001C9A7B-C12575AD.001CADBA@de.ibm.com>
Message-ID: <20090511210505.GA31999@kroah.com>

On Tue, May 05, 2009 at 07:13:16AM +0200, Hoang-Nam Nguyen wrote:
> Hi,
> This patch looks fine to me. Thanks!

Thanks for reviewing it, I've added your "Acked-by" to the patch in my
tree.

greg k-h


From zhouyonghao at ict.ac.cn  Mon May 11 19:17:37 2009
From: zhouyonghao at ict.ac.cn (zhouyonghao at ict.ac.cn)
Date: Tue, 12 May 2009 10:17:37 +0800 (CST)
Subject: [ofa-general] How to establish IB communication more effectively?
Message-ID: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>

Hi all,
    I'm using libibverbs to build a cluster memory pool, and I use a TCP/IP
handshake to exchange memory information and establish the connection
before the IB communication starts. I found that this process takes a lot
of time (100 ms on a 1 GbE LAN), so I want to use rdma_cm or ib_ucm to
handle connection establishment. But I can't find sample code or API
documentation. Is there anything I missed?
    BTW, how is communication usually established in current OFED? Any
comparison or suggestion is appreciated; that would help me a lot.


From swise at opengridcomputing.com  Mon May 11 20:06:26 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 May 2009 22:06:26 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
 values are truncated by	the nfs rdma client.
In-Reply-To: <1242092150.16618.15.camel@heimdal.trondhjem.org>
References: <20090424190510.3134.90405.stgit@build.ogc.int>	
	<49F31A16.2080806@opengridcomputing.com>	
	<49F4AE86.4090908@opengridcomputing.com>	
	<49f515a5.1d1e640a.1c82.6677@mx.google.com>	
	<49F5ED55.1010607@opengridcomputing.com>	
	<1240855510.8818.9.camel@heimdal.trondhjem.org>	
	<1240856613.8818.16.camel@heimdal.trondhjem.org>	
	<49F60845.4010007@opengridcomputing.com>	
	<1240865214.8818.73.camel@heimdal.trondhjem.org>	
	<4A08A5C6.7040003@opengridcomputing.com>	
	<1242082203.1743.11.camel@heimdal.trondhjem.org>	
	<4A08BF1C.2050204@opengridcomputing.com>	
	<1242089066.1743.19.camel@heimdal.trondhjem.org>	
	<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>
	<1242092150.16618.15.camel@heimdal.trondhjem.org>
Message-ID: <4A08E7B2.1010907@opengridcomputing.com>

Trond Myklebust wrote:
> On Mon, 2009-05-11 at 21:14 -0400, Tom Talpey wrote:
>   
>> At 08:44 PM 5/11/2009, Trond Myklebust wrote:
>>     
>>> On Mon, 2009-05-11 at 19:13 -0500, Steve Wise wrote:
>>>       
>>>> Trond Myklebust wrote:
>>>>         
>>>>> On Mon, 2009-05-11 at 17:25 -0500, Steve Wise wrote:
>>>>>   
>>>>>           
>>>>>> Hey Trond,
>>>>>>
>>>>>> Will this bug fix make 2.6.30?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Steve.
>>>>>>     
>>>>>>             
>>>>> Not in the form it is in now. As I've said earlier, I'm not happy about
>>>>> the sunrpc layer having to circumvent ordinary type checking on
>>>>> non-sunrpc structures.
>>>>>
>>>>> Cheers
>>>>>     Trond
>>>>>           
>>>> How is it circumventing?  It's currently incorrectly casting a pointer 
>>>> into a u64.  That seems just broken to me.  Also, it's really the sunrpc 
>>>> rdma transport layer.  It deals specifically with rdma.  It _should_ 
>>>> know about rdma interfaces and types.
>>>>         
>>> The fact is that I'm simply not interested enough in rdma to tolerate
>>> hacks. If it isn't done cleanly, in a manner that I can maintain, then
>>> the whole transport layer comes out...
>>>       
>> I know exactly what you want - it's not what the code does now and
>> it's not an accessor function to set the hardware's u64 field. What's
>> needed is a new function to manage the entire RDMA triplet, and the
>> memory registration behind it, in the OFA code side. Put the hardware
>> goop below the line, IOW. I'll dust up Steve on this.
>>     
>
> This does indeed sound like what I'm looking for.
>
> There is a huge difference between having code that depends on well
> defined rdma interfaces, and code that depends on rdma hacks. A piece of
> code that requires casts from a non-local opaque type into another
> protocol-dependent non-local type will definitely fall in the latter
> category. I really don't care what the current code does, but a fix for
> that code is something that does it _correctly_; it is not yet another
> hack, whether or not it fixes a bug in the short term.
>
>    Trond
>
>   

Trond, I get your point, and we can certainly work on improving this 
with the rdma developer community.  But removing the one-line-broken 
cast will resolve a current crash situation for 2.6.30.  Can't we get 
this fix in 2.6.30 and work on the API improvements for 2.6.31?  I've 
CCed Roland and the ofa general list to get everyone involved in this 
thread so we can get this API design change going.

I agree we can clean this up moving forward, but let's fix the broken 
2.6.30 code.  

Will this work?

Steve.


From dotanba at gmail.com  Mon May 11 22:58:37 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 12 May 2009 08:58:37 +0300
Subject: Re: [ofa-general] How to establish IB communication more effectively?
In-Reply-To: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
Message-ID: <2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com>

You won't find such samples in the verbs library; they are in the
RDMA CM library (librdmacm). Look for rping or ucmatose.

Dotan

2009/5/12  <zhouyonghao at ict.ac.cn>:
> Hi all,
>    I'm using libibverbs to build a cluster memory pool, and I use a TCP/IP
> handshake to exchange memory information and establish the connection
> before the IB communication starts. I found that this process takes a lot
> of time (100 ms on a 1 GbE LAN), so I want to use rdma_cm or ib_ucm to
> handle connection establishment. But I can't find sample code or API
> documentation. Is there anything I missed?
>    BTW, how is communication usually established in current OFED? Any
> comparison or suggestion is appreciated; that would help me a lot.
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From dorfman.eli at gmail.com  Mon May 11 23:51:14 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 12 May 2009 09:51:14 +0300
Subject: [ofa-general] running ib diagnostics blocks
Message-ID: <4A091C62.8050906@gmail.com>

Hi,

What could be the reason that open("/dev/infiniband/umad0", O_RDWR|O_NONBLOCK)
blocks and does not return?

I did not find any errors in dmesg.

Thanks,
Eli


From or.gerlitz at gmail.com  Tue May 12 00:38:05 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 12 May 2009 10:38:05 +0300
Subject: [ofa-general] running ib diagnostics blocks
In-Reply-To: <4A091C62.8050906@gmail.com>
References: <4A091C62.8050906@gmail.com>
Message-ID: <15ddcffd0905120038l34f71fa1m7fba3e4218021b11@mail.gmail.com>

Eli Dorfman wrote:

> What could be the reason that open("/dev/infiniband/umad0",
> O_RDWR|O_NONBLOCK)
> blocks and does not return? I did not find any errors in dmesg.


Eli,

You can examine the kernel stacks of all processes, including yours, using
sysrq ($ echo 1 > /proc/sys/kernel/sysrq and then $ echo t >
/proc/sysrq-trigger) and then looking at dmesg. E.g., see if some other
process which deals with umad is in the D state.

Or.

From vlad at dev.mellanox.co.il  Tue May 12 00:45:02 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 12 May 2009 10:45:02 +0300
Subject: [ofa-general] OFED 1.4.1-rc5 recall
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
	<5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
Message-ID: <4A0928FE.9010007@dev.mellanox.co.il>

Hi,
OFED-1.4.1-rc5 was removed from OFA downloads.
OFED-1.4.1-rc6 will be released as soon as the dependency issue between nfs and ib_core is resolved and tested.

Regards,
Vladimir


From monis at Voltaire.COM  Tue May 12 02:31:40 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 12 May 2009 12:31:40 +0300
Subject: Re: [ofa-general] How to establish IB communication more effectively?
In-Reply-To: <2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
	<2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com>
Message-ID: <4A0941FC.6020606@Voltaire.COM>

Dotan Barak wrote:
> You won't find such samples in the verbs library; they are in the
> RDMA CM library (librdmacm). Look for rping or ucmatose.
> 
> Dotan
> 
> 2009/5/12  <zhouyonghao at ict.ac.cn>:
>> Hi all,
>>    I'm using libibverbs to build a cluster memory pool, and I use a TCP/IP
>> handshake to exchange memory information and establish the connection
>> before the IB communication starts. I found that this process takes a lot
>> of time (100 ms on a 1 GbE LAN), so I want to use rdma_cm or ib_ucm to
>> handle connection establishment. But I can't find sample code or API
>> documentation. Is there anything I missed?
>>    BTW, how is communication usually established in current OFED? Any
>> comparison or suggestion is appreciated; that would help me a lot.
>>
>>
The librdmacm RPM also includes detailed man pages (man rdma_cm).


From vlad at lists.openfabrics.org  Tue May 12 03:25:07 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 12 May 2009 03:25:07 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090512-0200 daily build status
Message-ID: <20090512102507.212B7E614E7@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hnguyen at linux.vnet.ibm.com  Tue May 12 03:00:41 2009
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Tue, 12 May 2009 12:00:41 +0200
Subject: [ofa-general] [PATCH] perftest: send_lat/bw: Attach to multicast
	group when QP is in INIT
Message-ID: <200905121200.41849.hnguyen@linux.vnet.ibm.com>

Subject: [PATCH] perftest: send_lat/bw: Attach to multicast group when QP is in INIT

If multicast is enabled, the current code of send_lat/bw attaches the QP
to a multicast group while it's still in RESET state.
Since the IB spec does not strictly specify the QP state for this operation
and ehca's current firmware does not allow attaching in RESET, this patch
moves the attach_mcast() function call after QP has been modified to INIT.

See also discussion thread http://lists.openfabrics.org/pipermail/general/2009-May/059450.html

Signed-off-by: Hoang-Nam Nguyen <hnguyen at de.ibm.com>
---
 send_bw.c  |   29 +++++++++++++++--------------
 send_lat.c |   30 +++++++++++++++---------------
 2 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/send_bw.c b/send_bw.c
index afabfa4..9a10ff3 100755
--- a/send_bw.c
+++ b/send_bw.c
@@ -421,20 +421,6 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 			return NULL;
 		}
 
-		if ((user_parm->connection_type==UD) && (user_parm->use_mcg) && (!user_parm->servername || user_parm->duplex)) {
-			union ibv_gid gid;
-			uint8_t mcg_gid[16] = MCG_GID;
-
-			/* use the local QP number as part of the mcg */
-			mcg_gid[11] = (user_parm->servername) ? 0 : 1;
-			*(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num;
-			memcpy(gid.raw, mcg_gid, 16);
-
-			if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) {
-				fprintf(stderr, "Couldn't attach QP to mcg\n");
-				return NULL;
-			}
-		}
 	}
 
 	{
@@ -457,6 +443,21 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 				fprintf(stderr, "Failed to modify UD QP to INIT\n");
 				return NULL;
 			}
+
+			if ((user_parm->use_mcg) && (!user_parm->servername || user_parm->duplex)) {
+				union ibv_gid gid;
+				uint8_t mcg_gid[16] = MCG_GID;
+
+				/* use the local QP number as part of the mcg */
+				mcg_gid[11] = (user_parm->servername) ? 0 : 1;
+				*(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num;
+				memcpy(gid.raw, mcg_gid, 16);
+
+				if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) {
+					fprintf(stderr, "Couldn't attach QP to mcg\n");
+					return NULL;
+				}
+			}
 		} else if (ibv_modify_qp(ctx->qp, &attr,
 					 IBV_QP_STATE              |
 					 IBV_QP_PKEY_INDEX         |
diff --git a/send_lat.c b/send_lat.c
index 1f21652..e1a1156 100755
--- a/send_lat.c
+++ b/send_lat.c
@@ -425,21 +425,6 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 			fprintf(stderr, "Couldn't create QP\n");
 			return NULL;
 		}
-
-		if ((user_parm->connection_type==UD) && (user_parm->use_mcg)) {
-			union ibv_gid gid;
-			uint8_t mcg_gid[16] = MCG_GID;
-
-			/* use the local QP number as part of the mcg */
-			mcg_gid[11] = (user_parm->servername) ? 0 : 1;
-			*(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num;
-			memcpy(gid.raw, mcg_gid, 16);
-
-			if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) {
-				fprintf(stderr, "Couldn't attach QP to mcg\n");
-				return NULL;
-			}
-		}
 	}
 
 	{
@@ -463,6 +448,21 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
 				fprintf(stderr, "Failed to modify UD QP to INIT\n");
 				return NULL;
 			}
+
+			if (user_parm->use_mcg) {
+				union ibv_gid gid;
+				uint8_t mcg_gid[16] = MCG_GID;
+
+				/* use the local QP number as part of the mcg */
+				mcg_gid[11] = (user_parm->servername) ? 0 : 1;
+				*(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num;
+				memcpy(gid.raw, mcg_gid, 16);
+
+				if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) {
+					fprintf(stderr, "Couldn't attach QP to mcg\n");
+					return NULL;
+				}
+			}
 		} else if (ibv_modify_qp(ctx->qp, &attr,
 					 IBV_QP_STATE              |
 					 IBV_QP_PKEY_INDEX         |
-- 
1.5.5
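
For reference, the GID construction that both hunks move can be read as
the standalone sketch below. The base GID bytes and the helper name are
placeholders (the real base is whatever MCG_GID expands to in perftest),
and memcpy() is used here instead of the pointer cast in the patch to
sidestep alignment questions. Nothing here calls into libibverbs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder base GID; perftest's MCG_GID may differ. */
static const uint8_t toy_mcg_base[16] = { 0xff, 0x01 };

/* Build a per-QP multicast GID. Byte 11 encodes the role (0 when a server
 * name was given, i.e. the client side; 1 otherwise) and bytes 12..15 hold
 * the local QP number in host byte order, as in the patch. */
static void build_mcg_gid(uint8_t gid[16], int has_servername, uint32_t qp_num)
{
	assert(gid != NULL);
	memcpy(gid, toy_mcg_base, 16);
	gid[11] = has_servername ? 0 : 1;
	memcpy(&gid[12], &qp_num, sizeof(qp_num));
}
```

The resulting 16 bytes are what would be copied into gid.raw before the
ibv_attach_mcast() call, which after this patch happens once the QP is
in INIT.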


From hnrose at comcast.net  Tue May 12 04:21:03 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 12 May 2009 07:21:03 -0400
Subject: [ofa-general] [PATCH] opensm/PerfMgr: Reduce host name length
Message-ID: <20090512112103.GA7715@comcast.net>


to what's needed (based on NodeDescription length)

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h
index 41a5810..33d1920 100644
--- a/opensm/include/opensm/osm_event_plugin.h
+++ b/opensm/include/opensm/osm_event_plugin.h
@@ -60,7 +60,7 @@ BEGIN_C_DECLS
 *
 *********/
 
-#define OSM_EPI_NODE_NAME_LEN (128)
+#define OSM_EPI_NODE_NAME_LEN (65)
 
 struct osm_opensm;
 /** =========================================================================
diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h
index d0eff73..42a47bd 100644
--- a/opensm/include/opensm/osm_perfmgr_db.h
+++ b/opensm/include/opensm/osm_perfmgr_db.h
@@ -131,7 +131,7 @@ typedef struct db_port {
 /** =========================================================================
  * group port counters for ports into the nodes
  */
-#define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE << 1)
+#define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE + 1)
 typedef struct db_node {
 	cl_map_item_t map_item;	/* must be first */
 	uint64_t node_guid;
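
To illustrate the sizing: NodeDescription is a fixed 64-byte field that
need not be NUL-terminated, so 64 + 1 bytes are enough to hold it as a C
string. A standalone sketch follows; NODE_DESC_SIZE mirrors
IB_NODE_DESCRIPTION_SIZE, and copy_node_name() is a made-up helper, not an
OpenSM function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NODE_DESC_SIZE 64                   /* mirrors IB_NODE_DESCRIPTION_SIZE */
#define NODE_NAME_SIZE (NODE_DESC_SIZE + 1) /* room for the terminating NUL */

/* Copy a raw NodeDescription (possibly not NUL-terminated) into a buffer
 * of NODE_NAME_SIZE bytes and terminate it. */
static void copy_node_name(char name[NODE_NAME_SIZE],
			   const uint8_t desc[NODE_DESC_SIZE])
{
	assert(name != NULL && desc != NULL);
	memcpy(name, desc, NODE_DESC_SIZE);
	name[NODE_DESC_SIZE] = '\0';
}
```

Even a description that uses all 64 bytes fits, so the previous
(IB_NODE_DESCRIPTION_SIZE << 1) and 128-byte buffers were larger than
necessary.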


From tziporet at dev.mellanox.co.il  Tue May 12 04:33:18 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 12 May 2009 14:33:18 +0300
Subject: [ofa-general] OFED 1.4.1-rc5 symbol disagreements on SLES 11 SP0
In-Reply-To: <8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com>
References: <8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com>
Message-ID: <4A095E7E.6030607@mellanox.co.il>

Brian M. Rzycki wrote:
> Greetings,
>
> I have the following SLES 11 SP0 machine:
>
>
> It looks like the OFED installer isn't building ib_iser.ko even when I 
> choose 2,3.
>
This is the same bug that was already reported on rc5. We removed rc5 and
will publish rc6 soon.

Tziporet


From tziporet at dev.mellanox.co.il  Tue May 12 04:34:10 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 12 May 2009 14:34:10 +0300
Subject: [ofa-general] Re: [PATCH ofed-1.4.1 cxgb3 relnotes] Update cxgb3
 release notes for 1.4.1
In-Reply-To: <20090511153721.17587.46386.stgit@build.ogc.int>
References: <20090511153721.17587.46386.stgit@build.ogc.int>
Message-ID: <4A095EB2.6070505@mellanox.co.il>

Steve Wise wrote:
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
> ---
>   

Applied
Tziporet


From HNGUYEN at de.ibm.com  Tue May 12 05:40:13 2009
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Tue, 12 May 2009 14:40:13 +0200
Subject: [ofa-general] Re: [PATCH] ehca: remove driver_data direct access of
	struct device
In-Reply-To: <4A0889A3.8020803@gmail.com>
References: <4A0889A3.8020803@gmail.com>
Message-ID: <OF8BA31CF1.C84202F8-ONC12575B4.004501A2-C12575B4.004599CE@de.ibm.com>

Hi,
Thanks for this patch, but I have to NACK it because:
1) Greg KH has already done a similar patch in his tree.
See http://lists.openfabrics.org/pipermail/general/2009-May/059442.html
2) Your patch is incomplete.

Regards
Nam

Roel Kluin <roel.kluin at gmail.com> wrote on 11.05.2009 22:25:07:

> From: Roel Kluin <roel.kluin at gmail.com>
> To: Hoang-Nam Nguyen/Germany/IBM at IBMDE
> Cc: general at lists.openfabrics.org, lkml <linux-kernel at vger.kernel.org>
> Date: 11.05.2009 22:25
> Subject: [PATCH] ehca: remove driver_data direct access of struct device
>
> To avoid direct access to the driver_data pointer in struct device, the
> functions dev_get_drvdata() and dev_set_drvdata() should be used.
>
> Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
> ---
> diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/
> infiniband/hw/ehca/ehca_main.c
> index 368311c..5acfb4c 100644
> --- a/drivers/infiniband/hw/ehca/ehca_main.c
> +++ b/drivers/infiniband/hw/ehca/ehca_main.c
> @@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct of_device *dev,
>
>     shca->ofdev = dev;
>     shca->ipz_hca_handle.handle = *handle;
> -   dev->dev.driver_data = shca;
> +   dev_set_drvdata(&dev->dev, shca);
>
>     ret = ehca_sense_attributes(shca);
>     if (ret < 0) {
> @@ -878,7 +878,7 @@ probe1:
>
>  static int __devexit ehca_remove(struct of_device *dev)
>  {
> -   struct ehca_shca *shca = dev->dev.driver_data;
> +   struct ehca_shca *shca = dev_get_drvdata(&dev->dev);
>     unsigned long flags;
>     int ret;
>


From swise at opengridcomputing.com  Tue May 12 09:11:36 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 12 May 2009 11:11:36 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <4A08E7B2.1010907@opengridcomputing.com>
References: <20090424190510.3134.90405.stgit@build.ogc.int>		<49F31A16.2080806@opengridcomputing.com>		<49F4AE86.4090908@opengridcomputing.com>		<49f515a5.1d1e640a.1c82.6677@mx.google.com>		<49F5ED55.1010607@opengridcomputing.com>		<1240855510.8818.9.camel@heimdal.trondhjem.org>		<1240856613.8818.16.camel@heimdal.trondhjem.org>		<49F60845.4010007@opengridcomputing.com>		<1240865214.8818.73.camel@heimdal.trondhjem.org>		<4A08A5C6.7040003@opengridcomputing.com>		<1242082203.1743.11.camel@heimdal.trondhjem.org>		<4A08BF1C.2050204@opengridcomputing.com>		<1242089066.1743.19.camel@heimdal.trondhjem.org>		<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>	<1242092150.16618.15.camel@heimdal.trondhjem.org>
	<4A08E7B2.1010907@opengridcomputing.com>
Message-ID: <4A099FB8.7090603@opengridcomputing.com>

Steve Wise wrote:

 >Trond Myklebust wrote (earlier in this thread):
 >
 > All I should need to know is that I can advertise either dma handles or
 > kernel VAs, and know that I can choose between two functions, say,
 > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to
 > initialise the ib_send_wr structure correctly.


To align more with the rest of the fast_reg API in ib_verbs.h, I propose:

static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr 
*send_wr, dma_addr_t dma);
static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr 
*send_wr, void *kva);

Thoughts?


From swise at opengridcomputing.com  Tue May 12 09:23:31 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 12 May 2009 11:23:31 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <4A099FB8.7090603@opengridcomputing.com>
References: <20090424190510.3134.90405.stgit@build.ogc.int>		<49F31A16.2080806@opengridcomputing.com>		<49F4AE86.4090908@opengridcomputing.com>		<49f515a5.1d1e640a.1c82.6677@mx.google.com>		<49F5ED55.1010607@opengridcomputing.com>		<1240855510.8818.9.camel@heimdal.trondhjem.org>		<1240856613.8818.16.camel@heimdal.trondhjem.org>		<49F60845.4010007@opengridcomputing.com>		<1240865214.8818.73.camel@heimdal.trondhjem.org>		<4A08A5C6.7040003@opengridcomputing.com>		<1242082203.1743.11.camel@heimdal.trondhjem.org>		<4A08BF1C.2050204@opengridcomputing.com>		<1242089066.1743.19.camel@heimdal.trondhjem.org>		<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>	<1242092150.16618.15.camel@heimdal.trondhjem.org>	<4A08E7B2.1010907@opengridcomputing.com>
	<4A099FB8.7090603@opengridcomputing.com>
Message-ID: <4A09A283.3090605@opengridcomputing.com>

Steve Wise wrote:
> Steve Wise wrote:
>
> >Trond Myklebust wrote (earlier in this thread):
> >
> > All I should need to know is that I can advertise either dma handles or
> > kernel VAs, and know that I can choose between two functions, say,
> > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to
> > initialise the ib_send_wr structure correctly.
>
>
> To align more with the rest of the fast_reg API in ib_verbs.h, I propose:
>
> static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr 
> *send_wr, dma_addr_t dma);
> static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr 
> *send_wr, void *kva);
>
> Thoughts?
>
>
uncompiled patch:

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c179318..fb56930 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1940,6 +1940,30 @@ static inline void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
 }
 
 /**
+ * ib_init_fast_reg_iova_start_dma - initializes the iova_start field
+ *   based on a dma address supplied by the user.
+ * @wr: struct ib_send_wr pointer to be initialized
+ * @addr: dma_addr_t value to be used as the iova_start
+ */
+static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr *wr,
+                                                  dma_addr_t addr)
+{
+       wr->wr.fast_reg.iova_start = addr;
+}
+
+/**
+ * ib_init_fast_reg_iova_start_kva - initializes the iova_start field
+ *   based on a kernel virtual address supplied by the user.
+ * @wr: struct ib_send_wr pointer to be initialized
+ * @addr: void * address to be used as the iova_start
+ */
+static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *wr,
+                                                  void *addr)
+{
+       wr->wr.fast_reg.iova_start = (unsigned long)addr;
+}
+
+/**
  * ib_alloc_mw - Allocates a memory window.
  * @pd: The protection domain associated with the memory window.
  */
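
As a user-space illustration of the two helpers proposed above (a sketch
only: mock_send_wr, mock_dma_addr_t and the mock_* function names are
hypothetical stand-ins, not the kernel definitions, and it assumes
iova_start is a 64-bit field), a DMA address is stored by plain
assignment, while a kernel VA is first converted through an integer type
of pointer width. uintptr_t plays the role that unsigned long plays in
the kernel patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; NOT the real ib_verbs.h types. */
typedef uint64_t mock_dma_addr_t;

struct mock_send_wr {
	uint64_t iova_start;	/* assumed 64-bit, like wr.fast_reg.iova_start */
};

/* Store a DMA address: a widening assignment, so no bits are dropped. */
static inline void mock_init_iova_start_dma(struct mock_send_wr *wr,
					    mock_dma_addr_t addr)
{
	assert(wr != NULL);
	wr->iova_start = addr;
}

/*
 * Store a virtual address: convert the pointer to an integer of pointer
 * width first, then widen to 64 bits. This avoids a pointer-to-integer
 * size mismatch on 32-bit builds.
 */
static inline void mock_init_iova_start_kva(struct mock_send_wr *wr,
					    void *addr)
{
	assert(wr != NULL);
	wr->iova_start = (uint64_t)(uintptr_t)addr;
}
```

Calling mock_init_iova_start_kva(&wr, buf) should leave wr.iova_start
equal to the numeric value of buf on both 32-bit and 64-bit builds.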


From sashak at voltaire.com  Tue May 12 10:50:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 20:50:57 +0300
Subject: [ofa-general] Re: [PATCH] Fix 2 formatting diff's from old
	ibqueryerrors.
In-Reply-To: <20090506095114.3893f4aa.weiny2@llnl.gov>
References: <20090506095114.3893f4aa.weiny2@llnl.gov>
Message-ID: <20090512175057.GA27108@sashak.voltaire.com>

On 09:51 Wed 06 May     , Ira Weiny wrote:
> 2 changes I noted in the output from ibqueryerrors.
> 
> "Link Info:" was not being printed when "-r" was used.
> 
> The "header": Errors for 0x<guid> "<node name>"
> 
> Should only be printed when errors are found.
> 
> The following patch cleans those up.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Tue, 28 Apr 2009 14:39:11 -0700
> Subject: [PATCH] Fix 2 formatting diff's from old ibqueryerrors.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 10:52:02 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 20:52:02 +0300
Subject: [ofa-general] Re: [PATCH] Clean up printing of switch heading when
	printing "down links" only.
In-Reply-To: <20090506095303.f11659f1.weiny2@llnl.gov>
References: <20090506095303.f11659f1.weiny2@llnl.gov>
Message-ID: <20090512175202.GB27108@sashak.voltaire.com>

On 09:53 Wed 06 May     , Ira Weiny wrote:
> Another corner case:  If there are no down links on a switch and "-d" is selected then the header for that switch should not be printed.
> 
> Ira
> 
> 
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 30 Apr 2009 13:41:38 -0700
> Subject: [PATCH] Clean up printing of switch heading when printing "down links" only.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 10:55:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 20:55:41 +0300
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number
	from debug level log message
In-Reply-To: <f0e08f230905100346y14352b1eq5033493aadc43f0d@mail.gmail.com>
References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com>
	<f0e08f230905100346y14352b1eq5033493aadc43f0d@mail.gmail.com>
Message-ID: <20090512175541.GC27108@sashak.voltaire.com>

On 06:46 Sun 10 May     , Hal Rosenstock wrote:
> 
> Sasha has been adamant that any device supplied data errors use
> something other than ERROR log level.

But I think that VERBOSE is more appropriate for such cases than just
DEBUG. Another way is to add another "level" for subnet warnings.

Sasha


From sashak at voltaire.com  Tue May 12 10:56:03 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 20:56:03 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_port.c: Remove error number
	from debug level log message
In-Reply-To: <20090507143346.GA1713@comcast.net>
References: <20090507143346.GA1713@comcast.net>
Message-ID: <20090512175603.GD27108@sashak.voltaire.com>

On 10:33 Thu 07 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 11:12:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:12:17 +0300
Subject: [ofa-general] [PATCH] saquery: fix -c arguement
In-Reply-To: <4A06BAC5.40405@voltaire.com>
References: <4A030525.7090209@voltaire.com> <4A06BAC5.40405@voltaire.com>
Message-ID: <20090512181217.GE27108@sashak.voltaire.com>

On 14:30 Sun 10 May     , Doron Shoham wrote:
> set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Tue May 12 11:06:55 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 12 May 2009 14:06:55 -0400
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from
	debug level log message
In-Reply-To: <20090512175541.GC27108@sashak.voltaire.com>
References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com>
	<f0e08f230905100346y14352b1eq5033493aadc43f0d@mail.gmail.com>
	<20090512175541.GC27108@sashak.voltaire.com>
Message-ID: <f0e08f230905121106t747dc25fk35cc87e77eb5a839@mail.gmail.com>

On Tue, May 12, 2009 at 1:55 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 06:46 Sun 10 May     , Hal Rosenstock wrote:
>>
>> Sasha has been adamant that any device supplied data errors use
>> something other than ERROR log level.
>
> But I think that VERBOSE is more appropriate for such cases than
> just DEBUG.

Yes, VERBOSE level is more consistent than DEBUG level with what is
done elsewhere in OpenSM.

-- Hal

> Another way is to add another "level" for subnet warnings.
>
> Sasha
>


From sashak at voltaire.com  Tue May 12 11:15:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:15:04 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_lid_mgr.c bug in opensm LID
	assignment
In-Reply-To: <4A056913.7010700@gmail.com>
References: <4A056913.7010700@gmail.com>
Message-ID: <20090512181504.GF27108@sashak.voltaire.com>

On 14:29 Sat 09 May     , Eli Dorfman (Voltaire) wrote:
>  lid persistent range wrong check
>  used lids were not properly checked which
>  caused duplicate lid assignment in some cases.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 11:22:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:22:57 +0300
Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Reduce host name length
In-Reply-To: <20090512112103.GA7715@comcast.net>
References: <20090512112103.GA7715@comcast.net>
Message-ID: <20090512182257.GG27108@sashak.voltaire.com>

On 07:21 Tue 12 May     , Hal Rosenstock wrote:
> 
> to what's needed (based on NodeDescription length)
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 11:39:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:39:25 +0300
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number
	from debug level log message
In-Reply-To: <f0e08f230905121106t747dc25fk35cc87e77eb5a839@mail.gmail.com>
References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com>
	<f0e08f230905100346y14352b1eq5033493aadc43f0d@mail.gmail.com>
	<20090512175541.GC27108@sashak.voltaire.com>
	<f0e08f230905121106t747dc25fk35cc87e77eb5a839@mail.gmail.com>
Message-ID: <20090512183925.GI27108@sashak.voltaire.com>

On 14:06 Tue 12 May     , Hal Rosenstock wrote:
> 
> Yes, VERBOSE level is more consistent than DEBUG level with what is
> done elsewhere in OpenSM.

Ok, I'm changing to VERBOSE.

Sasha


From hnrose at comcast.net  Tue May 12 11:32:33 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Tue, 12 May 2009 14:32:33 -0400
Subject: [ofa-general] [PATCH] opensm/osm_port.c: Change log level of Invalid
	OP_VLS 0 message to VERBOSE
Message-ID: <20090512183233.GA1113@comcast.net>


Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 17bac73..cb8b153 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 		op_vls = p_subn->opt.max_op_vls;
 
 	if (op_vls == 0) {
-		OSM_LOG(p_log, OSM_LOG_DEBUG,
+		OSM_LOG(p_log, OSM_LOG_VERBOSE,
 			"Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n");
 		op_vls = 1;
 	}


From sashak at voltaire.com  Tue May 12 11:45:42 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:45:42 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_port.c: Change log level of
	Invalid OP_VLS 0 message to VERBOSE
In-Reply-To: <20090512183233.GA1113@comcast.net>
References: <20090512183233.GA1113@comcast.net>
Message-ID: <20090512184542.GJ27108@sashak.voltaire.com>

On 14:32 Tue 12 May     , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Oops, I committed this already in the local branch :) Will use your
version instead.

Applied. Thanks.

Sasha


From sashak at voltaire.com  Tue May 12 11:51:41 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:51:41 +0300
Subject: [ofa-general] Re: [PATCH 0/2] osm_port.c: do not enforce PortInfo
	update if max_op_vls = 0
In-Reply-To: <4A030465.90009@voltaire.com>
References: <4A00386E.2050300@voltaire.com>
	<f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
	<4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A030465.90009@voltaire.com>
Message-ID: <20090512185141.GK27108@sashak.voltaire.com>

On 18:55 Thu 07 May     , Doron Shoham wrote:
> do not enforce PortInfo update if max_op_vls = 0
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_port.c   |    2 +-
>  opensm/opensm/osm_subnet.c |    8 ++++++++
>  2 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 4d1bbf2..8bf1767 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -383,7 +383,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  		op_vls = 1;
>  	}
>  	/* support user limitation of max_op_vls */
> -	if (op_vls > p_subn->opt.max_op_vls)
> +	if (p_subn->opt.max_op_vls && op_vls > p_subn->opt.max_op_vls)

Then you likely want to drop '0' value from the comment in config file
template (diff below), no?

Sasha

>  		op_vls = p_subn->opt.max_op_vls;
>  
>  
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..71fc7a0 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>  		"# switch port connected to a CA or router port\n"
>  		"leaf_head_of_queue_lifetime 0x%02x\n\n"
>  		"# Limit the maximal operational VLs\n"
> +		"# Virtual Lanes operational on this port\n"
> +		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
> +		"#    0: No change; valid only on Set()\n"
> +		"#    1: VL0\n"
> +		"#    2: VL0, VL1\n"
> +		"#    3: VL0 - VL3\n"
> +		"#    4: VL0 - VL7\n"
> +		"#    5: VL0 - VL14\n"
>  		"max_op_vls %u\n\n"
>  		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
>  		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
> -- 
> 1.5.4
> 


From sashak at voltaire.com  Tue May 12 11:55:20 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 21:55:20 +0300
Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0
	before max_op_vls comparison
In-Reply-To: <4A067905.5060401@gmail.com>
References: <f0e08f230905050614n149dfc53yd6d0f03e5f2dd4c@mail.gmail.com>
	<4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
	<4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com>
	<4A067905.5060401@gmail.com>
Message-ID: <20090512185520.GL27108@sashak.voltaire.com>

On 09:49 Sun 10 May     , Eli Dorfman (Voltaire) wrote:
> Doron Shoham wrote:
> > check if op_vls = 0 before max_op_vls comparison
> > 
> > Signed-off-by: Doron Shoham <dorons at voltaire.com>
> > ---
> >  opensm/opensm/osm_port.c |    9 +++++----
> >  1 files changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> > index 2e6c642..4d1bbf2 100644
> > --- a/opensm/opensm/osm_port.c
> > +++ b/opensm/opensm/osm_port.c
> > @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
> >  	} else
> >  		op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
> >  
> > -	/* support user limitation of max_op_vls */
> > -	if (op_vls > p_subn->opt.max_op_vls)
> > -		op_vls = p_subn->opt.max_op_vls;
> > -
> >  	if (op_vls == 0) {
> > +		/* for non compliant implementations */	
> >  		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> 
> I think that level should be OSM_LOG_ERROR.

OSM_LOG_VERBOSE is better - it is not OpenSM error.

And also - don't mix two ideas in a single patch :) .

Sasha


From sashak at voltaire.com  Tue May 12 12:00:36 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 22:00:36 +0300
Subject: [ofa-general] Re: [PATCH 1/2] osm_port.c: check if op_vls = 0 before
	max_op_vls comparison
In-Reply-To: <4A068D87.6040801@voltaire.com>
References: <4A0043B0.3030400@gmail.com>
	<f0e08f230905050659ld9f24acvf03a0bc1e0cdb827@mail.gmail.com>
	<20090506112135.GG10145@sk>
	<f0e08f230905060429m162dd8e5h5a9744c35d81033@mail.gmail.com>
	<4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk>
	<4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com>
	<4A067905.5060401@gmail.com> <4A068D87.6040801@voltaire.com>
Message-ID: <20090512190036.GM27108@sashak.voltaire.com>

On 11:17 Sun 10 May     , Doron Shoham wrote:
> check if op_vls = 0 before max_op_vls comparison
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>

Applied. Thanks. See comments below.

> ---
>  opensm/opensm/osm_port.c |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
> index 2e6c642..41b67ad 100644
> --- a/opensm/opensm/osm_port.c
> +++ b/opensm/opensm/osm_port.c
> @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
>  	} else
>  		op_vls = ib_port_info_get_op_vls(&p_physp->port_info);
>  
> -	/* support user limitation of max_op_vls */
> -	if (op_vls > p_subn->opt.max_op_vls)
> -		op_vls = p_subn->opt.max_op_vls;
> -
>  	if (op_vls == 0) {
> -		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
> +		/* for non compliant implementations */	
                                                     ^^^^
Please take care not to introduce trailing whitespace.

> +		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: "

Log level is OSM_LOG_VERBOSE now - merged with Hal's patches.
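
For reference, the ordering discussed in this thread can be sketched as a
standalone function. This is not the OpenSM code: calc_op_vls_sketch is a
hypothetical name, logging is omitted, the 4-bit range check is an
assumption about the PortInfo field width, and the "max_op_vls &&" guard
comes from the other patch in the series, which may or may not be applied
in this form:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not OpenSM code: compute the operational VLs
 * value from what the port reports (op_vls) and the user limit
 * (max_op_vls, where 0 is taken to mean "no limit").
 */
static uint8_t calc_op_vls_sketch(uint8_t op_vls, uint8_t max_op_vls)
{
	/* Assumption: both values come from a 4-bit field. */
	assert(op_vls <= 15 && max_op_vls <= 15);

	/* A reported value of 0 is invalid; force it to 1 (VL0 only). */
	if (op_vls == 0)
		op_vls = 1;

	/* Apply the user limit only when one is actually configured. */
	if (max_op_vls && op_vls > max_op_vls)
		op_vls = max_op_vls;

	return op_vls;
}
```

With this ordering, max_op_vls = 0 leaves a port that reports 5 at 5,
whereas the old order (clamp first, then the zero check) would have
forced it to 1.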

Sasha


From sashak at voltaire.com  Tue May 12 12:17:26 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 May 2009 22:17:26 +0300
Subject: [ofa-general] Re: [RFC][PATCH] ibnetdiscover: remove report of max
	hops discovered.
In-Reply-To: <20090506180140.6213971e.weiny2@llnl.gov>
References: <20090504151005.9a565bc5.weiny2@llnl.gov>
	<f0e08f230905050338m4d11c0e9j205c514468e856ef@mail.gmail.com>
	<1241543312.18144.18.camel@auk31.llnl.gov>
	<f0e08f230905051125k4ca6ab45q58ec46e9385df9ba@mail.gmail.com>
	<20090506180140.6213971e.weiny2@llnl.gov>
Message-ID: <20090512191726.GN27108@sashak.voltaire.com>

On 18:01 Wed 06 May     , Ira Weiny wrote:
> The number reported as "max hops" from ibnetdiscover can change depending on
> the algorithm used to discover the fabric.  As Hal says in the message below
> using this number is therefore dangerous.
> 
> If no one is currently using this number I propose the patch below which
> removes the "max hops discovered" from the output.

I don't know of any users; it is really just additional ibnetdiscover info
(similar to the date/time printout). But it was nice to have, since it
provides some idea of what ibnetdiscover did. If you want to remove it
anyway, at least print it in verbose mode ('-v').

Sasha


From arlin.r.davis at intel.com  Tue May 12 12:21:23 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 12 May 2009 12:21:23 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
Message-ID: <E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>


>Hi all,
>    I'm using libibverbs to build a cluster memory pool, and using a TCP/IP
>handshake to exchange memory information and establish the connection
>before the IB communication. I found that this process costs a lot
>of time, 100ms on a 1GEth LAN, so I want to use rdma_cm or ib_ucm to
>handle the establishment. But I can't find sample code or API
>documentation; is there anything I missed?
>    BTW, how is communication established in current OFED? Any comparison
>or suggestion is appreciated; that will help me a lot.
>

What scale are you targeting?

Your single connection number seems high. For a connection
(socket connect, exchanging QP info, private data, qp modify)
using uDAPL socket cm versus rdma_cm I get:

socket_cm on 1Ge == ~900us
socket_cm on IPoIB (mlx4 ddr) == ~400us
rdma_cm on IB (mlx4 ddr) == ~2200us

As you can see, the path record queries via rdma_cm add 
a substantial penalty. With larger scale clusters this
really starts to hurt.

You can look at uDAPL (dapl/openib_cma and dapl/openib_scm) 
source for examples of a socket cm implementation vs rdma_cm. 
With the socket cm version we ran up to 14,400 cores with 
no problems using Intel MPI. However, with rdma_cm we 
had problems reaching 1000 cores due to IPoIB ARP storms and
SA path record query issues. If someone would step up and 
provide a scalable SA caching solution in OFED then rdma_cm 
could possibly work for us again. Any takers? :^)

-arlin


From or.gerlitz at gmail.com  Tue May 12 13:11:46 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 12 May 2009 23:11:46 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
Message-ID: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>

Davis, Arlin R <arlin.r.davis at intel.com> wrote:
> For a connection (socket connect, exchanging QP info, private data, qp modify)
> using uDAPL socket cm versus rdma_cm I get:
> socket_cm on 1Ge == ~900us
> socket_cm on IPoIB (mlx4 ddr) == ~400us
> rdma_cm on IB (mlx4 ddr) == ~2200us
> As you can see, the path record queries via rdma_cm add a substantial penalty.

Hi Arlin,

Just to make sure we're on the same page: both IPoIB and the RDMA-CM
use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
rdma_resolve_route). In detail, the flows look like this:

with the rdma-cm:

rdma_resolve_addr
          A --> *  ARP request (broadcast)
          B --> A ARP reply (unicast, before that B does SA path query)
rdma_resolve_route
          A does SA path query
rdma_connect
          A --> B CM REQ
          B --> A CM REP
          A --> B CM RTU

with the socket cm / ipoib:

socket connect
          A --> *  ARP request (broadcast)
          B --> A ARP reply (unicast, before that B does SA path query)
          A --> B TCP SYN (unicast, A does SA path query!)
          B --> A TCP SYN + ACK
          A --> B TCP ACK

Looking at the differences between the flows, we can see that --both--
flows have --two-- path queries, so the 400us vs 2200us difference
can't be related to that. So, is it possible that you have counted
rdma_create_qp in the rdma-cm accounting and didn't count
ibv_create_qp in the scm accounting?

Or.


From sean.hefty at intel.com  Tue May 12 13:53:37 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 12 May 2009 13:53:37 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
	<15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
Message-ID: <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>

>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries

But ipoib caches its path records...

- Sean


From arlin.r.davis at intel.com  Tue May 12 14:23:37 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 12 May 2009 14:23:37 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
	<15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
Message-ID: <E3280858FA94444CA49D2BA02341C9834AA713AB@orsmsx506.amr.corp.intel.com>

 
>Davis, Arlin R <arlin.r.davis at intel.com> wrote:
>> For a connection (socket connect, exchanging QP info, private data, qp modify)
>> using uDAPL socket cm versus rdma_cm I get:
>> socket_cm on 1Ge == ~900us
>> socket_cm on IPoIB (mlx4 ddr) == ~400us
>> rdma_cm on IB (mlx4 ddr) == ~2200us
>> As you can see, the path record queries via rdma_cm add a substantial penalty.
>
>Hi Arlin,
>
>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>rdma_resolve_route), going into details, things look like:

I am running IPoIB connected so I assume there is no path query 
and I see no difference in IPoIB unconnected mode so I also assume 
it caches path records during ARP processing. Can someone confirm? 

ARP cache is also hit in all these cases so you can take 
ARP request/reply out. However, with rdma_cm we actually 
have to pick up the RDMA_CM_EVENT_ADDR_RESOLVED (arp) event 
before moving on to the rdma_resolve_route (path record), 
and then wait for RDMA_CM_EVENT_ROUTE_RESOLVED event 
before moving on to the rdma_connect call, and then 
finally wait for RDMA_CM_EVENT_ESTABLISHED. Do you start
to get the picture of where my time goes? Not only do
we have path record query delays, we also have three-step event
processing (waiting/waking on each) just to get connected.

My measurements are on top of uDAPL so everything is equal.
I simply added some timers to dtest around connect and 
wait for connection event:

start_timer
dat_ep_connect()
dat_evd_wait()
stop_timer
	
For example (client side):
		
eth0 socket_cm:  dtest -P ofa-v2-mlx4_0-1 -h cst-55-eth0 -t 
IPoIB socket_cm: dtest -P ofa-v2-mlx4_0-1 -h cst-55-ib0 -t
rdma_cm:         dtest -P ofa-v2-ib0 -h cst-55-ib0 -t
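
The timer placement described above can be sketched in plain C. This is
not the dtest code: time_connect_us and fake_connect are hypothetical
names, and the callback stands in for the dat_ep_connect() plus
dat_evd_wait() pair so the sketch runs without any hardware:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Return the elapsed wall time, in microseconds, of one call to
 * connect_and_wait(ctx). The callback is a hypothetical stand-in for
 * "dat_ep_connect() followed by dat_evd_wait()".
 */
static int64_t time_connect_us(void (*connect_and_wait)(void *), void *ctx)
{
	struct timespec start, stop;
	int64_t ns;

	assert(connect_and_wait != NULL);

	clock_gettime(CLOCK_MONOTONIC, &start);	/* start_timer */
	connect_and_wait(ctx);
	clock_gettime(CLOCK_MONOTONIC, &stop);	/* stop_timer */

	ns = (int64_t)(stop.tv_sec - start.tv_sec) * 1000000000LL +
	     (int64_t)(stop.tv_nsec - start.tv_nsec);
	return ns / 1000;
}

/* Placeholder workload: only records that it was called. */
static void fake_connect(void *ctx)
{
	*(int *)ctx += 1;
}
```

Because the timer wraps both the connect call and the wait for the
connection event, everything the CM does in between is included in the
measured time.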


-arlin

From or.gerlitz at gmail.com  Tue May 12 14:32:38 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 13 May 2009 00:32:38 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
	<15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
	<89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>
Message-ID: <15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>

>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries

> But ipoib caches its path records...

Yes, of course. But to start with, let's analyze the case of each node
running --one-- rank and then take it from there to the case where
each node runs C ranks.

Or.


From or.gerlitz at gmail.com  Tue May 12 14:50:02 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 13 May 2009 00:50:02 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <E3280858FA94444CA49D2BA02341C9834AA713AB@orsmsx506.amr.corp.intel.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>
	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>
	<15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
	<E3280858FA94444CA49D2BA02341C9834AA713AB@orsmsx506.amr.corp.intel.com>
Message-ID: <15ddcffd0905121450r44d43c45vc9bbdc88e5ba4557@mail.gmail.com>

Davis, Arlin R <arlin.r.davis at intel.com> wrote:
>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>>rdma_resolve_route), going into details, things look like:

> I am running IPoIB connected so I assume there is no path query
> and I see no difference in IPoIB unconnected mode so I also assume
> it caches path records during ARP processing. Can someone confirm?

Arlin,

Both datagram and connected mode issue path queries (it's the way IB
works). Datagram mode uses the IB UD (Unreliable Datagram)
transport, and once the path is resolved it creates an IB AH (Address
Handle), which is used in conjunction with the UD QP. Connected
mode uses the IB RC (Reliable Connection) transport, so the path info is
used to establish its connection through the IB CM.

> ARP cache is also hit in all these cases so you can take ARP request/reply out.

I am not with you: by "ARP cache" I assume you refer to the networking
stack neighbour table, correct? So this cache has the entries because
the IPoIB network was also used to spawn the job?

> However, with rdma_cm we actually have to pick up the ADDR_RESOLVED (arp)
> event before moving on to the rdma_resolve_route (path record), and then wait for
> ROUTE_RESOLVED event before moving on to the rdma_connect call, and then
> finally wait for ESTABLISHED. You start to get the picture of where my time goes?
> Not only do we have path record query delays we have a 3 step event
> processing (waiting/waking on each) just to get connected.

Yes, this sounds like a potentially big difference from the TCP case.
Let's see how many kernel --> user events we have in both methods:

rdma-cm active side
-----------------------
addr-resolved
route-resolved
established

rdma-cm passive side
--------------------------
connection-request
established

scm active side
------------------
connected

scm passive side
--------------------
connection request
connected

In the rdma-cm framework there are three kernel --> user
transitions/events for the active side and two for the passive side,
whereas in the scm framework there are two for the passive side but only
one for the active side. Also counting user --> kernel transitions, on
the rdma-cm active side there are three vs. only one in the scm. This is
probably where things make a difference. I believe it could be
fairly easy to have the kernel rdma ucm module do two successive calls
(route resolve and connect) once the local address is resolved, since
at that point the user space consumer can create their QP, etc.
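
To make the tally concrete, here is a small sketch that just counts the
event lists above. It is a toy model under stated assumptions: the enum
labels and function names (SIDE_*, wakeups, extra_active_wakeups) are
made up for illustration and are not librdmacm or DAPL identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative labels only; these are not librdmacm or DAPL constants. */
enum side {
    SIDE_RDMA_CM_ACTIVE,
    SIDE_RDMA_CM_PASSIVE,
    SIDE_SCM_ACTIVE,
    SIDE_SCM_PASSIVE,
    SIDE_COUNT
};

/* Kernel --> user events per side, copied from the lists in this mail. */
static const char *const events[SIDE_COUNT][4] = {
    [SIDE_RDMA_CM_ACTIVE]  = { "addr-resolved", "route-resolved",
                               "established", NULL },
    [SIDE_RDMA_CM_PASSIVE] = { "connection-request", "established", NULL },
    [SIDE_SCM_ACTIVE]      = { "connected", NULL },
    [SIDE_SCM_PASSIVE]     = { "connection-request", "connected", NULL },
};

/* Number of events a side has to wait for before it is connected. */
int wakeups(enum side s)
{
    int idx = (int)s;
    int n = 0;

    assert(idx >= 0 && idx < SIDE_COUNT);
    while (events[idx][n] != NULL)
        n++;
    return n;
}

/* Extra wakeups the rdma-cm active side pays compared with scm. */
int extra_active_wakeups(void)
{
    return wakeups(SIDE_RDMA_CM_ACTIVE) - wakeups(SIDE_SCM_ACTIVE);
}
```

In this model the passive sides come out equal (two events each), and the
whole difference sits on the active side.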

> Not only do we have path record query delays

So we agree that it's path query --delays-- and for one rank per node
it's the same # of path queries? (Sean)

Or.


From sean.hefty at intel.com  Tue May 12 14:55:12 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 12 May 2009 14:55:12 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>
References: <a53064263780dafbe3b9744a706153d0.squirrel@webmail.ict.ac.cn>	<E3280858FA94444CA49D2BA02341C9834AA71180@orsmsx506.amr.corp.intel.com>	<15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>	<89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>
	<15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>
Message-ID: <2510151DA1D14A058528698BAE8C08C5@amr.corp.intel.com>

>Yes, of-course. But, to start with, lets analyze the case of each node
>running --one-- rank and then take it from there to the case where
>each node runs C ranks.

The caching is independent of running MPI though.  To get a fair comparison,
you'd probably have to reboot the entire cluster before running the test and
ensure that no other communication between the nodes occurs over ipoib.

For myself, I'm not sure that the tests are the same.  The DAPL providers create
and modify the QPs differently.  I'd need to walk through the code to see
whether QP creation time is included and verify that the QP modify calls are the
same.

As for responding to the initial question, using sockets with hard-coded values
seems to be the most common way to establish IB connections at scale, though I
would guess that using the ib_cm with hard-coded values would work about the
same.
 
- Sean


From ralph.campbell at qlogic.com  Tue May 12 15:10:21 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 12 May 2009 15:10:21 -0700
Subject: [ofa-general] qperf: destroy QPs before destroying any other objects
Message-ID: <1242166221.3504.14.camel@chromite.mv.qlogic.com>

The QP contains references to the protection domain (PD), memory
regions (MR), completion queues (CQ), address handles (AH), etc.
The QP should be destroyed before any other objects are destroyed
so that the referenced objects are not busy.

Signed-off-by: Ralph Campbell <ralph.campbell at qlogic.com>

diff --git a/src/rdma.c b/src/rdma.c
index 845c35f..492d240 100644
--- a/src/rdma.c
+++ b/src/rdma.c
@@ -1577,6 +1577,10 @@ show_node_info(DEVICE *dev)
 static void
 rd_close(DEVICE *dev)
 {
+    if (Req.use_cm)
+        cm_close(dev);
+    else
+        ib_close(dev);
     if (dev->ah)
         ibv_destroy_ah(dev->ah);
     if (dev->cq)
@@ -1585,10 +1589,6 @@ rd_close(DEVICE *dev)
         ibv_dealloc_pd(dev->pd);
     if (dev->channel)
         ibv_destroy_comp_channel(dev->channel);
-    if (Req.use_cm)
-        cm_close(dev);
-    else
-        ib_close(dev);
     rd_mrfree(dev);
 
     memset(dev, 0, sizeof(*dev));
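
The ordering argument can be sketched with a toy reference-count model.
This is an illustration only, not qperf or libibverbs code: struct obj,
struct toy_qp and all function names below are hypothetical, and the
EBUSY return is an assumption about how a still-referenced object
behaves.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a PD or CQ; NOT the libibverbs API. */
struct obj {
    int refs;   /* live objects that still point at this one */
    int alive;
};

/* Toy QP that holds references to a PD and a CQ. */
struct toy_qp {
    struct obj *pd;
    struct obj *cq;
};

static void obj_init(struct obj *o)
{
    o->refs = 0;
    o->alive = 1;
}

static void qp_create(struct toy_qp *qp, struct obj *pd, struct obj *cq)
{
    assert(pd->alive && cq->alive);
    qp->pd = pd;
    qp->cq = cq;
    pd->refs++;
    cq->refs++;
}

/* Dropping the QP releases its references. */
static void qp_destroy(struct toy_qp *qp)
{
    qp->pd->refs--;
    qp->cq->refs--;
}

/* A referenced object is busy and cannot be destroyed (assumed EBUSY). */
static int obj_destroy(struct obj *o)
{
    if (o->refs > 0)
        return EBUSY;
    o->alive = 0;
    return 0;
}

/* Old rd_close() order: CQ and PD first, QP last. Returns failed destroys. */
int teardown_qp_last(void)
{
    struct obj pd, cq;
    struct toy_qp qp;
    int failed = 0;

    obj_init(&pd);
    obj_init(&cq);
    qp_create(&qp, &pd, &cq);
    failed += obj_destroy(&cq) != 0;
    failed += obj_destroy(&pd) != 0;
    qp_destroy(&qp);
    return failed;
}

/* Patched order: QP first, then everything it referenced. */
int teardown_qp_first(void)
{
    struct obj pd, cq;
    struct toy_qp qp;
    int failed = 0;

    obj_init(&pd);
    obj_init(&cq);
    qp_create(&qp, &pd, &cq);
    qp_destroy(&qp);
    failed += obj_destroy(&cq) != 0;
    failed += obj_destroy(&pd) != 0;
    return failed;
}
```

In the model, the old order leaves both the CQ and the PD undestroyed,
while the patched order tears everything down cleanly.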


From vlad at lists.openfabrics.org  Wed May 13 03:22:28 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 13 May 2009 03:22:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090513-0200 daily build status
Message-ID: <20090513102228.A11F0E6159A@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From johann at georgex.org  Tue May 12 18:55:03 2009
From: johann at georgex.org (johann at georgex.org)
Date: Tue, 12 May 2009 18:55:03 -0700
Subject: [ofa-general] Re: qperf: destroy QPs before destroying any other
	objects
In-Reply-To: <1242166221.3504.14.camel@chromite.mv.qlogic.com>
References: <1242166221.3504.14.camel@chromite.mv.qlogic.com>
Message-ID: <20090513015503.GA30869@georgex.org>

Ralph,

I've applied the patch and have committed it to the OFED git
repository.  Let me know if there is anything else I need to
do.

Johann



From weiny2 at llnl.gov  Wed May 13 09:30:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 13 May 2009 09:30:20 -0700
Subject: [ofa-general] [PATCH V2] ibnetdiscover: only report max hops
 discovered when requested
Message-ID: <20090513093020.f85f2a0a.weiny2@llnl.gov>

Added a "-m" flag to report this information if the user wants it.  I also
changed the text of the message, which now says "Reported max hops
discovered".  I don't know if we want to change that text to something
else, but I wanted to indicate that this number is not constant and may
change.  This is true not just if you change the discovery algorithm but
also if you run from different nodes.

Thoughts?
Ira


From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 6 May 2009 17:56:23 -0700
Subject: [PATCH] ibnetdiscover: only report max hops discovered when requested


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibnetdiscover.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 1799618..98ff1e4 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -65,6 +65,8 @@ static FILE *f;
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
 
+static int max_hops = 0;
+
 /**
  * Define our own conversion functions to maintain compatibility with the old
  * ibnetdiscover which did not use the ibmad conversion functions.
@@ -448,7 +450,8 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 	struct iter_user_data iter_user_data;
 
 	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
-	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
+	if (max_hops)
+		fprintf(f, "# Reported max hops discovered: %d\n", fabric->maxhops_discovered);
 	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
 		fabric->from_node->guid,
 		mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F));
@@ -628,6 +631,9 @@ static int process_opt(void *context, int ch, char *optarg)
 	case 'p':
 		ports_report = 1;
 		break;
+	case 'm':
+		max_hops = 1;
+		break;
 	default:
 		return -1;
 	}
@@ -651,6 +657,7 @@ int main(int argc, char **argv)
 		{ "Router_list", 'R', 0, NULL, "list of connected routers" },
 		{ "node-name-map", 1, 1, "<file>", "node name map file" },
 		{ "ports", 'p', 0, NULL, "obtain a ports report" },
+		{ "max_hops", 'm', 0, NULL, "report max hops discovered by the library" },
 		{ 0 }
 	};
 	char usage_args[] = "[topology-file]";
-- 
1.5.4.5


From jsquyres at cisco.com  Wed May 13 10:49:27 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 13:49:27 -0400
Subject: [ofa-general] RPM version numbers are the same
Message-ID: <3DAD41A5-1435-4D2D-8D07-7B5FD8038E1E@cisco.com>

Why are the RPM version numbers the same between rc5 and the current  
1.4.1 nightlies?

-- 
Jeff Squyres
Cisco Systems


From roel.kluin at gmail.com  Wed May 13 11:33:43 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Wed, 13 May 2009 20:33:43 +0200
Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020() and
	init_serdes()
Message-ID: <4A0B1287.4060603@gmail.com>

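With a postfix increment, i ends up one past the limit (10001 or 5001)
when the loop times out. If the condition is met on the last iteration,
i equals the limit, so testing i >= limit displays the error message
even though the poll succeeded.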

Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
---
This should almost never occur in practice.
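
For illustration, the counter behaviour can be reproduced outside the
driver. The sketch below is a userspace model, not driver code:
final_counter() and timed_out() are hypothetical names, and a plain
counter stands in for mdelay(1) and the register read.

```c
#include <assert.h>

/*
 * Userspace model of the polling loops in nes_hw.c (hypothetical helper,
 * not part of the driver). The "hardware" becomes ready after
 * ready_after_delays delays; limit is the retry bound (10000 or 5000).
 * Returns the value of i once the loop has exited.
 */
int final_counter(int ready_after_delays, int limit)
{
    int delays = 0;
    int i = 0;

    assert(ready_after_delays >= 0 && limit >= 0);

    /* Same shape as: while (not_ready() && i++ < limit) mdelay(1); */
    while ((delays < ready_after_delays) && i++ < limit)
        delays++;   /* stands in for mdelay(1) */

    return i;
}

/* Timeout test as changed by this patch: only limit + 1 means "gave up". */
int timed_out(int i, int limit)
{
    return i > limit;
}
```

When the condition clears on the very last allowed poll, the short-circuit
&& skips the increment and i stays at the limit; when it never clears, the
final failed comparison still bumps i to limit + 1. Only the > test tells
the two cases apart.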

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index b832a7b..4a84d02 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -667,7 +667,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 		i = 0;
 		while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000)
 			mdelay(1);
-		if (i >= 10000) {
+		if (i > 10000) {
 			nes_debug(NES_DBG_INIT, "Did not see full soft reset done.\n");
 			return 0;
 		}
@@ -675,7 +675,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 		i = 0;
 		while ((nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS) != 0x80) && i++ < 10000)
 			mdelay(1);
-		if (i >= 10000) {
+		if (i > 10000) {
 			printk(KERN_ERR PFX "Internal CPU not ready, status = %02X\n",
 			       nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS));
 			return 0;
@@ -701,7 +701,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	i = 0;
 	while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000)
 		mdelay(1);
-	if (i >= 10000) {
+	if (i > 10000) {
 		nes_debug(NES_DBG_INIT, "Did not see port soft reset done.\n");
 		return 0;
 	}
@@ -711,7 +711,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0)
 			& 0x0000000f)) != 0x0000000f) && i++ < 5000)
 		mdelay(1);
-	if (i >= 5000) {
+	if (i > 5000) {
 		nes_debug(NES_DBG_INIT, "Serdes 0 not ready, status=%x\n", u32temp);
 		return 0;
 	}
@@ -722,7 +722,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 		while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1)
 				& 0x0000000f)) != 0x0000000f) && i++ < 5000)
 			mdelay(1);
-		if (i >= 5000) {
+		if (i > 5000) {
 			nes_debug(NES_DBG_INIT, "Serdes 1 not ready, status=%x\n", u32temp);
 			return 0;
 		}
@@ -792,7 +792,7 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count,
 		while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0)
 				& 0x0000000f)) != 0x0000000f) && i++ < 5000)
 			mdelay(1);
-		if (i >= 5000) {
+		if (i > 5000) {
 			nes_debug(NES_DBG_PHY, "Init: serdes 0 not ready, status=%x\n", u32temp);
 			return 1;
 		}
@@ -815,7 +815,7 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count,
 			while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1)
 				& 0x0000000f)) != 0x0000000f) && (i++ < 5000))
 				mdelay(1);
-			if (i >= 5000) {
+			if (i > 5000) {
 				printk("%s: Init: serdes 1 not ready, status=%x\n", __func__, u32temp);
 				/* return 1; */
 			}


From jsquyres at cisco.com  Wed May 13 11:34:04 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:34:04 -0400
Subject: [ofa-general] /dev/infiniband/rdma_cm not created
Message-ID: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>

I'm running on rhel4u6 with the 1.4.1 nightly from last night and  
sometimes /dev/infiniband/rdma_cm is not created.  I can see its entry  
in /etc/udev/rules.d/90-ib.rules:

KERNEL="umad*", NAME="infiniband/%k"
KERNEL="issm*", NAME="infiniband/%k"
KERNEL="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"

But only some of these are created:

[11:29] svbu-mpi005:/etc/udev/rules.d % l /dev/infiniband/
total 0
drwxr-xr-x   2 root root      120 May 13 02:39 ./
drwxr-xr-x  10 root root     5740 May 13 09:39 ../
crw-------   1 root root 231,  64 May 13 02:39 issm0
crw-------   1 root root 231,   0 May 13 02:39 umad0
crw-rw-rw-   1 root root 231, 192 May 13 02:39 uverbs0
crw-rw-rw-   1 root root 231, 193 May 13 02:39 uverbs1
[11:29] svbu-mpi005:/etc/udev/rules.d %

I have both an IB HCA and an iWARP RNIC in this server:

hca_id:	mthca0
	fw_ver:				1.2.917
	node_guid:			0005:ad00:0008:bd60
	sys_image_guid:			0005:ad00:0100:d050
	vendor_id:			0x05ad
	vendor_part_id:			25204
	hw_ver:				0xA0
	board_id:			MT_03B0120002
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		2048 (4)
			active_mtu:		2048 (4)
			sm_lid:			2
			port_lid:		34
			port_lmc:		0x00

hca_id:	nes0
	node_guid:			0012:5502:b58c:0000
	sys_image_guid:			0012:5502:b58c:0000
	vendor_id:			0x1255
	vendor_part_id:			256
	hw_ver:				0x5
	board_id:			NES020 Board ID
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		2048 (4)
			active_mtu:		2048 (4)
			sm_lid:			0
			port_lid:		1
			port_lmc:		0x00

I don't see any obvious errors occurring in syslog or dmesg.

What could cause this failure?

-- 
Jeff Squyres
Cisco Systems


From robert.j.woodruff at intel.com  Wed May 13 11:39:15 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 13 May 2009 11:39:15 -0700
Subject: [ofa-general] RE: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>

Is the driver loaded? I.e., do an /sbin/lsmod to see.

Also, are there any messages that would indicate a
problem when you do a dmesg?




From jsquyres at cisco.com  Wed May 13 11:54:55 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:54:55 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
Message-ID: <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>

On May 13, 2009, at 2:39 PM, Woodruff, Robert J wrote:

> Is the driver loaded ? ie., do an /sbin/lsmod to see.
>

Ah ha -- no, it is not:

[11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
[11:51] svbu-mpi005:/etc/udev/rules.d %

What would cause it to not be loaded?  I *assumed* (but didn't check)  
that it is loaded as part of OFED's /etc/init.d/openibd.  Is that  
correct?

> Also are there any messages that would indicate a
> problem when you do a dmesg.
>

As I indicated in my first mail :-), no.


-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Wed May 13 11:57:35 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:57:35 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
	<94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
Message-ID: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>

On May 13, 2009, at 2:54 PM, Jeff Squyres wrote:

> [11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
> [11:51] svbu-mpi005:/etc/udev/rules.d %
>
> What would cause it to not be loaded?  I *assumed* (but didn't  
> check) that it is loaded as part of OFED's /etc/init.d/openibd.  Is  
> that correct?


FWIW, I see the following in /etc/infiniband/openibd.conf:

# Start HCA driver upon boot
ONBOOT=yes

#...

# Load RDMA_CM module
RDMA_CM_LOAD=yes

-- 
Jeff Squyres
Cisco Systems


From arlin.r.davis at intel.com  Wed May 13 12:03:13 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Wed, 13 May 2009 12:03:13 -0700
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
	<94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
	<5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
Message-ID: <E3280858FA94444CA49D2BA02341C9834AE939EF@orsmsx506.amr.corp.intel.com>

 
>FWIW, I see the following in /etc/infiniband/openibd.conf:
>
>
># Load RDMA_CM module
>RDMA_CM_LOAD=yes
>

Is RDMA_UCM_LOAD=yes?

What do you see with "modinfo rdma_cm rdma_ucm"?

From robert.j.woodruff at intel.com  Wed May 13 12:12:33 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 13 May 2009 12:12:33 -0700
Subject: [ofa-general] RE: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
	<94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
	<5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com>

Check to see if some other driver failed to load.
I think I have seen before that if another driver
fails to load, the start script bails out and
does not load the other drivers.

Perhaps try doing a /etc/init.d/openibd restart 
manually to see if something is failing to load. 



From jsquyres at cisco.com  Wed May 13 12:12:56 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 15:12:56 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <E3280858FA94444CA49D2BA02341C9834AE939EF@orsmsx506.amr.corp.intel.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com><94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
	<5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
	<E3280858FA94444CA49D2BA02341C9834AE939EF@orsmsx506.amr.corp.intel.com>
Message-ID: <583F9D06-1DDD-4666-B174-BFB16E10B5B3@cisco.com>

On May 13, 2009, at 3:03 PM, Davis, Arlin R wrote:

> >FWIW, I see the following in /etc/infiniband/openibd.conf:
> >
> >
> ># Load RDMA_CM module
> >RDMA_CM_LOAD=yes
>
> is RDMA_UCM_LOAD=yes ?
>

Yes, sorry I didn't see that one first time around:

# Load RDMA_UCM module
RDMA_UCM_LOAD=yes

> What do you see with "modinfo rdma_cm rdma_ucm" ?

[root@svbu-mpi055 ~]# modinfo rdma_cm rdma_ucm
filename:       /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_cm.ko
parm:           cma_response_timeout:CMA_CM_RESPONSE_TIMEOUT default=20
parm:           unify_tcp_port_space:Unify the host TCP and RDMA port space allocation (default=0)
parm:           tavor_quirk:Tavor performance quirk: limit MTU to 1K if > 0
license:        Dual BSD/GPL
description:    Generic RDMA CM Agent
author:         Sean Hefty
depends:        ib_addr,ib_cm,iw_cm,ib_core,ib_sa
vermagic:       2.6.9-67.ELsmp SMP gcc-3.4
filename:       /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_ucm.ko
license:        Dual BSD/GPL
description:    RDMA Userspace Connection Manager Access
author:         Sean Hefty
depends:        rdma_cm,ib_uverbs,ib_core,rdma_cm
vermagic:       2.6.9-67.ELsmp SMP gcc-3.4
[root@svbu-mpi055 ~]#


-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Wed May 13 12:18:50 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 15:18:50 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
	<94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
	<5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
	<382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com>
Message-ID: <7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com>

On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:

> Check to see if some other driver failed to load.
> I think I have seen before that if another driver
> fails to load, the start script bails out and
> does not load the other drivers.
>
> Perhaps try doing a /etc/init.d/openibd restart
> manually to see if something is failing to load.
>

Weird -- doing it manually shows no problem:

[root@svbu-mpi055 ~]# /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]
Setting up InfiniBand network interfaces:
Bringing up interface ib0:                                 [  OK  ]
Bringing up interface ib1:                                 [  OK  ]
Setting up service network . . .                           [  done  ]
[root@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
crw-rw-rw-  1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
[root@svbu-mpi055 ~]#

Something must be going wrong during the bootup.  I'm unfortunately  
several thousand miles from the server and don't have a serial  
console.  I guess I'll insert some initlog's in /etc/init.d/openibd...

-- 
Jeff Squyres
Cisco Systems


From jsquyres at cisco.com  Wed May 13 12:59:18 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 15:59:18 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com><94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com><5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com>
	<7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com>
Message-ID: <90B10556-C092-4BE0-A58F-8F1184AFBFCC@cisco.com>

Ok, I figured it out.  I have some creative
/etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do
nothing if no device is present (or some other esoteric,
specific-to-jeffs-cluster criterion is met) -- they call "exit 0" in this
case.  This apparently causes the top-level /etc/init.d/openibd to exit
(!).  I've fixed this (they now never call "exit"); now everything works
as expected.

Upon reflection, I can see that this was totally my fault -- ifcfg-*
scripts are always sourced and should therefore never call "exit".

But given that /etc/init.d/openibd is sooo complex and has sooo many
moving parts, it would be nice if there were a way to track down
problems a little more easily; perhaps a "verbose" setting in
/etc/infiniband/openibd.conf, or somesuch.  Indeed, since OFED is targeted
at the datacenter, monitors attached to the servers in question and/or
serial consoles may not be readily available.  Hence, having the
ability to drop some verbose output into syslog during boot, for
example, might be quite useful to sysadmins/network admins when
troubleshooting.

Just my $0.02.

Thanks for the tips where to look, Woody!


On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:

> On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:
>
> > Check to see if some other driver failed to load.
> > I think I have seen before that if another driver
> > fails to load, the start script bails out and
> > does not load the other drivers.
> >
> > Perhaps try doing a /etc/init.d/openibd restart
> > manually to see if something is failing to load.
> >
>
> Weird -- doing it manually shows no problem:
>
> [root at svbu-mpi055 ~]# /etc/init.d/openibd restart
> Unloading HCA driver:                                      [  OK  ]
> Loading HCA driver and Access Layer:                       [  OK  ]
> Setting up InfiniBand network interfaces:
> Bringing up interface ib0:                                 [  OK  ]
> Bringing up interface ib1:                                 [  OK  ]
> Setting up service network . . .                           [  done  ]
> [root at svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
> crw-rw-rw-  1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
> [root at svbu-mpi055 ~]#
>
> Something must be going wrong during the bootup.  I'm unfortunately
> several thousand miles from the server and don't have a serial
> console.  I guess I'll insert some initlog's in /etc/init.d/openibd...
>
> --
> Jeff Squyres
> Cisco Systems


-- 
Jeff Squyres
Cisco Systems


From rdreier at cisco.com  Wed May 13 14:35:16 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 May 2009 14:35:16 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <4A09A283.3090605@opengridcomputing.com> (Steve Wise's message of
	"Tue, 12 May 2009 11:23:31 -0500")
References: <20090424190510.3134.90405.stgit@build.ogc.int>
	<49F31A16.2080806@opengridcomputing.com>
	<49F4AE86.4090908@opengridcomputing.com>
	<49f515a5.1d1e640a.1c82.6677@mx.google.com>
	<49F5ED55.1010607@opengridcomputing.com>
	<1240855510.8818.9.camel@heimdal.trondhjem.org>
	<1240856613.8818.16.camel@heimdal.trondhjem.org>
	<49F60845.4010007@opengridcomputing.com>
	<1240865214.8818.73.camel@heimdal.trondhjem.org>
	<4A08A5C6.7040003@opengridcomputing.com>
	<1242082203.1743.11.camel@heimdal.trondhjem.org>
	<4A08BF1C.2050204@opengridcomputing.com>
	<1242089066.1743.19.camel@heimdal.trondhjem.org>
	<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>
	<1242092150.16618.15.camel@heimdal.trondhjem.org>
	<4A08E7B2.1010907@opengridcomputing.com>
	<4A099FB8.7090603@opengridcomputing.com>
	<4A09A283.3090605@opengridcomputing.com>
Message-ID: <adak54kr8iz.fsf@cisco.com>

 > Trond Myklebust wrote (earlier in this thread):
 > >
 > > All I should need to know is that I can advertise either dma handles or
 > > kernel VAs, and know that I can choose between two functions, say,
 > > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to
 > > initialise the ib_send_wr structure correctly.

I skimmed the earlier thread, and I have to say that I don't quite see
what the problem with assigning things to a u64 directly is.  You can
use any address you want, and I don't quite understand why using the
correct cast to avoid sign extension or truncation problems is such a
big maintenance burden?
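
For illustration, here is a minimal userspace sketch of the casts under
discussion.  The struct and function names are hypothetical stand-ins
(not the ib_verbs API), and uintptr_t plays the role of the kernel's
unsigned long.  It only shows how a pointer or a 32-bit value widens to
a 64-bit field with and without the unsigned cast:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit iova_start field of a work request. */
struct fast_reg_sketch {
	uint64_t iova_start;
};

/* Kernel virtual address case: going through uintptr_t zero-extends a
 * 32-bit pointer instead of sign-extending it. */
static inline void set_iova_from_ptr(struct fast_reg_sketch *wr, const void *addr)
{
	assert(wr != NULL);
	wr->iova_start = (uint64_t)(uintptr_t)addr;
}

/* Sign extension: a 32-bit address with the top bit set that passes
 * through a signed type picks up 0xffffffff in the upper half
 * (behaviour of the usual two's-complement compilers). */
static inline uint64_t widen_via_signed32(uint32_t addr32)
{
	return (uint64_t)(int32_t)addr32;
}

/* Zero extension: the unsigned path keeps the value intact. */
static inline uint64_t widen_via_unsigned32(uint32_t addr32)
{
	return (uint64_t)addr32;
}

/* Truncation: a 64-bit address squeezed through a 32-bit type loses
 * its upper half. */
static inline uint64_t roundtrip_via_32(uint64_t addr64)
{
	return (uint64_t)(uint32_t)addr64;
}
```

With these helpers, widen_via_signed32(0x80001000) and
widen_via_unsigned32(0x80001000) give different 64-bit results, which is
the whole difference between the right and the wrong cast.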

The code below really just looks like obfuscation to me -- are we going
to want to add something like

/**
 * ib_init_fast_reg_iova_start_u64 - initializes the iova_start field
 *   based on a 64-bit address supplied by the user.
 * @wr - struct ib_send_wr pointer to be initialized
 * @addr - u64 value to be used as the iova_start
 */
static inline void ib_init_fast_reg_iova_start_u64(struct ib_send_wr *wr,
						   u64 addr)
{
	wr->wr.fast_reg.iova_start = addr;
}

next, to make sure we don't get confused about assigning a u64 to a u64?
It all looks a bit overcomplicated to me.

 - R.

 > /**
 > + * ib_init_fast_reg_iova_start_dma - initializes the iova_start field
 > + *   based on a dma address supplied by the user.
 > + * @wr - struct ib_send_wr pointer to be initialized
 > + * @addr - dma_addr_t value to be used as the iova_start
 > + */
 > +static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr *wr,
 > +                                                  dma_addr_t addr)
 > +{
 > +       wr->wr.fast_reg.iova_start = addr;
 > +}
 > +
 > +/**
 > + * ib_init_fast_reg_iova_start_kva - initializes the iova_start field
 > + *   based on a kernel virtual address supplied by the user.
 > + * @wr - struct ib_send_wr pointer to be initialized
 > + * @addr - void * address to be used as the iova_start
 > + */
 > +static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *wr,
 > +                                                  void *addr)
 > +{
 > +       wr->wr.fast_reg.iova_start = (unsigned long)addr;
 > +}
 > +
 > +/**
 >  * ib_alloc_mw - Allocates a memory window.
 >  * @pd: The protection domain associated with the memory window.
 >  */


From rdreier at cisco.com  Wed May 13 15:18:13 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 May 2009 15:18:13 -0700
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <ada4ovoir4q.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get a couple of fixes to low-level drivers that fix crashes
seen when running NFS/RDMA:

Jack Morgenstein (1):
      IB/mlx4: Don't overwrite fast registration page list when posting work request

Roland Dreier (1):
      Merge branches 'cxgb3' and 'mlx4' into for-linus

Steve Wise (1):
      RDMA/cxgb3: Don't complete flushed send work requests twice

 drivers/infiniband/hw/cxgb3/cxio_hal.c |    1 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h   |    1 +
 drivers/infiniband/hw/mlx4/mr.c        |   10 ++++++++--
 drivers/infiniband/hw/mlx4/qp.c        |    2 +-
 4 files changed, 11 insertions(+), 3 deletions(-)
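
As a rough illustration of the mlx4 fix listed above (leave the caller's
page list in CPU byte order and write the big-endian, flag-tagged entries
into a separate buffer), here is a minimal userspace sketch.  The helper
names and the flag value are stand-ins chosen for the example and are
assumptions, not the driver's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MLX4_MTT_FLAG_PRESENT; the value is assumed for the sketch. */
#define MTT_FLAG_PRESENT_SKETCH 1ULL

/* Portable stand-in for cpu_to_be64(): returns a value whose in-memory
 * byte layout is big-endian regardless of host byte order. */
static inline uint64_t to_be64_sketch(uint64_t v)
{
	union { uint64_t u; unsigned char b[8]; } out;

	for (int i = 0; i < 8; i++)
		out.b[i] = (unsigned char)(v >> (56 - 8 * i));
	return out.u;
}

/* Convert the caller's page list into a separate hardware-format buffer.
 * page_list is const: it is never modified, so the same list can be
 * posted again and still yields the same hardware entries. */
static inline void fill_mapped_page_list(uint64_t *mapped,
					 const uint64_t *page_list, size_t len)
{
	assert(mapped != NULL && page_list != NULL);
	for (size_t i = 0; i < len; i++)
		mapped[i] = to_be64_sketch(page_list[i] | MTT_FLAG_PRESENT_SKETCH);
}
```

Converting in place instead would byte-swap and flag the caller's
entries a second time when the same page list is reused.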


diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index 8d71086..62f9cf2 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -410,6 +410,7 @@ int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
 	ptr = wq->sq_rptr + count;
 	sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
 	while (ptr != wq->sq_wptr) {
+		sqp->signaled = 0;
 		insert_sq_cqe(wq, cq, sqp);
 		ptr++;
 		sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 9974e88..8a7dd67 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -86,6 +86,7 @@ struct mlx4_ib_mr {
 
 struct mlx4_ib_fast_reg_page_list {
 	struct ib_fast_reg_page_list	ibfrpl;
+	__be64			       *mapped_page_list;
 	dma_addr_t			map;
 };
 
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 8e4d26d..8f3666b 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -231,7 +231,11 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
 	if (!mfrpl)
 		return ERR_PTR(-ENOMEM);
 
-	mfrpl->ibfrpl.page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
+	mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL);
+	if (!mfrpl->ibfrpl.page_list)
+		goto err_free;
+
+	mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
 						     size, &mfrpl->map,
 						     GFP_KERNEL);
 	if (!mfrpl->ibfrpl.page_list)
@@ -242,6 +246,7 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
 	return &mfrpl->ibfrpl;
 
 err_free:
+	kfree(mfrpl->ibfrpl.page_list);
 	kfree(mfrpl);
 	return ERR_PTR(-ENOMEM);
 }
@@ -252,8 +257,9 @@ void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
 	struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list);
 	int size = page_list->max_page_list_len * sizeof (u64);
 
-	dma_free_coherent(&dev->dev->pdev->dev, size, page_list->page_list,
+	dma_free_coherent(&dev->dev->pdev->dev, size, mfrpl->mapped_page_list,
 			  mfrpl->map);
+	kfree(mfrpl->ibfrpl.page_list);
 	kfree(mfrpl);
 }
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index f385a24..20724ae 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1365,7 +1365,7 @@ static void set_fmr_seg(struct mlx4_wqe_fmr_seg *fseg, struct ib_send_wr *wr)
 	int i;
 
 	for (i = 0; i < wr->wr.fast_reg.page_list_len; ++i)
-		wr->wr.fast_reg.page_list->page_list[i] =
+		mfrpl->mapped_page_list[i] =
 			cpu_to_be64(wr->wr.fast_reg.page_list->page_list[i] |
 				    MLX4_MTT_FLAG_PRESENT);
 

From rdreier at cisco.com  Wed May 13 15:19:17 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 May 2009 15:19:17 -0700
Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020()
	and init_serdes()
In-Reply-To: <4A0B1287.4060603@gmail.com> (Roel Kluin's message of "Wed, 13
	May 2009 20:33:43 +0200")
References: <4A0B1287.4060603@gmail.com>
Message-ID: <adazldghcii.fsf@cisco.com>

Looks good to me.  NES guys?

 - R.


From abenjamin at sgi.com  Wed May 13 16:21:47 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 13 May 2009 16:21:47 -0700
Subject: [ofa-general] [PATCH] core/mthca: Distinguish multiple IB cards in
	/proc/interrupts
Message-ID: <4A0B560B.3090606@sgi.com>

When the mthca driver calls request_irq() to allocate interrupt resources, it uses
the fixed device name string "ib_mthca". When multiple IB cards are present in the system,
every instance of the resource is named "ib_mthca" in /proc/interrupts.
This can make it very confusing trying to work out exactly where IB interrupts are going and why.

Summary of changes:

o Added a new IB core API, ib_init_device(), that allocates an ib_device
  struct and initializes its device name.
o Added a new field in the mthca_dev struct to hold its device (IRQ) names.
o Replaced the call to ib_alloc_device() with ib_init_device() at mthca
  device init time.
o Modified the device name parameter to request_irq() to use the device
  name allocated by ib_init_device().
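
The naming scheme can be sketched in userspace as follows.  This is only
an illustration under assumptions: format_irq_name() and NAME_MAX_SKETCH
are hypothetical names (the latter assumed to mirror IB_DEVICE_NAME_MAX),
and the kernel code would use its own string helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_MAX_SKETCH 64	/* assumed to mirror IB_DEVICE_NAME_MAX */

/* Build "<device><suffix>" into buf, e.g. "mthca0 (comp)".
 * Returns 0 on success, -1 if the result would not fit in len bytes. */
static inline int format_irq_name(char *buf, size_t len,
				  const char *dev_name, const char *suffix)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%s%s", dev_name, suffix);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Two cards then show up under distinct names such as "mthca0 (comp)" and
"mthca1 (comp)" rather than sharing one fixed string.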

Signed-off-by: Arputham Benjamin <abenjamin at sgi.com>

--- a/ofa_kernel-1.4/drivers/infiniband/core/device.c	2008-08-14 16:58:42.962168204 -0700
+++ b/ofa_kernel-1.4/drivers/infiniband/core/device.c	2008-08-14 17:00:31.276257856 -0700
@@ -181,6 +181,40 @@ struct ib_device *ib_alloc_device(size_t
 EXPORT_SYMBOL(ib_alloc_device);
 
 /**
+ * ib_init_device - allocate and initialize an IB device struct
+ * @size:size of structure to allocate
+ * @name:HCA device name
+ *
+ * Low-level drivers should use ib_init_device() to allocate &struct
+ * ib_device and initialize its device name.  @size is the size of
+ * the structure to be allocated, including any private data used by
+ * the low-level driver.
+ * ib_dealloc_device() must be used to free structures allocated with
+ * ib_init_device().
+ */
+struct ib_device *ib_init_device(size_t size, const char *name)
+{
+	int ret = 0;
+	struct ib_device *device;
+
+	device =  (struct ib_device *) ib_alloc_device(size);
+	if (device) {
+		strlcpy(device->name, name, IB_DEVICE_NAME_MAX);
+		if (strchr(device->name, '%')) {
+			mutex_lock(&device_mutex);
+			ret = alloc_name(device->name);
+			mutex_unlock(&device_mutex);
+		}
+	}
+	if (ret) {
+		ib_dealloc_device(device);
+		return  NULL;
+	}
+	return device;
+}
+EXPORT_SYMBOL(ib_init_device);
+
+/**
  * ib_dealloc_device - free an IB device struct
  * @device:structure to free
  *
--- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_dev.h	2008-08-14 16:58:42.994168822 -0700
+++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_dev.h	2008-08-14 17:00:31.288258088 -0700
@@ -360,6 +360,7 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+	char                  irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX];
 };
 
 #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG
--- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_eq.c	2008-08-14 16:58:42.994168822 -0700
+++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_eq.c	2008-08-14 17:00:31.304258396 -0700
@@ -860,17 +860,20 @@ int mthca_init_eq_table(struct mthca_dev
 
 	if (dev->mthca_flags & MTHCA_FLAG_MSI_X) {
 		static const char *eq_name[] = {
-			[MTHCA_EQ_COMP]  = DRV_NAME " (comp)",
-			[MTHCA_EQ_ASYNC] = DRV_NAME " (async)",
-			[MTHCA_EQ_CMD]   = DRV_NAME " (cmd)"
+			[MTHCA_EQ_COMP]  = " (comp)",
+			[MTHCA_EQ_ASYNC] = " (async)",
+			[MTHCA_EQ_CMD]   = " (cmd)"
 		};
 
 		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
+			strlcpy(dev->irq_name[i], dev->ib_dev.name, IB_DEVICE_NAME_MAX);
+			strlcat(dev->irq_name[i], eq_name[i], IB_DEVICE_NAME_MAX);
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
 					  mthca_is_memfree(dev) ?
 					  mthca_arbel_msi_x_interrupt :
 					  mthca_tavor_msi_x_interrupt,
-					  0, eq_name[i], dev->eq_table.eq + i);
+					  0, dev->irq_name[i],
+					  dev->eq_table.eq + i);
 			if (err)
 				goto err_out_cmd;
 			dev->eq_table.eq[i].have_irq = 1;
--- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_main.c	2008-08-14 16:58:42.994168822 -0700
+++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_main.c	2008-08-14 17:03:53.348154342 -0700
@@ -47,6 +47,8 @@
 #include "mthca_memfree.h"
 #include "mthca_wqe.h"
 
+struct ib_device *ib_init_device(size_t size, const char *name);
+
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -1091,7 +1093,7 @@ static int __mthca_init_one(struct pci_d
 		}
 	}
 
-	mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev);
+	mdev = (struct mthca_dev *) ib_init_device(sizeof *mdev, "mthca%d");
 	if (!mdev) {
 		dev_err(&pdev->dev, "Device struct alloc failed, "
 			"aborting.\n");
--- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_provider.c	2008-08-14 16:58:42.998168899 -0700
+++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_provider.c	2008-08-14 17:00:31.336259013 -0700
@@ -1358,7 +1358,6 @@ int mthca_register_device(struct mthca_d
 	if (ret)
 		return ret;
 
-	strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX);
 	dev->ib_dev.owner                = THIS_MODULE;
 
 	dev->ib_dev.uverbs_abi_ver	 = MTHCA_UVERBS_ABI_VERSION;


From ogerlitz at voltaire.com  Thu May 14 00:22:14 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 14 May 2009 10:22:14 +0300
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <adak54kr8iz.fsf@cisco.com>
References: <20090424190510.3134.90405.stgit@build.ogc.int>	<49F31A16.2080806@opengridcomputing.com>	<49F4AE86.4090908@opengridcomputing.com>	<49f515a5.1d1e640a.1c82.6677@mx.google.com>	<49F5ED55.1010607@opengridcomputing.com>	<1240855510.8818.9.camel@heimdal.trondhjem.org>	<1240856613.8818.16.camel@heimdal.trondhjem.org>	<49F60845.4010007@opengridcomputing.com>	<1240865214.8818.73.camel@heimdal.trondhjem.org>	<4A08A5C6.7040003@opengridcomputing.com>	<1242082203.1743.11.camel@heimdal.trondhjem.org>	<4A08BF1C.2050204@opengridcomputing.com>	<1242089066.1743.19.camel@heimdal.trondhjem.org>	<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>	<1242092150.16618.15.camel@heimdal.trondhjem.org>	<4A08E7B2.1010907@opengridcomputing.com>	<4A099FB8.7090603@opengridcomputing.com>	<4A09A283.3090605@opengridcomputing.com>
	<adak54kr8iz.fsf@cisco.com>
Message-ID: <4A0BC6A6.1070002@voltaire.com>

> Trond Myklebust wrote 
>> All I should need to know is that I can advertise either dma handles or kernel VAs

Maybe it's obvious to some people here, but may I ask why there's a need
to post either a dma address or a kernel virtual address? Is it an
application need? Is it hardware specific (e.g. IB vs. iWARP vs. vendor
implementation)? Or something else?

Or.


From vlad at lists.openfabrics.org  Thu May 14 03:24:15 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 14 May 2009 03:24:15 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090514-0200 daily build status
Message-ID: <20090514102415.6A014E6118E@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From chien.tin.tung at intel.com  Thu May 14 06:28:06 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Thu, 14 May 2009 06:28:06 -0700
Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020()
	and	init_serdes()
In-Reply-To: <4A0B1287.4060603@gmail.com>
References: <4A0B1287.4060603@gmail.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830350BDFFBE@azsmsx501.amr.corp.intel.com>

>With a postfix increment i is incremented beyond 10/5k so the
>error message will be displayed too soon.
>
>Signed-off-by: Roel Kluin <roel.kluin at gmail.com>
>---
>This could occur almost never.
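
The patch itself is not quoted in this message, so the following is only
a sketch of the kind of loop the description refers to, under the
assumption that the driver polls with a postfix increment in the loop
condition.  struct fake_hw, hw_ready() and the limit of 10 are
hypothetical stand-ins for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware: reports "ready" on the ready_after-th poll. */
struct fake_hw {
	int calls;
	int ready_after;
};

static inline int hw_ready(struct fake_hw *hw)
{
	assert(hw != NULL);
	return ++hw->calls >= hw->ready_after;
}

/* Poll until ready, giving up after limit retries.  Because of the
 * postfix increment, a real timeout leaves i == limit + 1, while a
 * success on the very last poll leaves i == limit. */
static inline int poll_with_postfix(struct fake_hw *hw, int limit)
{
	int i = 0;

	while (!hw_ready(hw) && i++ < limit)
		;	/* a driver would delay here */
	return i;
}

/* Off-by-one check: also fires when the last poll succeeded. */
static inline int timed_out_buggy(int i, int limit)
{
	return i >= limit;
}

/* Corrected check: fires only when every poll failed. */
static inline int timed_out_fixed(int i, int limit)
{
	return i > limit;
}
```

The two checks differ only when the hardware becomes ready on the final
poll, which fits the remark that this could occur almost never.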

Thanks for the patch.  Roland please apply.

Acked-by: Chien Tung <chien.tin.tung at intel.com>

Chien

From swise at opengridcomputing.com  Thu May 14 06:41:06 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 May 2009 08:41:06 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <4A0BC6A6.1070002@voltaire.com>
References: <20090424190510.3134.90405.stgit@build.ogc.int>	<49F31A16.2080806@opengridcomputing.com>	<49F4AE86.4090908@opengridcomputing.com>	<49f515a5.1d1e640a.1c82.6677@mx.google.com>	<49F5ED55.1010607@opengridcomputing.com>	<1240855510.8818.9.camel@heimdal.trondhjem.org>	<1240856613.8818.16.camel@heimdal.trondhjem.org>	<49F60845.4010007@opengridcomputing.com>	<1240865214.8818.73.camel@heimdal.trondhjem.org>	<4A08A5C6.7040003@opengridcomputing.com>	<1242082203.1743.11.camel@heimdal.trondhjem.org>	<4A08BF1C.2050204@opengridcomputing.com>	<1242089066.1743.19.camel@heimdal.trondhjem.org>	<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>	<1242092150.16618.15.camel@heimdal.trondhjem.org>	<4A08E7B2.1010907@opengridcomputing.com>	<4A099FB8.7090603@opengridcomputing.com>	<4A09A283.3090605@opengridcomputing.com>
	<adak54kr8iz.fsf@cisco.com> <4A0BC6A6.1070002@voltaire.com>
Message-ID: <4A0C1F72.8050503@opengridcomputing.com>

Or Gerlitz wrote:
>> Trond Myklebust wrote
>>> All I should need to know is that I can advertise either dma handles 
>>> or kernel VAs
>
> Maybe it's obvious to some people here, but may I ask why there's a
> need to post either a dma address or a kernel virtual address? Is it
> an application need? Is it hardware specific (e.g. IB vs. iWARP vs.
> vendor implementation)? Or something else?
>
> Or.
>
>

The NFSRDMA transport uses Fast Register Memory Regions.  In this 
particular section of code, the NFSRDMA client is building a fastreg 
work request to bind a page list to a fastreg mr.  You can read about 
this in the IBTA spec on memory management extensions, or in the RDMA 
Verbs draft.


Steve.


From ogerlitz at Voltaire.com  Thu May 14 06:45:26 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Thu, 14 May 2009 16:45:26 +0300
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start
	values are truncated by	the nfs rdma client.
In-Reply-To: <4A0C1F72.8050503@opengridcomputing.com>
References: <20090424190510.3134.90405.stgit@build.ogc.int>	<49F31A16.2080806@opengridcomputing.com>	<49F4AE86.4090908@opengridcomputing.com>	<49f515a5.1d1e640a.1c82.6677@mx.google.com>	<49F5ED55.1010607@opengridcomputing.com>	<1240855510.8818.9.camel@heimdal.trondhjem.org>	<1240856613.8818.16.camel@heimdal.trondhjem.org>	<49F60845.4010007@opengridcomputing.com>	<1240865214.8818.73.camel@heimdal.trondhjem.org>	<4A08A5C6.7040003@opengridcomputing.com>	<1242082203.1743.11.camel@heimdal.trondhjem.org>	<4A08BF1C.2050204@opengridcomputing.com>	<1242089066.1743.19.camel@heimdal.trondhjem.org>	<4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com>	<1242092150.16618.15.camel@heimdal.trondhjem.org>	<4A08E7B2.1010907@opengridcomputing.com>	<4A099FB8.7090603@opengridcomputing.com>	<4A09A283.3090605@opengridcomputing.com>
	<adak54kr8iz.fsf@cisco.com>	<4A0BC6A6.1070002@voltaire.com>
	<4A0C1F72.8050503@opengridcomputing.com>
Message-ID: <4A0C2076.8010702@Voltaire.com>

Steve Wise wrote:
> The NFSRDMA transport uses Fast Register Memory Regions.  In this
> particular section of code, the NFSRDMA client is building a fastreg
> work request to bind a page list to a fastreg mr.  You can read about
> this in the IBTA spec on memory management extensions, or in the RDMA Verbs draft.

Hi Steve,

I was aware that the context is a fastreg work request. I had thought
that the spec mandates either a dma addr or a kva for the iova, but from
your reply I assume I was wrong, thanks.

Or.


From harsha at zresearch.com  Thu May 14 10:40:04 2009
From: harsha at zresearch.com (Harshavardhana)
Date: Thu, 14 May 2009 23:10:04 +0530
Subject: [ofa-general] GlusterFS 2.0 Release
Message-ID: <8a80e9760905141040y5456f1cbqfc79061379fd55ad@mail.gmail.com>

Greetings everyone,

On behalf of the GlusterFS team, I'm happy to announce the release of
GlusterFS version 2.0.

Announcement
===========
About GlusterFS:
GlusterFS is a clustered file system that runs on commodity
off-the-shelf hardware, delivering multiple times the scalability and
performance of conventional storage. The architecture is modular,
stackable and kernel-independent, which makes it easy to customize,
install, manage and support different operating systems. Multiple
storage systems can be clustered together, supporting petabytes of
capacity in a single global namespace. Building a configuration of a
few hundred terabytes can be accomplished in less than thirty minutes.

GlusterFS Release v2.0:
GlusterFS v2.0 has gone through a major revamp in design and
development since v1.3. Thanks go to the thousands of initial users who
provided us with great feedback and bug reports. There are a number of
production deployments now. GlusterFS uses existing disk file systems
(such as Ext3, XFS, ZFS...) to store your data as regular files and
folders. You can restore the data even after you uninstall GlusterFS.

So, give it a try and let us know. Please forward this message to
relevant users.

What is in 2.0 release:
http://www.gluster.org/docs/index.php/GlusterFS_Features

Who is using GlusterFS:
http://www.gluster.org/docs/index.php/Who%27s_using_GlusterFS

License: GNU GPLv3

Download: http://www.gluster.org/download.php

Happy Hacking
===========
--
GlusterFS Team

From weiny2 at llnl.gov  Thu May 14 16:04:17 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 14 May 2009 16:04:17 -0700
Subject: [ofa-general] [PATCH] iblinkinfo,
 ibqueryerrors: prevent core when switch is not found
Message-ID: <20090514160417.e7505e06.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 14 May 2009 15:52:42 -0700
Subject: [PATCH] iblinkinfo, ibqueryerrors: prevent core when switch is not found

	If the switch is not found, print a nice error message instead of segfaulting.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/iblinkinfo.c    |   11 +++++++++--
 infiniband-diags/src/ibqueryerrors.c |   10 ++++++++--
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index cf38ecb..367056c 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -395,11 +395,18 @@ main(int argc, char **argv)
 			goto close_port;
 		}
 
-	if (guid) {
+	if (guid_str) {
 		ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid);
-		print_switch(sw, NULL);
+		if (sw)
+			print_switch(sw, NULL);
+		else
+			fprintf(stderr, "Failed to find switch: %s\n", guid_str);
 	} else if (dr_path) {
 		ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path);
+		if (sw)
+			print_switch(sw, NULL);
+		else
+			fprintf(stderr, "Failed to find switch: %s\n", dr_path);
 		print_switch(sw, NULL);
 	} else {
 		ibnd_iter_nodes_type(fabric, print_switch, IB_NODE_SWITCH, NULL);
diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c
index 525af70..999329e 100644
--- a/infiniband-diags/src/ibqueryerrors.c
+++ b/infiniband-diags/src/ibqueryerrors.c
@@ -445,10 +445,16 @@ main(int argc, char **argv)
 
 	if (switch_guid) {
 		ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid);
-		print_node(node, NULL);
+		if (node)
+			print_node(node, NULL);
+		else
+			fprintf(stderr, "Failed to find node: %s\n", switch_guid_str);
 	} else if (dr_path) {
 		ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path);
-		print_node(node, NULL);
+		if (node)
+			print_node(node, NULL);
+		else
+			fprintf(stderr, "Failed to find node: %s\n", dr_path);
 	} else
 		ibnd_iter_nodes(fabric, print_node, NULL);
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu May 14 16:42:10 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 14 May 2009 16:42:10 -0700
Subject: [ofa-general] [PATCH] iblinkinfo: remove unused file pointer.
Message-ID: <20090514164210.8a42f37d.weiny2@llnl.gov>


From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 14 May 2009 16:39:52 -0700
Subject: [PATCH] iblinkinfo: remove unused file pointer.


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/iblinkinfo.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 367056c..d422a2a 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -52,7 +52,6 @@
 #include <infiniband/ibnetdisc.h>
 
 char *argv0 = "iblinkinfotest";
-static FILE *f;
 
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
@@ -294,8 +293,6 @@ main(int argc, char **argv)
 		{ 0 }
 	};
 
-	f = stdout;
-
 	argv0 = argv[0];
 
 	while (1) {
@@ -357,9 +354,6 @@ main(int argc, char **argv)
 	argc -= optind;
 	argv += optind;
 
-	if (argc && !(f = fopen(argv[0], "w")))
-		fprintf(stderr, "can't open file %s for writing", argv[0]);
-
 	ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 3);
 	if (!ibmad_port) {
 		fprintf(stderr, "Failed to open %s port %d", ca, ca_port);
-- 
1.5.4.5


From acceptany at gmail.com  Thu May 14 18:41:38 2009
From: acceptany at gmail.com (Jordan)
Date: Fri, 15 May 2009 09:41:38 +0800
Subject: [ofa-general] Some problem about the root nodes selection in up/down
	algorithm
Message-ID: <91fe68d50905141841x659cf13dt3076440c7ceeb995@mail.gmail.com>

In the function "updn_find_root_nodes_by_min_hop(OUT updn_t * p_updn)",
there are two statements: "thd1 = cas_num * 0.9; thd2 = cas_num * 0.05;".
I can't understand what the numbers 0.9 and 0.05 mean. Why are these
values used? What's the principle behind this root node selection algorithm?

From vlad at lists.openfabrics.org  Fri May 15 03:46:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 15 May 2009 03:46:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090515-0200 daily build status
Message-ID: <20090515104613.700B8E61112@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From rdreier at cisco.com  Fri May 15 10:17:25 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 15 May 2009 10:17:25 -0700
Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020()
	and init_serdes()
In-Reply-To: <4A0B1287.4060603@gmail.com> (Roel Kluin's message of "Wed, 13
	May 2009 20:33:43 +0200")
References: <4A0B1287.4060603@gmail.com>
Message-ID: <adaoctuffq2.fsf@cisco.com>

Thanks, I've applied this.


From rdreier at cisco.com  Fri May 15 14:44:07 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 15 May 2009 14:44:07 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <4A0B560B.3090606@sgi.com> (Arputham Benjamin's message of "Wed, 
	13 May 2009 16:21:47 -0700")
References: <4A0B560B.3090606@sgi.com>
Message-ID: <adad4aaf3dk.fsf@cisco.com>


 > When the mthca driver calls request_irq() to allocate interrupt resources, it uses
 > the fixed device name string "ib_mthca". When multiple IB cards are present in the system,
 > every instance of the resource is named "ib_mthca" in /proc/interrupts.
 > This can make it very confusing trying to work out exactly where IB interrupts are going and why.

Fundamentally makes sense.  Some comments about the specifics:

 > o Added a new IB core API , ib_init_device() that allocates an ib_device struct
 >   and initializes its device name.

Seems reasonable.  However I don't think we need both ib_init_device()
and ib_alloc_device(), and also the "ib_init_device" name doesn't imply
that it is allocating memory.

 > o Modified device name parameter to request_irq() to use the device name 
 >   allocated by ib_init_device()

You only did this for mthca and only in the MSI-X case.  I would suggest
that mthca at least needs to be consistent between MSI-X and non-MSI-X,
and it would be desirable to convert other drivers as well.

Also the mthca changes really should be separated out from the changes
to the core API.

So I would suggest reworking this into a series of patches:

1. Add a function ib_alloc_device_set_name() that does what your
   ib_init_device() function does.  (By the way, there is a problem with
   your implementation, since alloc_name() just checks the list of
   registered devices for a collision -- so devices that are allocated
   but not registered could be assigned the same name, if the kernel
   ever moves to parallelizing PCI probing or something like that -- so
   you should probably fix alloc_name() to check a list of all allocated
   devices or something like that)

2. For each RDMA driver (ie each of drivers/infiniband/hw/xxx), convert
   to using ib_alloc_device_set_name() -- one patch per driver.

3. Remove the old ib_alloc_device() and rename
   ib_alloc_device_set_name() back to ib_alloc_device().

4. Change mthca to use the device name when naming IRQs, both in MSI-X
   and INTx mode.

5. [optional] Have other drivers name their IRQs similarly.
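
To make the collision concern in item 1 concrete, here is a minimal
userspace sketch of reserving a name index at allocation time instead of at
registration time. It is not the kernel's alloc_name(): MAX_DEVICES,
index_in_use, reserve_device_name() and release_device_name() are all
hypothetical names, the index space is shared across prefixes for
simplicity, and locking is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_DEVICES 32	/* illustrative limit, not a kernel constant */

/* One slot per index; true from allocation until release. */
static bool index_in_use[MAX_DEVICES];

/*
 * Reserve the lowest free index and write "<prefix><index>" into name.
 * Returns the index, or -1 if no index is free or the name does not fit.
 * A real implementation would hold a lock around the scan.
 */
int reserve_device_name(char *name, size_t len, const char *prefix)
{
	for (int i = 0; i < MAX_DEVICES; i++) {
		int n;

		if (index_in_use[i])
			continue;

		n = snprintf(name, len, "%s%d", prefix, i);
		if (n < 0 || (size_t)n >= len)
			return -1;

		/* Mark the index taken now, before any registration step. */
		index_in_use[i] = true;
		return i;
	}
	return -1;
}

/* Give the index back when the device is freed. */
void release_device_name(int index)
{
	if (index >= 0 && index < MAX_DEVICES)
		index_in_use[index] = false;
}
```

Because the index is marked busy as soon as it is handed out, two devices
that are allocated but not yet registered cannot end up with the same name.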

One specific thing that puzzles me.  You add a field:

@@ -360,6 +360,7 @@ struct mthca_dev {
 	struct ib_ah         *sm_ah[MTHCA_MAX_PORTS];
 	spinlock_t            sm_lock;
 	u8                    rate[MTHCA_MAX_PORTS];
+	char                  irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX];
 };

which looks sane, but then the way you use it is:

 > +			strcpy(&dev->irq_name[i][IB_DEVICE_NAME_MAX], dev->ib_dev.name);
 > +			strcat(&dev->irq_name[i][IB_DEVICE_NAME_MAX], eq_name[i]);

Why is the address you want at the position IB_DEVICE_NAME_MAX instead
of at index 0?  Also (this is theoretical only since IB_DEVICE_NAME_MAX
is much bigger than the size of "mthcaX") there is no range checking:
since you only allocate IB_DEVICE_NAME_MAX, what prevents the eq_name
part from overflowing?  In general I don't like using strcpy()/strcat()
instead of strlcpy()/strlcat().

(And why write this as strcpy followed by strcat instead of a single
snprintf()?)
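
For illustration, the single-snprintf() form could look roughly like the
sketch below. This is a userspace sketch, not the mthca code:
format_irq_name(), its parameters and the example suffix are hypothetical
names; only snprintf() itself is a standard call.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build "<dev_name><suffix>" into dst, which holds dst_len bytes.
 * Writing starts at index 0 of dst and never goes past dst_len.
 * Returns 0 on success, -1 if the result did not fit (for dst_len > 0
 * dst is then truncated but still NUL-terminated).
 */
int format_irq_name(char *dst, size_t dst_len,
		    const char *dev_name, const char *suffix)
{
	int n = snprintf(dst, dst_len, "%s%s", dev_name, suffix);

	return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}
```

With a 16-byte buffer, format_irq_name(buf, sizeof buf, "mthca0", "-comp")
fills buf with "mthca0-comp"; with an 8-byte buffer the same call returns -1
instead of writing past the end.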

 - R.


From vlad at lists.openfabrics.org  Sat May 16 03:23:24 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 16 May 2009 03:23:24 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090516-0200 daily build status
Message-ID: <20090516102324.C1753E61508@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From vlad at lists.openfabrics.org  Sun May 17 03:22:57 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 17 May 2009 03:22:57 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090517-0200 daily build status
Message-ID: <20090517102258.1B77FE613CA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From dorfman.eli at gmail.com  Sun May 17 07:06:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 17 May 2009 17:06:46 +0300
Subject: [ofa-general] [PATCH ] opensm: MFT tables are not set after non full
	member re-join
Message-ID: <4A1019F6.5060900@gmail.com>

MFT tables are not set after non full member re-join

In case of a non full member re-join, MFT tables are not set.
There is no need to set or check a non full member's reference to the mlid
(port->mcm_list). This list should be used only for full members, for cleanup
when a port goes down.

A simple scenario to reproduce this:
1. Full member creates group
2. Non-member joins - MFT sent
3. Full member leaves
        a. group is deleted but the non-member port still has a reference to the MLID
4. Full member re-creates the group
5. Non-member re-joins - MFT *NOT* sent to switches

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/include/opensm/osm_sm.h         |    3 ++-
 opensm/opensm/osm_sa_mcmember_record.c |    6 +++---
 opensm/opensm/osm_sm.c                 |   22 +++++++++++++++++++++-
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
index cc8321d..1a8a577 100644
--- a/opensm/include/opensm/osm_sm.h
+++ b/opensm/include/opensm/osm_sm.h
@@ -539,7 +539,8 @@ osm_resp_send(IN osm_sm_t * sm,
 ib_api_status_t
 osm_sm_mcgrp_join(IN osm_sm_t * const p_sm,
 		  IN const ib_net16_t mlid,
-		  IN const ib_net64_t port_guid);
+		  IN const ib_net64_t port_guid,
+				  IN uint8_t scope_state);
 /*
 * PARAMETERS
 *	p_sm
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 5543221..fe29dd6 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1039,7 +1039,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	if (!p_mgrp) {
 		char gid_str[INET6_ADDRSTRLEN];
 		CL_PLOCK_RELEASE(sa->p_lock);
-		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
+		OSM_LOG(sa->p_log, OSM_LOG_INFO,
 			"Failed since multicast group %s not present\n",
 			inet_ntop(AF_INET6, p_recvd_mcmember_rec->mgid.raw,
 				  gid_str, sizeof gid_str));
@@ -1309,8 +1309,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 
 	/* do the actual routing (actually schedule the update) */
 	status = osm_sm_mcgrp_join(sa->sm, mlid,
-				   p_recvd_mcmember_rec->port_gid.unicast.
-				   interface_id);
+							   p_recvd_mcmember_rec->port_gid.unicast.interface_id,
+							   p_recvd_mcmember_rec->scope_state);
 
 	if (status != IB_SUCCESS) {
 		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: "
diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
index daa60ff..b334d39 100644
--- a/opensm/opensm/osm_sm.c
+++ b/opensm/opensm/osm_sm.c
@@ -468,7 +468,7 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
 /**********************************************************************
  **********************************************************************/
 ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
-				  IN const ib_net64_t port_guid)
+				  IN const ib_net64_t port_guid, IN uint8_t scope_state)
 {
 	osm_mgrp_t *p_mgrp;
 	osm_port_t *p_port;
@@ -515,6 +515,25 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 		goto Exit;
 	}
 
+	/* if there was no change from the last time
+	 * we processed the group we can skip doing anything
+	 */
+	if (p_mgrp->last_change_id == p_mgrp->last_tree_id) {
+		OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE,
+			"Skip processing mgrp with lid:0x%X last change id:%u\n",
+			cl_ntoh16(mlid), p_mgrp->last_change_id);
+		goto Exit;
+	} else {
+		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
+			"processing mgrp with lid:0x%X port: 0x%016" PRIx64 " last change id:%u tree id:%u\n",
+			cl_ntoh16(mlid), cl_ntoh64(port_guid), 
+			p_mgrp->last_change_id, p_mgrp->last_tree_id);
+	}
+
+	/* add mgrp only to FULL member port. used for cleanup when port goes down */
+	if (!(scope_state & IB_JOIN_STATE_FULL))
+		goto MgrpProcess;
+
 	/*
 	 * Check if the object (according to mlid) already exists on this port.
 	 * If it does - then no need to update it again, and no need to
@@ -543,6 +562,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 		goto Exit;
 	}
 
+MgrpProcess:
 	status = sm_mgrp_process(p_sm, p_mgrp);
 	CL_PLOCK_RELEASE(p_sm->p_lock);
 
-- 
1.5.3.6


From sebastien.dugue at bull.net  Mon May 18 00:55:16 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 18 May 2009 09:55:16 +0200
Subject: [ofa-general] [PATCH 2/3] libmlx4 - Optimize memory allocation of QP
 buffers with 64K pages
In-Reply-To: <20090518095156.7f9c39e6@frecb007965>
References: <20090518095156.7f9c39e6@frecb007965>
Message-ID: <20090518095516.6a803492@frecb007965>


  QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffer
size to the page size and then allocates page-aligned memory using
posix_memalign().

  However, this allocation is quite wasteful on architectures using 64K pages
(ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
parameter and chunks are allocated using mmap. Thus we end up allocating:

(requested size rounded to the page size) + (page size) + (malloc overhead)

rounded internally to the page size.

  So for example, if we request a buffer of page_size bytes, we end up
consuming 3 pages. In short, for each QP buffer we allocate, there is an
overhead of 2 pages. This is quite visible on large clusters, where
the number of QPs can reach several thousand.
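
  The arithmetic above can be checked with a small sketch. The helper names
(round_up(), memalign_pages(), mmap_pages()) are made up for this example,
and the malloc_overhead argument is an assumed placeholder for the
allocator's bookkeeping, not a glibc constant.

```c
#include <stddef.h>

/* Round val up to a multiple of align (align must be a power of two). */
size_t round_up(size_t val, size_t align)
{
	return (val + align - 1) & ~(align - 1);
}

/*
 * Pages consumed when posix_memalign() ends up being served by mmap:
 * (request rounded to the page size) + (one page for alignment) +
 * (allocator overhead), all rounded up to whole pages.
 */
size_t memalign_pages(size_t request, size_t page_size,
		      size_t malloc_overhead)
{
	size_t total = round_up(request, page_size) + page_size +
		       malloc_overhead;

	return round_up(total, page_size) / page_size;
}

/* Pages consumed when the buffer is mapped directly with mmap(). */
size_t mmap_pages(size_t request, size_t page_size)
{
	return round_up(request, page_size) / page_size;
}
```

  For a 64K page size, a 64K request and an assumed 16 bytes of overhead,
memalign_pages() gives 3 pages while mmap_pages() gives 1, which is the
2-page overhead per QP buffer described above.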

  This patch creates a new function mlx4_alloc_page() for use by
mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
the page size is 64K or larger.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 src/buf.c  |   40 ++++++++++++++++++++++++++++++++++++++--
 src/mlx4.h |    7 +++++++
 src/qp.c   |    5 +++--
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/src/buf.c b/src/buf.c
index 0e5f9b6..c8b6823 100644
--- a/src/buf.c
+++ b/src/buf.c
@@ -35,6 +35,8 @@
 #endif /* HAVE_CONFIG_H */
 
 #include <stdlib.h>
+#include <sys/mman.h>
+#include <errno.h>
 
 #include "mlx4.h"
 
@@ -69,14 +71,48 @@ int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size)
 	if (ret)
 		free(buf->buf);
 
-	if (!ret)
+	if (!ret) {
 		buf->length = size;
+		buf->type = MLX4_MALIGN;
+	}
 
 	return ret;
 }
 
+#define PAGE_64K	(1UL << 16)
+
+int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size)
+{
+	int ret;
+
+	/* Use the standard posix_memalign() call for pages < 64K */
+	if (page_size < PAGE_64K)
+		return mlx4_alloc_buf(buf, size, page_size);
+
+	/* Otherwise we can save a lot by using mmap directly */
+	buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (buf->buf == MAP_FAILED)
+		return errno;
+
+	ret = ibv_dontfork_range(buf->buf, size);
+	if (ret)
+		munmap(buf->buf, align(size, page_size));
+	else {
+		buf->length = size;
+		buf->type = MLX4_MMAP;
+	}
+
+	return ret;
+}
+
 void mlx4_free_buf(struct mlx4_buf *buf)
 {
 	ibv_dofork_range(buf->buf, buf->length);
-	free(buf->buf);
+
+	if ( buf->type == MLX4_MMAP )
+		munmap(buf->buf, buf->length);
+	else
+		free(buf->buf);
 }
diff --git a/src/mlx4.h b/src/mlx4.h
index 827a201..83547f5 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -161,9 +161,15 @@ struct mlx4_context {
 	pthread_mutex_t			db_list_mutex;
 };
 
+enum mlx4_buf_type {
+	MLX4_MMAP,
+	MLX4_MALIGN
+};
+
 struct mlx4_buf {
 	void			       *buf;
 	size_t				length;
+	enum mlx4_buf_type		type;
 };
 
 struct mlx4_pd {
@@ -288,6 +294,7 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah)
 }
 
 int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size);
+int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size);
 void mlx4_free_buf(struct mlx4_buf *buf);
 
 uint32_t *mlx4_alloc_db(struct mlx4_context *context, enum mlx4_db_type type);
diff --git a/src/qp.c b/src/qp.c
index d194ae3..557e255 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -604,8 +604,9 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 		qp->sq.offset = 0;
 	}
 
-	if (mlx4_alloc_buf(&qp->buf,
-			    align(qp->buf_size, to_mdev(pd->context->device)->page_size),
+	if (mlx4_alloc_page(&qp->buf,
+			    align(qp->buf_size,
+				  to_mdev(pd->context->device)->page_size),
 			    to_mdev(pd->context->device)->page_size)) {
 		free(qp->sq.wrid);
 		free(qp->rq.wrid);
-- 
1.6.3.rc3.12.gb7937


From sebastien.dugue at bull.net  Mon May 18 00:51:56 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 18 May 2009 09:51:56 +0200
Subject: [ofa-general] [PATCH 0/3] - libmthca libmlx4 - Optimize memory
 allocation of QP buffers with 64K pages
Message-ID: <20090518095156.7f9c39e6@frecb007965>


  Hi,

  libmthca and libmlx4 allocate QP buffers using posix_memalign(), which
results in significant memory waste on architectures with 64K pages.

  Replacing posix_memalign() with mmap() on those platforms fixes
this (more description in the patches themselves).

  Now, for some numbers, a micro benchmark I wrote shows the heap usage and
the number of mmaped pages used with posix_memalign() and mmap() respectively
for 1000, 2000, up to 8000 QPs.

  MTHCA
               posix_memalign			   mmap
  QP	   heap		mmaped(pages)	   heap		mmaped(pages)
 1000	   838736	    2988	   576512	   1000
 2000	  1751216	    5973	  1161264	   2000
 3000	  2598144	    8961	  1746016	   3000
 4000	  3510656	   11946	  2330704	   4000
 5000	  4357616	   14934	  2915440	   5000
 6000	  5270080	   17919	  3500176	   6000
 7000	  6117056	   20907	  4084912	   7000
 8000	  6963968	   23895	  4669632	   8000

  MLX4
               posix_memalign			   mmap
  QP	   heap		mmaped(pages)	   heap		mmaped(pages)
 1000	  1469424	    2982	  1010544	   1003
 2000	  2994048	    5958	  2010752	   2003
 3000	  4518672	    8934	  3010960	   3003
 4000	  5969520	   11913	  4002960	   4003
 5000	  7494176	   14889	  5003168	   5003
 6000	  8953248	   17868	  6003376	   6003
 7000	 10477856	   20844	  7003584	   7003
 8000	 12002496	   23820	  8003792	   8003


  This patchset consists of 3 patches:

  1. Optimize memory allocation of QP buffers for libmthca
  2. Optimize memory allocation of QP buffers for libmlx4
  3. Refresh the patches in 'fixes/' for libmlx4 after applying the
     previous patch.


  Sebastien Dugue


From sebastien.dugue at bull.net  Mon May 18 00:55:25 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 18 May 2009 09:55:25 +0200
Subject: [ofa-general] [PATCH 1/3] libmthca - Optimize memory allocation of
 QP buffers with 64K pages
In-Reply-To: <20090518095156.7f9c39e6@frecb007965>
References: <20090518095156.7f9c39e6@frecb007965>
Message-ID: <20090518095525.064a0cb5@frecb007965>


  QP buffers are allocated with mthca_alloc_buf(), which rounds the buffer
size to the page size and then allocates page-aligned memory using
posix_memalign().

  However, this allocation is quite wasteful on architectures using 64K pages
(ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
parameter and chunks are allocated using mmap. Thus we end up allocating:

(requested size rounded to the page size) + (page size) + (malloc overhead)

rounded internally to the page size.

  So for example, if we request a buffer of page_size bytes, we end up
consuming 3 pages. In short, for each QP buffer we allocate, there is an
overhead of 2 pages. This is quite visible on large clusters, where
the number of QPs can reach several thousand.

  This patch creates a new function mthca_alloc_page() for use by
mthca_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
the page size is 64K or larger.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 src/buf.c   |   40 ++++++++++++++++++++++++++++++++++++++--
 src/mthca.h |    7 +++++++
 src/qp.c    |    7 ++++---
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/buf.c b/src/buf.c
index 6c1be4f..ae37e9c 100644
--- a/src/buf.c
+++ b/src/buf.c
@@ -35,6 +35,8 @@
 #endif /* HAVE_CONFIG_H */
 
 #include <stdlib.h>
+#include <sys/mman.h>
+#include <errno.h>
 
 #include "mthca.h"
 
@@ -69,8 +71,38 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size)
 	if (ret)
 		free(buf->buf);
 
-	if (!ret)
+	if (!ret) {
 		buf->length = size;
+		buf->type = MTHCA_MALIGN;
+	}
+
+	return ret;
+}
+
+#define PAGE_64K	(1UL << 16)
+
+int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size)
+{
+	int ret;
+
+	/* Use the standard posix_memalign() call for pages < 64K */
+	if (page_size < PAGE_64K)
+		return mthca_alloc_buf(buf, size, page_size);
+
+	/* Otherwise we can save a lot by using mmap directly */
+	buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (buf->buf == MAP_FAILED)
+		return errno;
+
+	ret = ibv_dontfork_range(buf->buf, size);
+	if (ret)
+		munmap(buf->buf, align(size, page_size));
+	else {
+		buf->length = size;
+		buf->type = MTHCA_MMAP;
+	}
 
 	return ret;
 }
@@ -78,5 +110,9 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size)
 void mthca_free_buf(struct mthca_buf *buf)
 {
 	ibv_dofork_range(buf->buf, buf->length);
-	free(buf->buf);
+
+	if ( buf->type == MTHCA_MMAP )
+		munmap(buf->buf, buf->length);
+	else
+		free(buf->buf);
 }
diff --git a/src/mthca.h b/src/mthca.h
index 66751f3..7db15a7 100644
--- a/src/mthca.h
+++ b/src/mthca.h
@@ -138,9 +138,15 @@ struct mthca_context {
 	int		       qp_table_mask;
 };
 
+enum mthca_buf_type {
+	MTHCA_MMAP,
+	MTHCA_MALIGN
+};
+
 struct mthca_buf {
 	void		       *buf;
 	size_t			length;
+	enum mthca_buf_type	type;
 };
 
 struct mthca_pd {
@@ -291,6 +297,7 @@ static inline int mthca_is_memfree(struct ibv_context *ibctx)
 }
 
 int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size);
+int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size);
 void mthca_free_buf(struct mthca_buf *buf);
 
 int mthca_alloc_db(struct mthca_db_table *db_tab, enum mthca_db_type type,
diff --git a/src/qp.c b/src/qp.c
index 84dd206..15f4805 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -848,9 +848,10 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 
 	qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift);
 
-	if (mthca_alloc_buf(&qp->buf,
-			    align(qp->buf_size, to_mdev(pd->context->device)->page_size),
-			    to_mdev(pd->context->device)->page_size)) {
+	if (mthca_alloc_page(&qp->buf,
+			     align(qp->buf_size,
+				   to_mdev(pd->context->device)->page_size),
+			     to_mdev(pd->context->device)->page_size)) {
 		free(qp->wrid);
 		return -1;
 	}
-- 
1.6.3.rc3.12.gb7937


From sebastien.dugue at bull.net  Mon May 18 01:06:18 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 18 May 2009 10:06:18 +0200
Subject: [ofa-general] [PATCH 3/3] libmlx4 - Fix fixes after QP buffers alloc
 optimization patch to allow build.
In-Reply-To: <20090518095156.7f9c39e6@frecb007965>
References: <20090518095156.7f9c39e6@frecb007965>
Message-ID: <20090518100618.3615f4ed@frecb007965>


  The patches in 'fixes/' need to be refreshed after the previous patch in
order to build properly.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 fixes/lim_qp_resources.patch     |   20 ++++-------
 fixes/resize_cq_owner_bit.patch  |    4 +--
 fixes/userspace_dev_lims.patch   |   12 ++----
 fixes/xrc_consolidated_v2.patch  |   68 ++++++++++++++------------------------
 fixes/xrc_fix_close_domain.patch |    8 ++---
 fixes/xrc_rcv_qp_v2.patch        |   12 ++-----
 6 files changed, 44 insertions(+), 80 deletions(-)

diff --git a/fixes/lim_qp_resources.patch b/fixes/lim_qp_resources.patch
index 1f89256..54cc63e 100644
--- a/fixes/lim_qp_resources.patch
+++ b/fixes/lim_qp_resources.patch
@@ -7,11 +7,9 @@ qp creation also lie within the reported device limits.
     
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/qp.c
-===================================================================
---- libmlx4.orig/src/qp.c	2008-06-04 08:24:45.000000000 +0300
-+++ libmlx4/src/qp.c	2008-06-04 08:24:49.000000000 +0300
-@@ -619,6 +619,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
+--- a/src/qp.c
++++ b/src/qp.c
+@@ -622,6 +622,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  		       enum ibv_qp_type type)
  {
  	int wqe_size;
@@ -19,7 +17,7 @@ Index: libmlx4/src/qp.c
  
  	wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg);
  	switch (type) {
-@@ -636,8 +637,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
+@@ -639,8 +640,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  	}
  
  	qp->sq.max_gs	     = wqe_size / sizeof (struct mlx4_wqe_data_seg);
@@ -31,10 +29,8 @@ Index: libmlx4/src/qp.c
  	cap->max_send_wr     = qp->sq.max_post;
  
  	/*
-Index: libmlx4/src/verbs.c
-===================================================================
---- libmlx4.orig/src/verbs.c	2008-06-04 08:24:45.000000000 +0300
-+++ libmlx4/src/verbs.c	2008-06-04 08:24:49.000000000 +0300
+--- a/src/verbs.c
++++ b/src/verbs.c
 @@ -390,12 +390,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	struct ibv_create_qp_resp resp;
  	struct mlx4_qp		 *qp;
@@ -54,9 +50,9 @@ Index: libmlx4/src/verbs.c
  	    attr->cap.max_inline_data > 1024)
  		return NULL;
  
-@@ -461,8 +463,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
- 	if (ret)
+@@ -464,8 +466,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  		goto err_destroy;
+ 	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
  
 -	qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr;
 +	qp->rq.wqe_cnt = attr->cap.max_recv_wr;
diff --git a/fixes/resize_cq_owner_bit.patch b/fixes/resize_cq_owner_bit.patch
index 6557027..0a5b564 100644
--- a/fixes/resize_cq_owner_bit.patch
+++ b/fixes/resize_cq_owner_bit.patch
@@ -3,11 +3,9 @@ for the target buffer (and not left as it was in the source buffer).
 
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-diff --git a/src/cq.c b/src/cq.c
-index 68e16e9..8226b6b 100644
 --- a/src/cq.c
 +++ b/src/cq.c
-@@ -455,6 +455,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx4_cq *cq, void *buf, int old_cqe)
+@@ -478,6 +478,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx
  	cqe = get_cqe(cq, (i & old_cqe));
  
  	while ((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) != MLX4_CQE_OPCODE_RESIZE) {
diff --git a/fixes/userspace_dev_lims.patch b/fixes/userspace_dev_lims.patch
index 07cf638..80d4d14 100644
--- a/fixes/userspace_dev_lims.patch
+++ b/fixes/userspace_dev_lims.patch
@@ -9,10 +9,8 @@ preferable to breaking the ABI.
     
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/mlx4.c
-===================================================================
---- libmlx4.orig/src/mlx4.c	2008-06-03 15:45:18.000000000 +0300
-+++ libmlx4/src/mlx4.c	2008-06-04 08:24:10.000000000 +0300
+--- a/src/mlx4.c
++++ b/src/mlx4.c
 @@ -104,6 +104,7 @@ static struct ibv_context *mlx4_alloc_co
  	struct ibv_get_context		cmd;
  	struct mlx4_alloc_ucontext_resp resp;
@@ -42,10 +40,8 @@ Index: libmlx4/src/mlx4.c
  err_free:
  	free(context);
  	return NULL;
-Index: libmlx4/src/mlx4.h
-===================================================================
---- libmlx4.orig/src/mlx4.h	2008-06-03 15:45:18.000000000 +0300
-+++ libmlx4/src/mlx4.h	2008-06-04 08:24:10.000000000 +0300
+--- a/src/mlx4.h
++++ b/src/mlx4.h
 @@ -83,6 +83,20 @@
  
  #define PFX		"mlx4: "
diff --git a/fixes/xrc_consolidated_v2.patch b/fixes/xrc_consolidated_v2.patch
index 6fbd0a9..78a4f6c 100644
--- a/fixes/xrc_consolidated_v2.patch
+++ b/fixes/xrc_consolidated_v2.patch
@@ -18,8 +18,6 @@ V2:
 2. Changed xrc_ops to more ops
 3. Check for xrc verbs in ibv_more_ops via AC_CHECK_MEMBER
 
-diff --git a/configure.in b/configure.in
-index 25f27f7..46a3a64 100644
 --- a/configure.in
 +++ b/configure.in
 @@ -42,6 +42,12 @@ AC_CHECK_HEADER(valgrind/memcheck.h,
@@ -35,11 +33,9 @@ index 25f27f7..46a3a64 100644
  
  dnl Checks for library functions
  AC_CHECK_FUNC(ibv_read_sysfs_file, [],
-diff --git a/src/cq.c b/src/cq.c
-index 68e16e9..c598b87 100644
 --- a/src/cq.c
 +++ b/src/cq.c
-@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq 
  {
  	struct mlx4_wq *wq;
  	struct mlx4_cqe *cqe;
@@ -50,7 +46,7 @@ index 68e16e9..c598b87 100644
  	uint32_t g_mlpath_rqpn;
  	uint16_t wqe_index;
  	int is_error;
-@@ -221,20 +223,29 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -221,20 +222,29 @@ static int mlx4_poll_one(struct mlx4_cq 
  	is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
  		MLX4_CQE_OPCODE_ERROR;
  
@@ -84,7 +80,7 @@ index 68e16e9..c598b87 100644
  
  	if (is_send) {
  		wq = &(*cur_qp)->sq;
-@@ -242,6 +254,10 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -242,6 +252,10 @@ static int mlx4_poll_one(struct mlx4_cq 
  		wq->tail += (uint16_t) (wqe_index - (uint16_t) wq->tail);
  		wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)];
  		++wq->tail;
@@ -95,7 +91,7 @@ index 68e16e9..c598b87 100644
  	} else if ((*cur_qp)->ibv_qp.srq) {
  		srq = to_msrq((*cur_qp)->ibv_qp.srq);
  		wqe_index = htons(cqe->wqe_index);
-@@ -387,6 +403,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
+@@ -387,6 +401,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq,
  	uint32_t prod_index;
  	uint8_t owner_bit;
  	int nfreed = 0;
@@ -106,7 +102,7 @@ index 68e16e9..c598b87 100644
  
  	/*
  	 * First we need to find the current producer index, so we
-@@ -405,7 +425,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
+@@ -405,7 +423,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq,
  	 */
  	while ((int) --prod_index - (int) cq->cons_index >= 0) {
  		cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe);
@@ -120,8 +116,6 @@ index 68e16e9..c598b87 100644
  			if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK))
  				mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index));
  			++nfreed;
-diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
-index 20a40c9..1b1253c 100644
 --- a/src/mlx4-abi.h
 +++ b/src/mlx4-abi.h
 @@ -68,6 +68,14 @@ struct mlx4_resize_cq {
@@ -152,8 +146,6 @@ index 20a40c9..1b1253c 100644
 +#endif
 +
  #endif /* MLX4_ABI_H */
-diff --git a/src/mlx4.c b/src/mlx4.c
-index 671e849..27ca75d 100644
 --- a/src/mlx4.c
 +++ b/src/mlx4.c
 @@ -68,6 +68,16 @@ struct {
@@ -173,7 +165,7 @@ index 671e849..27ca75d 100644
  static struct ibv_context_ops mlx4_ctx_ops = {
  	.query_device  = mlx4_query_device,
  	.query_port    = mlx4_query_port,
-@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_
+@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_co
  	for (i = 0; i < MLX4_QP_TABLE_SIZE; ++i)
  		context->qp_table[i].refcnt = 0;
  
@@ -189,7 +181,7 @@ index 671e849..27ca75d 100644
  	for (i = 0; i < MLX4_NUM_DB_TYPE; ++i)
  		context->db_list[i] = NULL;
  
-@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_
+@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_co
  	pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
  
  	context->ibv_ctx.ops = mlx4_ctx_ops;
@@ -199,8 +191,6 @@ index 671e849..27ca75d 100644
  
  	if (mlx4_query_device(&context->ibv_ctx, &dev_attrs))
  		goto query_free;
-diff --git a/src/mlx4.h b/src/mlx4.h
-index 8643d8f..3eadb98 100644
 --- a/src/mlx4.h
 +++ b/src/mlx4.h
 @@ -79,6 +79,11 @@
@@ -248,7 +238,7 @@ index 8643d8f..3eadb98 100644
  	struct mlx4_db_page	       *db_list[MLX4_NUM_DB_TYPE];
  	pthread_mutex_t			db_list_mutex;
  };
-@@ -260,6 +284,11 @@ struct mlx4_ah {
+@@ -266,6 +290,11 @@ struct mlx4_ah {
  	struct mlx4_av			av;
  };
  
@@ -260,7 +250,7 @@ index 8643d8f..3eadb98 100644
  static inline unsigned long align(unsigned long val, unsigned long align)
  {
  	return (val + align - 1) & ~(align - 1);
-@@ -304,6 +333,13 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah)
+@@ -310,6 +339,13 @@ static inline struct mlx4_ah *to_mah(str
  	return to_mxxx(ah, ah);
  }
  
@@ -272,9 +262,9 @@ index 8643d8f..3eadb98 100644
 +#endif
 +
  int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size);
+ int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size);
  void mlx4_free_buf(struct mlx4_buf *buf);
- 
-@@ -350,6 +386,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *srq, int ind);
+@@ -357,6 +393,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *
  int mlx4_post_srq_recv(struct ibv_srq *ibsrq,
  		       struct ibv_recv_wr *wr,
  		       struct ibv_recv_wr **bad_wr);
@@ -285,7 +275,7 @@ index 8643d8f..3eadb98 100644
  
  struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr);
  int mlx4_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
-@@ -380,5 +420,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, struct ibv_ah_attr *attr,
+@@ -387,5 +427,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, st
  void mlx4_free_av(struct mlx4_ah *ah);
  int mlx4_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
  int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
@@ -302,11 +292,9 @@ index 8643d8f..3eadb98 100644
 +
  
  #endif /* MLX4_H */
-diff --git a/src/qp.c b/src/qp.c
-index 01e8580..2f02430 100644
 --- a/src/qp.c
 +++ b/src/qp.c
-@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
+@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, 
  		ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1));
  		qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id;
  
@@ -315,7 +303,7 @@ index 01e8580..2f02430 100644
  			(wr->send_flags & IBV_SEND_SIGNALED ?
  			 htonl(MLX4_WQE_CTRL_CQ_UPDATE) : 0) |
  			(wr->send_flags & IBV_SEND_SOLICITED ?
-@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
+@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, 
  		size = sizeof *ctrl / 16;
  
  		switch (ibqp->qp_type) {
@@ -325,7 +313,7 @@ index 01e8580..2f02430 100644
  		case IBV_QPT_RC:
  		case IBV_QPT_UC:
  			switch (wr->opcode) {
-@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp_cap *cap, enum ibv_qp_type type,
+@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp
  		size += sizeof (struct mlx4_wqe_raddr_seg);
  		break;
  
@@ -333,7 +321,7 @@ index 01e8580..2f02430 100644
  	case IBV_QPT_RC:
  		size += sizeof (struct mlx4_wqe_raddr_seg);
  		/*
-@@ -631,6 +635,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
+@@ -632,6 +636,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  
  	case IBV_QPT_UC:
  	case IBV_QPT_RC:
@@ -341,11 +329,9 @@ index 01e8580..2f02430 100644
  		wqe_size -= sizeof (struct mlx4_wqe_raddr_seg);
  		break;
  
-diff --git a/src/srq.c b/src/srq.c
-index ba2ceb9..1350792 100644
 --- a/src/srq.c
 +++ b/src/srq.c
-@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr,
+@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd
  
  	return 0;
  }
@@ -399,8 +385,6 @@ index ba2ceb9..1350792 100644
 +	pthread_mutex_unlock(&ctx->xrc_srq_table_mutex);
 +}
 +
-diff --git a/src/verbs.c b/src/verbs.c
-index 400050c..b7c9c8e 100644
 --- a/src/verbs.c
 +++ b/src/verbs.c
 @@ -368,18 +368,36 @@ int mlx4_query_srq(struct ibv_srq *srq,
@@ -447,7 +431,7 @@ index 400050c..b7c9c8e 100644
  
  	return 0;
  }
-@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	qp->sq.wqe_cnt = align_queue_size(attr->cap.max_send_wr + qp->sq_spare_wqes);
  	qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr);
  
@@ -456,7 +440,7 @@ index 400050c..b7c9c8e 100644
  		attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0;
  	else {
  		if (attr->cap.max_recv_sge < 1)
-@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	    pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE))
  		goto err_free;
  
@@ -465,7 +449,7 @@ index 400050c..b7c9c8e 100644
  		qp->db = mlx4_alloc_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ);
  		if (!qp->db)
  			goto err_free;
-@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	}
  
  	cmd.buf_addr	    = (uintptr_t) qp->buf.buf;
@@ -474,7 +458,7 @@ index 400050c..b7c9c8e 100644
  		cmd.db_addr = 0;
  	else
  		cmd.db_addr = (uintptr_t) qp->db;
-@@ -485,7 +503,7 @@ err_destroy:
+@@ -489,7 +507,7 @@ err_destroy:
  
  err_rq_db:
  	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
@@ -483,7 +467,7 @@ index 400050c..b7c9c8e 100644
  		mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db);
  
  err_free:
-@@ -544,7 +562,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
+@@ -548,7 +566,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, st
  			mlx4_cq_clean(to_mcq(qp->send_cq), qp->qp_num, NULL);
  
  		mlx4_init_qp_indices(to_mqp(qp));
@@ -492,16 +476,16 @@ index 400050c..b7c9c8e 100644
  			*to_mqp(qp)->db = 0;
  	}
  
-@@ -603,7 +621,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
- 
+@@ -611,7 +629,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
  	mlx4_unlock_cqs(ibqp);
+ 	pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
  
 -	if (!ibqp->srq)
 +	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
  		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
  	free(qp->sq.wrid);
  	if (qp->rq.wqe_cnt)
-@@ -661,3 +679,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid)
+@@ -669,3 +687,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp,
  {
  	return ibv_cmd_detach_mcast(qp, gid, lid);
  }
@@ -605,8 +589,6 @@ index 400050c..b7c9c8e 100644
 +	return 0;
 +}
 +#endif
-diff --git a/src/wqe.h b/src/wqe.h
-index 6f7f309..fa2f8ac 100644
 --- a/src/wqe.h
 +++ b/src/wqe.h
 @@ -65,7 +65,7 @@ struct mlx4_wqe_ctrl_seg {
diff --git a/fixes/xrc_fix_close_domain.patch b/fixes/xrc_fix_close_domain.patch
index dfad7ac..3af2640 100644
--- a/fixes/xrc_fix_close_domain.patch
+++ b/fixes/xrc_fix_close_domain.patch
@@ -6,11 +6,9 @@ Need to pass this upward to caller.
 
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/verbs.c
-===================================================================
---- libmlx4.orig/src/verbs.c	2008-09-01 10:51:11.000000000 +0300
-+++ libmlx4/src/verbs.c	2008-09-01 10:52:40.000000000 +0300
-@@ -774,9 +774,11 @@
+--- a/src/verbs.c
++++ b/src/verbs.c
+@@ -782,9 +782,11 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom
  
  int mlx4_close_xrc_domain(struct ibv_xrc_domain *d)
  {
diff --git a/fixes/xrc_rcv_qp_v2.patch b/fixes/xrc_rcv_qp_v2.patch
index 311c500..00ffd53 100644
--- a/fixes/xrc_rcv_qp_v2.patch
+++ b/fixes/xrc_rcv_qp_v2.patch
@@ -5,11 +5,9 @@ Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 V2:
 1. xrc_ops changed to more_ops
 
-diff --git a/src/mlx4.c b/src/mlx4.c
-index 27ca75d..e5ded78 100644
 --- a/src/mlx4.c
 +++ b/src/mlx4.c
-@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops = {
+@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops
  	.create_xrc_srq   = mlx4_create_xrc_srq,
  	.open_xrc_domain  = mlx4_open_xrc_domain,
  	.close_xrc_domain = mlx4_close_xrc_domain,
@@ -21,11 +19,9 @@ index 27ca75d..e5ded78 100644
  #endif
  };
  #endif
-diff --git a/src/mlx4.h b/src/mlx4.h
-index 3eadb98..6307a2d 100644
 --- a/src/mlx4.h
 +++ b/src/mlx4.h
-@@ -429,6 +429,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context,
+@@ -436,6 +436,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom
  					    int fd, int oflag);
  
  int mlx4_close_xrc_domain(struct ibv_xrc_domain *d);
@@ -47,11 +43,9 @@ index 3eadb98..6307a2d 100644
  #endif
  
  
-diff --git a/src/verbs.c b/src/verbs.c
-index b7c9c8e..8261eae 100644
 --- a/src/verbs.c
 +++ b/src/verbs.c
-@@ -778,4 +778,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc_domain *d)
+@@ -786,4 +786,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc
  	free(d);
  	return 0;
  }
-- 
1.6.3.rc3.12.gb7937


From eli at mellanox.co.il  Mon May 18 01:55:24 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 18 May 2009 11:55:24 +0300
Subject: [ofa-general] [PATCH 1/2] mlx4_core: Use module parameter for number
	of MTTs per segment
Message-ID: <20090518085524.GA16094@mtls03>

The current MTT allocator uses kmalloc to allocate a buffer for its buddy
system implementation and is thus limited in the number of MTT segments that it
can control. As a result, the amount of memory that can be registered is limited
too. This patch uses a module parameter to control the number of MTT entries
that each segment represents, thus allowing more memory to be registered with
the same number of segments.
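
As a rough illustration of the effect described above (not part of the patch),
the sketch below estimates how much memory a given number of segments can map.
It assumes that each MTT entry maps one page; the helper name and the example
figures (2^20 segments, 4 KiB pages) are hypothetical and chosen only to make
the arithmetic visible.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the driver: upper bound on the
 * number of bytes that can be mapped, assuming one MTT entry per page.
 */
static uint64_t max_registrable_bytes(uint64_t num_mtt_segs,
				      unsigned int log_mtts_per_seg,
				      unsigned int page_shift)
{
	/* Same computation as the patch: mtts_per_seg = 1 << log_mtts_per_seg */
	uint64_t mtts_per_seg = 1ULL << log_mtts_per_seg;

	/* total entries * page size */
	return (num_mtt_segs * mtts_per_seg) << page_shift;
}
```

Under these assumptions, raising log_mtts_per_seg from 3 to 5 with the same
2^20 segments raises the bound from 32 GiB to 128 GiB, which is the trade-off
the parameter is meant to expose.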

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/net/mlx4/main.c     |   14 ++++++++++++--
 drivers/net/mlx4/mr.c       |    6 +++---
 drivers/net/mlx4/profile.c  |    2 +-
 include/linux/mlx4/device.h |    1 +
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 30bea96..018348c 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -100,6 +100,10 @@ module_param_named(use_prio, use_prio, bool, 0444);
 MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports "
 		  "(0/1, default 0)");
 
+static int log_mtts_per_seg = ilog2(MLX4_MTT_ENTRY_PER_SEG);
+module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444);
+MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment (1-5)");
+
 int mlx4_check_port_params(struct mlx4_dev *dev,
 			   enum mlx4_port_type *port_type)
 {
@@ -203,12 +207,13 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev->caps.max_cqes	     = dev_cap->max_cq_sz - 1;
 	dev->caps.reserved_cqs	     = dev_cap->reserved_cqs;
 	dev->caps.reserved_eqs	     = dev_cap->reserved_eqs;
+	dev->caps.mtts_per_seg	     = 1 << log_mtts_per_seg;
 	dev->caps.reserved_mtts	     = DIV_ROUND_UP(dev_cap->reserved_mtts,
-						    MLX4_MTT_ENTRY_PER_SEG);
+						    dev->caps.mtts_per_seg);
 	dev->caps.reserved_mrws	     = dev_cap->reserved_mrws;
 	dev->caps.reserved_uars	     = dev_cap->reserved_uars;
 	dev->caps.reserved_pds	     = dev_cap->reserved_pds;
-	dev->caps.mtt_entry_sz	     = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz;
+	dev->caps.mtt_entry_sz	     = dev->caps.mtts_per_seg * dev_cap->mtt_entry_sz;
 	dev->caps.max_msg_sz         = dev_cap->max_msg_sz;
 	dev->caps.page_size_cap	     = ~(u32) (dev_cap->min_page_sz - 1);
 	dev->caps.flags		     = dev_cap->flags;
@@ -1304,6 +1309,11 @@ static int __init mlx4_verify_params(void)
 		return -1;
 	}
 
+	if ((log_mtts_per_seg < 1) || (log_mtts_per_seg > 5)) {
+		printk(KERN_WARNING "mlx4_core: bad log_mtts_per_seg: %d\n", log_mtts_per_seg);
+		return -1;
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 0caf74c..3b8973d 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -209,7 +209,7 @@ int mlx4_mtt_init(struct mlx4_dev *dev, int npages, int page_shift,
 	} else
 		mtt->page_shift = page_shift;
 
-	for (mtt->order = 0, i = MLX4_MTT_ENTRY_PER_SEG; i < npages; i <<= 1)
+	for (mtt->order = 0, i = dev->caps.mtts_per_seg; i < npages; i <<= 1)
 		++mtt->order;
 
 	mtt->first_seg = mlx4_alloc_mtt_range(dev, mtt->order);
@@ -350,7 +350,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 		mpt_entry->pd_flags |= cpu_to_be32(MLX4_MPT_PD_FLAG_FAST_REG |
 						   MLX4_MPT_PD_FLAG_RAE);
 		mpt_entry->mtt_sz    = cpu_to_be32((1 << mr->mtt.order) *
-						   MLX4_MTT_ENTRY_PER_SEG);
+						   dev->caps.mtts_per_seg);
 	} else {
 		mpt_entry->flags    |= cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS);
 	}
@@ -391,7 +391,7 @@ static int mlx4_write_mtt_chunk(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
 	    (start_index + npages - 1) / (PAGE_SIZE / sizeof (u64)))
 		return -EINVAL;
 
-	if (start_index & (MLX4_MTT_ENTRY_PER_SEG - 1))
+	if (start_index & (dev->caps.mtts_per_seg - 1))
 		return -EINVAL;
 
 	mtts = mlx4_table_find(&priv->mr_table.mtt_table, mtt->first_seg +
diff --git a/drivers/net/mlx4/profile.c b/drivers/net/mlx4/profile.c
index cebdf32..bd22df9 100644
--- a/drivers/net/mlx4/profile.c
+++ b/drivers/net/mlx4/profile.c
@@ -98,7 +98,7 @@ u64 mlx4_make_profile(struct mlx4_dev *dev,
 	profile[MLX4_RES_EQ].size     = dev_cap->eqc_entry_sz;
 	profile[MLX4_RES_DMPT].size   = dev_cap->dmpt_entry_sz;
 	profile[MLX4_RES_CMPT].size   = dev_cap->cmpt_entry_sz;
-	profile[MLX4_RES_MTT].size    = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz;
+	profile[MLX4_RES_MTT].size    = dev->caps.mtts_per_seg * dev_cap->mtt_entry_sz;
 	profile[MLX4_RES_MCG].size    = MLX4_MGM_ENTRY_SIZE;
 
 	profile[MLX4_RES_QP].num      = request->num_qp;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 3aff8a6..ce7cc6c 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -210,6 +210,7 @@ struct mlx4_caps {
 	int			num_comp_vectors;
 	int			num_mpts;
 	int			num_mtt_segs;
+	int			mtts_per_seg;
 	int			fmr_reserved_mtts;
 	int			reserved_mtts;
 	int			reserved_mrws;
-- 
1.6.3


From eli at mellanox.co.il  Mon May 18 01:55:51 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 18 May 2009 11:55:51 +0300
Subject: [ofa-general] [PATCH 2/2] ib_mthca: Use module parameter for number
	of MTTs per segment
Message-ID: <20090518085551.GA16106@mtls03>

The current MTT allocator uses kmalloc to allocate a buffer for its buddy
system implementation and is thus limited in the number of MTT segments that it
can control. As a result, the amount of memory that can be registered is limited
too. This patch uses a module parameter to control the number of MTT entries
that each segment represents, thus allowing more memory to be registered with
the same number of segments.
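
The following is a minimal sketch (not part of the patch) of how the parameter
feeds into the segment size and the buddy order, mirroring the expressions
used in the diff below: the segment size in bytes is (1 << log_mtts_per_seg) * 8,
and the order is found by doubling the entries-per-segment count until it
covers the requested number of entries. The function names are hypothetical.

```c
/* Segment size in bytes; the patch treats each MTT entry as 8 bytes. */
static int mtt_seg_size(int log_mtts_per_seg)
{
	return (1 << log_mtts_per_seg) * 8;
}

/*
 * Buddy order for an allocation of 'size' MTT entries: the smallest
 * order such that (entries per segment << order) >= size. This follows
 * the loop in __mthca_alloc_mtt().
 */
static int mtt_order(int seg_size, int size)
{
	int order = 0;
	int i;

	for (i = seg_size / 8; i < size; i <<= 1)
		++order;

	return order;
}
```

For 1024 entries, a 64-byte segment (8 entries) gives order 7, while a
256-byte segment (32 entries) gives order 5, so larger segments need fewer
buddy levels for the same registration.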

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
 drivers/infiniband/hw/mthca/mthca_cmd.c     |    2 +-
 drivers/infiniband/hw/mthca/mthca_dev.h     |    1 +
 drivers/infiniband/hw/mthca/mthca_main.c    |   17 ++++++++++++++---
 drivers/infiniband/hw/mthca/mthca_mr.c      |   16 ++++++++--------
 drivers/infiniband/hw/mthca/mthca_profile.c |    4 ++--
 5 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 6d55f9d..8c2ed99 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -1059,7 +1059,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev *dev,
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET);
 	if (mthca_is_memfree(dev))
 		dev_lim->reserved_mtts = ALIGN((1 << (field >> 4)) * sizeof(u64),
-					       MTHCA_MTT_SEG_SIZE) / MTHCA_MTT_SEG_SIZE;
+					       dev->limits.mtt_seg_size) / dev->limits.mtt_seg_size;
 	else
 		dev_lim->reserved_mtts = 1 << (field >> 4);
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET);
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index 2525901..9ef611f 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -159,6 +159,7 @@ struct mthca_limits {
 	int      reserved_eqs;
 	int      num_mpts;
 	int      num_mtt_segs;
+	int	 mtt_seg_size;
 	int      fmr_reserved_mtts;
 	int      reserved_mtts;
 	int      reserved_mrws;
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 1d83cf7..13da9f1 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -125,6 +125,10 @@ module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444);
 MODULE_PARM_DESC(fmr_reserved_mtts,
 		 "number of memory translation table segments reserved for FMR");
 
+static int log_mtts_per_seg = ilog2(MTHCA_MTT_SEG_SIZE / 8);
+module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444);
+MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment (1-5)");
+
 static char mthca_version[] __devinitdata =
 	DRV_NAME ": Mellanox InfiniBand HCA driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
@@ -162,6 +166,7 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim)
 	int err;
 	u8 status;
 
+	mdev->limits.mtt_seg_size = (1 << log_mtts_per_seg) * 8;
 	err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status);
 	if (err) {
 		mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n");
@@ -460,11 +465,11 @@ static int mthca_init_icm(struct mthca_dev *mdev,
 	}
 
 	/* CPU writes to non-reserved MTTs, while HCA might DMA to reserved mtts */
-	mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * MTHCA_MTT_SEG_SIZE,
-					   dma_get_cache_alignment()) / MTHCA_MTT_SEG_SIZE;
+	mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * mdev->limits.mtt_seg_size,
+					   dma_get_cache_alignment()) / mdev->limits.mtt_seg_size;
 
 	mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base,
-							 MTHCA_MTT_SEG_SIZE,
+							 mdev->limits.mtt_seg_size,
 							 mdev->limits.num_mtt_segs,
 							 mdev->limits.reserved_mtts,
 							 1, 0);
@@ -1315,6 +1320,12 @@ static void __init mthca_validate_profile(void)
 		printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n",
 		       hca_profile.fmr_reserved_mtts);
 	}
+
+	if ((log_mtts_per_seg < 1) || (log_mtts_per_seg > 5)) {
+		printk(KERN_WARNING PFX "bad log_mtts_per_seg (%d). Using default - %d\n",
+		       log_mtts_per_seg, ilog2(MTHCA_MTT_SEG_SIZE / 8));
+		log_mtts_per_seg = ilog2(MTHCA_MTT_SEG_SIZE / 8);
+	}
 }
 
 static int __init mthca_init(void)
diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c
index 882e6b7..d606edf 100644
--- a/drivers/infiniband/hw/mthca/mthca_mr.c
+++ b/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -220,7 +220,7 @@ static struct mthca_mtt *__mthca_alloc_mtt(struct mthca_dev *dev, int size,
 
 	mtt->buddy = buddy;
 	mtt->order = 0;
-	for (i = MTHCA_MTT_SEG_SIZE / 8; i < size; i <<= 1)
+	for (i = dev->limits.mtt_seg_size / 8; i < size; i <<= 1)
 		++mtt->order;
 
 	mtt->first_seg = mthca_alloc_mtt_range(dev, mtt->order, buddy);
@@ -267,7 +267,7 @@ static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
 
 	while (list_len > 0) {
 		mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base +
-					   mtt->first_seg * MTHCA_MTT_SEG_SIZE +
+					   mtt->first_seg * dev->limits.mtt_seg_size +
 					   start_index * 8);
 		mtt_entry[1] = 0;
 		for (i = 0; i < list_len && i < MTHCA_MAILBOX_SIZE / 8 - 2; ++i)
@@ -326,7 +326,7 @@ static void mthca_tavor_write_mtt_seg(struct mthca_dev *dev,
 	u64 __iomem *mtts;
 	int i;
 
-	mtts = dev->mr_table.tavor_fmr.mtt_base + mtt->first_seg * MTHCA_MTT_SEG_SIZE +
+	mtts = dev->mr_table.tavor_fmr.mtt_base + mtt->first_seg * dev->limits.mtt_seg_size +
 		start_index * sizeof (u64);
 	for (i = 0; i < list_len; ++i)
 		mthca_write64_raw(cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT),
@@ -345,10 +345,10 @@ static void mthca_arbel_write_mtt_seg(struct mthca_dev *dev,
 	/* For Arbel, all MTTs must fit in the same page. */
 	BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE);
 	/* Require full segments */
-	BUG_ON(s % MTHCA_MTT_SEG_SIZE);
+	BUG_ON(s % dev->limits.mtt_seg_size);
 
 	mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg +
-				s / MTHCA_MTT_SEG_SIZE, &dma_handle);
+				s / dev->limits.mtt_seg_size, &dma_handle);
 
 	BUG_ON(!mtts);
 
@@ -479,7 +479,7 @@ int mthca_mr_alloc(struct mthca_dev *dev, u32 pd, int buffer_size_shift,
 	if (mr->mtt)
 		mpt_entry->mtt_seg =
 			cpu_to_be64(dev->mr_table.mtt_base +
-				    mr->mtt->first_seg * MTHCA_MTT_SEG_SIZE);
+				    mr->mtt->first_seg * dev->limits.mtt_seg_size);
 
 	if (0) {
 		mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey);
@@ -626,7 +626,7 @@ int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd,
 		goto err_out_table;
 	}
 
-	mtt_seg = mr->mtt->first_seg * MTHCA_MTT_SEG_SIZE;
+	mtt_seg = mr->mtt->first_seg * dev->limits.mtt_seg_size;
 
 	if (mthca_is_memfree(dev)) {
 		mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table,
@@ -908,7 +908,7 @@ int mthca_init_mr_table(struct mthca_dev *dev)
 			 dev->mr_table.mtt_base);
 
 		dev->mr_table.tavor_fmr.mtt_base =
-			ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE);
+			ioremap(addr, mtts * dev->limits.mtt_seg_size);
 		if (!dev->mr_table.tavor_fmr.mtt_base) {
 			mthca_warn(dev, "MTT ioremap for FMR failed.\n");
 			err = -ENOMEM;
diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c
index d168c25..8edb28a 100644
--- a/drivers/infiniband/hw/mthca/mthca_profile.c
+++ b/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -94,7 +94,7 @@ s64 mthca_make_profile(struct mthca_dev *dev,
 	profile[MTHCA_RES_RDB].size  = MTHCA_RDB_ENTRY_SIZE;
 	profile[MTHCA_RES_MCG].size  = MTHCA_MGM_ENTRY_SIZE;
 	profile[MTHCA_RES_MPT].size  = dev_lim->mpt_entry_sz;
-	profile[MTHCA_RES_MTT].size  = MTHCA_MTT_SEG_SIZE;
+	profile[MTHCA_RES_MTT].size  = dev->limits.mtt_seg_size;
 	profile[MTHCA_RES_UAR].size  = dev_lim->uar_scratch_entry_sz;
 	profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE;
 	profile[MTHCA_RES_UARC].size = request->uarc_size;
@@ -232,7 +232,7 @@ s64 mthca_make_profile(struct mthca_dev *dev,
 			dev->limits.num_mtt_segs = profile[i].num;
 			dev->mr_table.mtt_base   = profile[i].start;
 			init_hca->mtt_base       = profile[i].start;
-			init_hca->mtt_seg_sz     = ffs(MTHCA_MTT_SEG_SIZE) - 7;
+			init_hca->mtt_seg_sz     = ffs(dev->limits.mtt_seg_size) - 7;
 			break;
 		case MTHCA_RES_UAR:
 			dev->limits.num_uars       = profile[i].num;
-- 
1.6.3


From PHF at zurich.ibm.com  Mon May 18 02:52:27 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Mon, 18 May 2009 11:52:27 +0200
Subject: [ofa-general] RPATH issue with libibverbs (OFED 1.4)
Message-ID: <OFDB52FACE.17B77D4A-ONC12575BA.00359B7E-C12575BA.00363DDF@ch.ibm.com>

Hi,

I am no longer able to build libibverbs due to an RPATH issue.
Can you give me some advice on how to solve it?

When running the 'install.pl' script, I get the following output:
...
Running  rpmbuild --rebuild  --define '_topdir /var/tmp/OFED_topdir' 
--define 'dist %{nil}' --target x86_64 --define '_prefix /usr' --define 
'_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' 
/root/OFED/1.4/OFED-1.4/SRPMS/libibverbs-1.1.2-1.ofed1.4.src.rpm

Failed to build libibverbs RPM 
See /tmp/OFED.2614.logs/libibverbs.rpmbuild.log 

The last few lines from that log are:
ERROR   0001: file '/usr/bin/ibv_asyncwatch' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_srq_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_devices' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_devinfo' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_rc_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_ud_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR   0001: file '/usr/bin/ibv_uc_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
error: Bad exit status from /var/tmp/rpm-tmp.52084 (%install)

I am running the following Fedora kernel: 2.6.27.21-78.2.41.fc9.x86_64

On another machine with the exact same setup, the installation works fine.

Many thanks and kind regards,
 Philip

-- 
   Philip Frey 
   IBM Zurich Research Laboratory
   Saumerstrasse 4                  |  Phone: +41 44 724 8613
   CH-8803 Rueschlikon/Switzerland  |  Email: phf at zurich.ibm.com

From vlad at lists.openfabrics.org  Mon May 18 03:21:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 18 May 2009 03:21:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090518-0200 daily build status
Message-ID: <20090518102113.2EE37E61348@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From Bert.Wiegers at t-systems-sfr.com  Mon May 18 04:10:42 2009
From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert)
Date: Mon, 18 May 2009 13:10:42 +0200
Subject: [ofa-general] MTU in IPoIB
In-Reply-To: <20090414185748.5ea98ae7@beno.local.bs>
References: <200904112233.51105.bs_lists@aakef.fastmail.fm>
	<f0e08f230904130440mf92a5c8m2746398b6b99d40c@mail.gmail.com>
	<20090414091223.c7911402.weiny2@llnl.gov>
	<20090414185748.5ea98ae7@beno.local.bs>
Message-ID: <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com>

Hi,

In our default setup we are using IPoIB. It is set up with an MTU of 65520:

ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.75.107.32  Bcast:10.75.255.255  Mask:255.255.0.0
          inet6 addr: fe80::214:4fa4:d3ba:25/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:9318 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4197 errors:0 dropped:10 overruns:0 carrier:0
          collisions:0 txqueuelen:4096
          RX bytes:25032362 (23.8 Mb)  TX bytes:636320 (621.4 Kb)


Our dmesg on the other hand shows these hints:


ib_core: module not supported by Novell, setting U taint flag.
ib_mad: module not supported by Novell, setting U taint flag.
ib_mthca: module not supported by Novell, setting U taint flag.
mlx4_ib: module not supported by Novell, setting U taint flag.
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
ib_ipath: module not supported by Novell, setting U taint flag.
cxgb3: module not supported by Novell, setting U taint flag.
iw_cxgb3: module not supported by Novell, setting U taint flag.
ib_umad: module not supported by Novell, setting U taint flag.
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
ib_uverbs: module not supported by Novell, setting U taint flag.
ib_sa: module not supported by Novell, setting U taint flag.
ib_cm: module not supported by Novell, setting U taint flag.
ib_ipoib: module not supported by Novell, setting U taint flag.
ADDRCONF(NETDEV_UP): ib0: link is not ready
ib0: enabling connected mode will cause multicast packet drops
ib0: mtu > 2044 will cause multicast packet drops.
ib0: mtu > 2044 will cause multicast packet drops.
ib1: enabling connected mode will cause multicast packet drops
ib1: mtu > 2044 will cause multicast packet drops.
ib1: mtu > 2044 will cause multicast packet drops.
ib_addr: module not supported by Novell, setting U taint flag.
iw_cm: module not supported by Novell, setting U taint flag.
rdma_cm: module not supported by Novell, setting U taint flag.
ib_sdp: module not supported by Novell, setting U taint flag.
NET: Registered protocol family 27
qlgc_vnic: module not supported by Novell, setting U taint flag.
QLGC_VNIC: Initializing QLogic Corp. Virtual NIC (VNIC) driver version 1.3.0.0.4
rdma_ucm: module not supported by Novell, setting U taint flag.
scsi_transport_iscsi: module not supported by Novell, setting U taint flag.
Loading iSCSI transport class v2.0-869.
libiscsi: module not supported by Novell, setting U taint flag.
iscsi_tcp: module not supported by Novell, setting U taint flag.
iscsi: registered transport (tcp)
ib_iser: module not supported by Novell, setting U taint flag.
iscsi: registered transport (iser)
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
eth0: no IPv6 routers present
ib0: no IPv6 routers present


So should I limit the MTU to 2044?

Thanks.

Bert


From Robert at saq.co.uk  Mon May 18 05:18:40 2009
From: Robert at saq.co.uk (Robert Dunkley)
Date: Mon, 18 May 2009 13:18:40 +0100
Subject: [ofa-general] MTU in IPoIB
References: <200904112233.51105.bs_lists@aakef.fastmail.fm><f0e08f230904130440mf92a5c8m2746398b6b99d40c@mail.gmail.com><20090414091223.c7911402.weiny2@llnl.gov><20090414185748.5ea98ae7@beno.local.bs>
	<5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com>
Message-ID: <C1EAC9C5E752D24C968FF091D446D823362C29@ALTERNATEREALIT>

Hi,

This warning is normal. If you don't need multicast, it is of no
concern at all. If you have an application that uses multicast, you will
have to limit the MTU (in this case you might be better off using
reliable transmission - not connected mode).

Rob

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


From ogerlitz at Voltaire.com  Mon May 18 05:20:19 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Mon, 18 May 2009 15:20:19 +0300
Subject: [ofa-general] MTU in IPoIB
In-Reply-To: <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com>
References: <200904112233.51105.bs_lists@aakef.fastmail.fm>	<f0e08f230904130440mf92a5c8m2746398b6b99d40c@mail.gmail.com>	<20090414091223.c7911402.weiny2@llnl.gov>	<20090414185748.5ea98ae7@beno.local.bs>
	<5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com>
Message-ID: <4A115283.6020803@Voltaire.com>

Wiegers, Bert wrote:
> In our default-setup we are using IPoIB. This is set up with a MTU of 65520
> ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
[...]
> ib0: enabling connected mode will cause multicast packet drops
> ib0: mtu > 2044 will cause multicast packet drops.
> So should I limit the MTU to 2044?

Please take a look at Documentation/infiniband/ipoib.txt, specifically commit b49ca "IPoIB: Document newish features" should help you understand things better.

Or.


From tziporet at mellanox.co.il  Mon May 18 06:06:27 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 18 May 2009 16:06:27 +0300
Subject: [ofa-general] EWG/OFED meeting agenda for today (May 18)
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com>


This is the agenda for today's EWG/OFED meeting:

1. OFED 1.4.1 bugs status and decision on RC6 date

1628    	blo  	andy.grover at oracle.com  	RDS in 1.4.1
cannot connect to RDS in 1.3.1 - I think Andy sent a fix for this
1596 	cri 	Jeffrey.C.Becker at nasa.gov 	openibd stop failed when
nfs is loaded - Jeff B. is working on this - need update
1616 	cri 	swise at opengridcomputing.com 	iommu_alloc error when
running connectathon on ppc64 nfs ... - I think Steve sent a patch for
this
1571 	cri 	vu at mellanox.com 		nfsrdma server crash
@test5 connectathon basic test, - Need update from Vu

We had a problematic RC5 which we deleted. We now wait for bug 1596
resolution

2. New memory registration API - update from Jeff S.

3. OFED 1.5 status update - all

4. Open discussion

Tziporet

From swise at opengridcomputing.com  Mon May 18 07:20:36 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 18 May 2009 09:20:36 -0500
Subject: [ofa-general] Re: EWG/OFED meeting agenda for today (May 18)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com>
Message-ID: <4A116EB4.4060102@opengridcomputing.com>

Tziporet Koren wrote:
>
> This is the agenda for today's EWG/OFED meeting:
>
> 1. OFED 1.4.1 bugs status and decision on RC6 date
>
> 1628 blo andy.grover at oracle.com RDS in 1.4.1 cannot connect to RDS in 
> 1.3.1 - I think Andy sent a fix for this
>
> 1596 cri Jeffrey.C.Becker at nasa.gov openibd stop failed when nfs is 
> loaded - Jeff B. is working on this - need update
>
> 1616 cri swise at opengridcomputing.com iommu_alloc error when running 
> connectathon on ppc64 nfs ... - I think Steve sent a patch for this
>


I did. I just now closed 1616.


From worleys at gmail.com  Mon May 18 09:21:00 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 10:21:00 -0600
Subject: [ofa-general] SRP aggregate bandwidth decreasing as threads increase
Message-ID: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>

I'm seeing peak performance at ~4 threads (1.6GB/s), w/ threads >10
I'm seeing aggregate performance drop significantly.  This is not a
drive issue: locally, the drives get best performance >~32 threads,
and maintain their aggregate way beyond that.

Is there any tunable parameter or source code change in the initiator
or target code that would affect performance with a high thread count?

Thanks,

Chris


From jsquyres at cisco.com  Mon May 18 09:24:48 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 18 May 2009 12:24:48 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adafxfgshgl.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
	<adafxfgshgl.fsf@cisco.com>
Message-ID: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com>

On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:

>  > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
>  > releasing 0x2000-0x2fff.
>
> If everyone is doing this, how do you handle the case that Jason  
> pointed
> out, namely:
>
>  * you register 0x1000 ... 0x3fff
>  * you want to register 0x2000 ... 0x2fff and have a cache hit
>  * you finish up with 0x1000 ... 0x3fff
>  * app does something (which is valid since you finished up with the
>    bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
>    that leads to munmap() or whatever), and your hooks tell you so.
>  * app reallocates a mapping in 0x3000 ... 0x3fff
>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be  
> marked
>    both invalid and in-use in the cache at this point !?
>


Sorry; this mail slipped by me and I just saw it now.

If this can actually happen -- that the mapping of 0x1000 ... 0x3fff  
can change even though it is still registered, then we're screwed --  
we have no way of knowing that this is now invalid (Open MPI, at least  
-- can't speak for others).

Is there a way to detect this condition in userspace?

-- 
Jeff Squyres
Cisco Systems


From worleys at gmail.com  Mon May 18 09:57:10 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 10:57:10 -0600
Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as
	threads increase
In-Reply-To: <C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
Message-ID: <f3177b9e0905180957x4a943f97l8f320fdecbc57885@mail.gmail.com>

On Mon, May 18, 2009 at 10:52 AM, Sufficool, Stanley
<ssufficool at rov.sbcounty.gov> wrote:
> IIRC, The SRP Target code has many context switches that throttle
> performance at higher thread counts.

Can anything be done to reduce the context switches?  Is there, for
example, one thread on the target per user thread that may best be
pinned?

Thanks,

Chris
>
>> -----Original Message-----
>> From: Chris Worley [mailto:worleys at gmail.com]
>> Sent: Monday, May 18, 2009 9:21 AM
>> To: OpenIB; scst-devel
>> Subject: [Scst-devel] SRP aggregate bandwidth decreasing as
>> threads increase
>>
>>
>> I'm seeing peak performance at ~4 threads (1.6GB/s), w/
>> threads >10 I'm seeing aggregate performance drop
>> significantly.  This is not a drive issue: locally, the
>> drives get best performance >~32 threads, and maintain their
>> aggregate way beyond that.
>>
>> Is there any tunable parameter or source code change in the
>> initiator or target code that would effect performance with a
>> high thread count?
>>
>> Thanks,
>>
>> Chris


From ssufficool at rov.sbcounty.gov  Mon May 18 09:52:12 2009
From: ssufficool at rov.sbcounty.gov (Sufficool, Stanley)
Date: Mon, 18 May 2009 09:52:12 -0700
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as
	threads increase
In-Reply-To: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
Message-ID: <C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>

IIRC, The SRP Target code has many context switches that throttle
performance at higher thread counts.

> -----Original Message-----
> From: Chris Worley [mailto:worleys at gmail.com] 
> Sent: Monday, May 18, 2009 9:21 AM
> To: OpenIB; scst-devel
> Subject: [Scst-devel] SRP aggregate bandwidth decreasing as 
> threads increase
> 
> 
> I'm seeing peak performance at ~4 threads (1.6GB/s), w/ 
> threads >10 I'm seeing aggregate performance drop 
> significantly.  This is not a drive issue: locally, the 
> drives get best performance >~32 threads, and maintain their 
> aggregate way beyond that.
> 
> Is there any tunable parameter or source code change in the 
> initiator or target code that would effect performance with a 
> high thread count?
> 
> Thanks,
> 
> Chris
> 
> --------------------------------------------------------------
> ----------------
> Crystal Reports - New Free Runtime and 30 Day Trial
> Check out the new simplified licensing option that enables 
> unlimited royalty-free distribution of the report engine 
> for externally facing server and web deployment. 
> http://p.sf.net/sfu/businessobjects
> _______________________________________________
> Scst-devel mailing list
> Scst-devel at lists.sourceforge.net 
> https://lists.sourceforge.net/lists/listinfo/scst-devel
> 


From bart.vanassche at gmail.com  Mon May 18 10:22:43 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Mon, 18 May 2009 19:22:43 +0200
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
Message-ID: <e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>

On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley
<ssufficool at rov.sbcounty.gov> wrote:
> IIRC, The SRP Target code has many context switches that throttle
> performance at higher thread counts.

Depends on which version of ib_srpt you are using. The ib_srpt kernel
module has a parameter called "thread" which controls whether disk I/O
is handled in a different thread from the one that communicates over
InfiniBand (thread=1) or in the same thread (thread=0). For older
versions of the ib_srpt kernel module the default was thread=1, which
indeed caused a lot of context switches. On December 3, 2008 (SCST
Subversion revision 594) the default was changed from thread=1 to
thread=0 because the latter results in better performance.

Bart.


From caitlin.bestler at gmail.com  Mon May 18 11:02:23 2009
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 18 May 2009 11:02:23 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
	<adafxfgshgl.fsf@cisco.com>
	<730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com>
Message-ID: <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com>

On Mon, May 18, 2009 at 9:24 AM, Jeff Squyres <jsquyres at cisco.com> wrote:
> On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote:
>
>>  > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
>>  > releasing 0x2000-0x2fff.
>>
>> If everyone is doing this, how do you handle the case that Jason pointed
>> out, namely:
>>
>>  * you register 0x1000 ... 0x3fff
>>  * you want to register 0x2000 ... 0x2fff and have a cache hit
>>  * you finish up with 0x1000 ... 0x3fff
>>  * app does something (which is valid since you finished up with the
>>   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
>>   that leads to munmap() or whatever), and your hooks tell you so.
>>  * app reallocates a mapping in 0x3000 ... 0x3fff
>>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
>>   both invalid and in-use in the cache at this point !?
>>
>
>
> Sorry; this mail slipped by me and I just saw it now.
>
> If this can actually happen -- that the mapping of 0x1000 ... 0x3fff can
> change even though it is still registered, then we're screwed -- we have no
> way of knowing that this is now invalid (Open MPI, at least -- can't speak
> for others).
>
> Is there a way to detect condition this in userspace?
>
How does 0x1000 to 0x3fff get registered as a single Memory Region?
If it is legitimate to free() 0x3000..0x3fff then how can there ever be a
legitimate reference to 0x1000..0x3fff? If there is no such single reference,
I don't see how a Memory Region is ever created covering that range.

If the user creates the Memory Region, then they are responsible for not
free()ing a portion of it.

Would the MPI library ever create a single large memory region based on
two distinct Sends?


From generationgnu at yahoo.com  Mon May 18 11:04:10 2009
From: generationgnu at yahoo.com (Sam Haxor)
Date: Mon, 18 May 2009 11:04:10 -0700 (PDT)
Subject: [Scst-devel] [ofa-general] RE: SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
Message-ID: <977100.12599.qm@web111918.mail.gq1.yahoo.com>


If we pin the thread to a CPU (say CPU-X), can we pass a hint to the BE driver to process the response completion on the same CPU-X? That way the CPU cache will be used efficiently. I don't know how the IB driver works, so I can't contribute code right now. But this is how a sample implementation could look:

BE driver creates the following - 

QUEUES[NR_CPUS];
QUEUES {
CMD QUEUE[SOME_QUEUE_DEPTH];
RSP   QUEUE[SOME_QUEUE_DEPTH];
};

1) scsi-mid down-calls BE driver, and also passes a hint aka 'thread-CPU-X'.
2) BE transmits cmd on QUEUES[thread-CPU-X]->CMD QUEUE[slot_index];
3) The adapter(HBA/HCA) will interrupt the BE driver on the 'thread-CPU-X'.
    3.1) Now it is the BE driver's responsibility to affinitize the response-draining with the corresponding CPU @ driver load
           time.
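
The outline above can be sketched in plain C as one command/response
queue pair per CPU. This is only an illustration under stated
assumptions: NR_CPUS and QUEUE_DEPTH are made-up values, the be_*
functions and the ring structure are hypothetical names, and the
"adapter" is simulated by be_complete(). No locking or interrupt
handling is shown.

```c
#include <stddef.h>

#define NR_CPUS     4   /* hypothetical CPU count */
#define QUEUE_DEPTH 8   /* hypothetical per-queue depth */

/* Fixed-size FIFO of command tags. */
struct ring {
    int    slots[QUEUE_DEPTH];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of entries queued */
};

/* One command queue and one response queue per CPU. */
struct cpu_queues {
    struct ring cmd;
    struct ring rsp;
};

static struct cpu_queues queues[NR_CPUS];

static int ring_push(struct ring *r, int tag)
{
    if (r->count == QUEUE_DEPTH)
        return -1;                      /* queue full */
    r->slots[(r->head + r->count) % QUEUE_DEPTH] = tag;
    r->count++;
    return 0;
}

static int ring_pop(struct ring *r, int *tag)
{
    if (r->count == 0)
        return -1;                      /* queue empty */
    *tag = r->slots[r->head];
    r->head = (r->head + 1) % QUEUE_DEPTH;
    r->count--;
    return 0;
}

/* Steps 1-2: the mid-layer passes the submitting CPU as a hint and the
 * command goes on that CPU's command queue. */
static int be_submit(int cpu_hint, int tag)
{
    if (cpu_hint < 0 || cpu_hint >= NR_CPUS)
        return -1;
    return ring_push(&queues[cpu_hint].cmd, tag);
}

/* Step 3: stands in for the adapter finishing a command; the response
 * is posted to the response queue of the same CPU. */
static int be_complete(int cpu)
{
    int tag;

    if (cpu < 0 || cpu >= NR_CPUS || ring_pop(&queues[cpu].cmd, &tag) != 0)
        return -1;
    return ring_push(&queues[cpu].rsp, tag);
}

/* Step 3.1: responses are drained on the CPU that issued the command. */
static int be_drain(int cpu, int *tag)
{
    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;
    return ring_pop(&queues[cpu].rsp, tag);
}
```

A command submitted with be_submit(2, tag) only ever shows up in
be_drain(2, ...), never on another CPU's response queue.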

Ciao


----- Original Message ----
> From: Bart Van Assche <bart.vanassche at gmail.com>
> To: "Sufficool, Stanley" <ssufficool at rov.sbcounty.gov>
> Cc: Chris Worley <worleys at gmail.com>; scst-devel <scst-devel at lists.sourceforge.net>; OpenIB <general at lists.openfabrics.org>
> Sent: Monday, May 18, 2009 1:22:43 PM
> Subject: Re: [Scst-devel] [ofa-general] RE: SRP aggregate bandwidth decreasing as threads increase
> 
> On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley
> wrote:
> > IIRC, The SRP Target code has many context switches that throttle
> > performance at higher thread counts.
> 
> Depends on which version of ib_srpt you are using. The ib_srpt kernel
> module has a parameter called "thread" which allows to control whether
> disk I/O is handled in another thread than the one that communicates
> over InfiniBand (thread=1) or in the same thread (thread=0). For older
> versions of the ib_srpt kernel module the default was thread=1, which
> caused indeed a lot of context switches. On December 3, 2008 (SCST
> Subversion revision 594) the default has been changed from thread=1 to
> thread=0 because the latter results in better performance.
> 
> Bart.
> 


From vst at vlnb.net  Mon May 18 11:23:22 2009
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Mon, 18 May 2009 22:23:22 +0400
Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as
	threads increase
In-Reply-To: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
Message-ID: <4A11A79A.8070204@vlnb.net>

Chris Worley, on 05/18/2009 08:21 PM wrote:
> I'm seeing peak performance at ~4 threads (1.6GB/s), w/ threads >10
> I'm seeing aggregate performance drop significantly.  This is not a
> drive issue: locally, the drives get best performance >~32 threads,
> and maintain their aggregate way beyond that.
> 
> Is there any tunable parameter or source code change in the initiator
> or target code that would effect performance with a high thread count?

Check README of ib_srpt from the SCST SVN trunk.

> Thanks,
> 
> Chris
> 


From jsquyres at cisco.com  Mon May 18 11:24:33 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 18 May 2009 14:24:33 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
	<adafxfgshgl.fsf@cisco.com>
	<730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com>
	<469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com>
Message-ID: <167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com>

On May 18, 2009, at 2:02 PM, Caitlin Bestler wrote:

> >>  > Specifically: the actual dereg of 0x1000-0x3fff is blocked on  
> also
> >>  > releasing 0x2000-0x2fff.
> >>
> >> If everyone is doing this, how do you handle the case that Jason  
> pointed
> >> out, namely:
> >>
> >>  * you register 0x1000 ... 0x3fff
> >>  * you want to register 0x2000 ... 0x2fff and have a cache hit
> >>  * you finish up with 0x1000 ... 0x3fff
> >>  * app does something (which is valid since you finished up with  
> the
> >>   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg  
> free()
> >>   that leads to munmap() or whatever), and your hooks tell you so.
> >>  * app reallocates a mapping in 0x3000 ... 0x3fff
> >>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be  
> marked
> >>   both invalid and in-use in the cache at this point !?
>

I think I mis-parsed the above scenario in my previous response.

When our memory hooks tell us that memory is about to be removed from  
the process, we unregister all pages in the relevant region and remove  
those entries from the cache.  So the next time you look in the cache  
for 0x3000-0x3fff, it won't be there -- it'll be treated as cache-cold.
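
A rough sketch of that invalidation step, not Open MPI's actual code:
the structure and function names are hypothetical, and dropping every
entry that overlaps the released range (rather than splitting it) is an
assumption made to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16  /* hypothetical cache capacity */

struct reg_entry {
    uintptr_t start;    /* first byte of the registered range */
    uintptr_t end;      /* last byte of the registered range */
    bool      valid;
};

struct reg_cache {
    struct reg_entry e[MAX_ENTRIES];
};

/* Record a registration; returns -1 if the cache is full. */
static int cache_add(struct reg_cache *c, uintptr_t start, uintptr_t end)
{
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (!c->e[i].valid) {
            c->e[i].start = start;
            c->e[i].end = end;
            c->e[i].valid = true;
            return 0;
        }
    }
    return -1;
}

/* True if some valid entry fully covers [start, end]. */
static bool cache_hit(const struct reg_cache *c, uintptr_t start, uintptr_t end)
{
    for (size_t i = 0; i < MAX_ENTRIES; i++)
        if (c->e[i].valid && c->e[i].start <= start && end <= c->e[i].end)
            return true;
    return false;
}

/* Called from the memory hook before [start, end] leaves the process.
 * Every overlapping entry is dropped; a real cache would also
 * deregister the memory region here. Returns the number dropped. */
static int cache_invalidate(struct reg_cache *c, uintptr_t start, uintptr_t end)
{
    int dropped = 0;

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (c->e[i].valid && c->e[i].start <= end && start <= c->e[i].end) {
            c->e[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

After cache_add(&c, 0x1000, 0x3fff), a hook call to
cache_invalidate(&c, 0x3000, 0x3fff) makes later lookups in that range
miss, so the next send has to register from scratch.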

> How does 0x1000 to 0x3fff get registered as a single Memory Region?
> If it is legitimate to free() 0x3000..0x3fff then how can there ever  
> be a
> legitimate reference to 0x1000..0x3fff? If there is no such single  
> reference,
> I don't see how a Memory Region is every created covering that range.
>
> If the user creates the Memory Region, then they are responsible for  
> not
> free()ing a portion of it.
>

Agreed.  If an application does that, it deserves what it gets.

> Would the MPI library ever create a single large memory region based  
> on
> two distinct Sends?
>


Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
potentially a multiple of pages.  So yes, you could end up having a  
single registration that spans the buffers used in multiple, distinct  
MPI sends.  We reference count by page to ensure that deregistrations  
do not occur prematurely.

For example, suppose page X contains the end of one large buffer and the
beginning of another, both of which are being used in ongoing
non-blocking MPI communications.  Then page X's entry in our cache will
have a refcount == 2.  OMPI won't allow the registration containing
that page to become eligible for deregistering until the cache entry's
refcount goes down to 0.
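
The per-page counting can be illustrated with a small sketch. This is
not Open MPI's implementation: the 4 KiB page size, the fixed table of
64 pages, and all function names are assumptions chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */
#define NPAGES     64   /* hypothetical: only pages 0..63 are tracked */

static unsigned page_refs[NPAGES];

static size_t page_of(uintptr_t addr)
{
    return (size_t)(addr >> PAGE_SHIFT);
}

/* Take one reference on every page touched by [addr, addr + len). */
static void buf_acquire(uintptr_t addr, size_t len)
{
    for (size_t p = page_of(addr); p <= page_of(addr + len - 1); p++)
        page_refs[p]++;
}

/* Drop the references taken by buf_acquire() for the same buffer. */
static void buf_release(uintptr_t addr, size_t len)
{
    for (size_t p = page_of(addr); p <= page_of(addr + len - 1); p++)
        page_refs[p]--;
}

/* A registration spanning pages first..last may be deregistered only
 * when no page in it is still referenced. */
static bool reg_can_dereg(size_t first, size_t last)
{
    for (size_t p = first; p <= last; p++)
        if (page_refs[p] != 0)
            return false;
    return true;
}
```

With one buffer ending in page 2 and another starting in page 2, that
page's count is 2 while both sends are in flight, and reg_can_dereg()
stays false until both buffers have been released.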

See my prior mail for a more complex example of our cache's behavior.

-- 
Jeff Squyres
Cisco Systems


From worleys at gmail.com  Mon May 18 11:40:58 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 12:40:58 -0600
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
Message-ID: <f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>

On Mon, May 18, 2009 at 11:22 AM, Bart Van Assche
<bart.vanassche at gmail.com> wrote:
> On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley
> <ssufficool at rov.sbcounty.gov> wrote:
>> IIRC, The SRP Target code has many context switches that throttle
>> performance at higher thread counts.
>
> Depends on which version of ib_srpt you are using. The ib_srpt kernel
> module has a parameter called "thread" which allows to control whether
> disk I/O is handled in another thread than the one that communicates
> over InfiniBand (thread=1) or in the same thread (thread=0). For older
> versions of the ib_srpt kernel module the default was thread=1, which
> caused indeed a lot of context switches. On December 3, 2008 (SCST
> Subversion revision 594) the default has been changed from thread=1 to
> thread=0 because the latter results in better performance.

I won't have access to the targets until tomorrow (at which point I
may not have internet access), so I'm trying to gather a few possible
solutions today.

I'm using a very recent version of the SCST target code; it would only
be ~1 month old.  So, I'm guessing I have the "thread=0" code.  Maybe,
for a high thread count, this needs to be "=1"?

Is there a way to control the number of threads once "thread=1" is
set?  Does it spawn one thread per initiator thread?

Any other ideas of things to try?

Thanks,

Chris
>
> Bart.
>


From worleys at gmail.com  Mon May 18 11:56:27 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 12:56:27 -0600
Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as
	threads increase
In-Reply-To: <4A11A79A.8070204@vlnb.net>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<4A11A79A.8070204@vlnb.net>
Message-ID: <f3177b9e0905181156v61999846mb9440f501731c450@mail.gmail.com>

On Mon, May 18, 2009 at 12:23 PM, Vladislav Bolkhovitin <vst at vlnb.net> wrote:
> Chris Worley, on 05/18/2009 08:21 PM wrote:
>>
>> I'm seeing peak performance at ~4 threads (1.6GB/s), w/ threads >10
>> I'm seeing aggregate performance drop significantly.  This is not a
>> drive issue: locally, the drives get best performance >~32 threads,
>> and maintain their aggregate way beyond that.
>>
>> Is there any tunable parameter or source code change in the initiator
>> or target code that would effect performance with a high thread count?
>
> Check README of ib_srpt from the SCST SVN trunk.

There are 42 READMEs in scst.  Do you mean the one in scst/trunk/srpt,
which talks of three performance issues:

1) Minimizing QUEUEFULL conditions.
2) Setting IRQ affinity on the drives.
3) Setting "thread=1".

?

Chris


From arlin.r.davis at intel.com  Mon May 18 12:07:30 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Mon, 18 May 2009 12:07:30 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self
	instead of getpid for debug messages
Message-ID: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>


getpid returns the process id, which is the same for every thread in
a process. Use unique thread ids in debug messages to help isolate
issues across many device opens with multiple CM threads.

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 dapl/common/dapl_debug.c    |    2 +-
 dapl/udapl/linux/dapl_osd.h |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index ba33cfc..20ee405 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -49,7 +49,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...)
 	if (type & g_dapl_dbg_type) {
 		if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
 			va_start(args, fmt);
-			fprintf(stdout, "%s:%d: ", _ptr_host_,
+			fprintf(stdout, "%s:%lx: ", _ptr_host_,
 				dapl_os_getpid());
 			dapl_os_vprintf(fmt, args);
 			va_end(args);
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 1c098c5..0378a70 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args)	vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args)	vsyslog(LOG_USER|LOG_WARNING,fmt,args)
 
-#define dapl_os_getpid getpid
-
+#define dapl_os_getpid (long int)pthread_self 
 
 #endif /*  _DAPL_OSD_H_ */
 
-- 
1.5.2.5


From arlin.r.davis at intel.com  Mon May 18 12:07:38 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 18 May 2009 12:07:38 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) dtest: add connection timers on
	client side
Message-ID: <E3280858FA94444CA49D2BA02341C983521A9F88@orsmsx506.amr.corp.intel.com>


Add timers for active connections and print
results. Allow polling or wait on conn event.

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 test/dtest/dtest.c |   34 ++++++++++++++++++++++++----------
 1 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index 6ff7798..f1f0f2b 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -183,6 +183,7 @@ struct dt_time {
 	double rdma_rd_total;
 	double rtt;
 	double close;
+	double conn;
 };
 
 struct dt_time time;
@@ -197,6 +198,7 @@ static int verbose = 0;
 static int polling = 0;
 static int poll_count = 0;
 static int rdma_wr_poll_count = 0;
+static int conn_poll_count = 0;
 static int rdma_rd_poll_count[MAX_RDMA_RD] = { 0 };
 static int delay = 0;
 static int buf_len = RDMA_BUFFER_SIZE;
@@ -617,6 +619,9 @@ complete:
 	}
 	printf("%d: EP create: %10.2lf usec\n", getpid(), time.epc);
 	printf("%d: EP free:   %10.2lf usec\n", getpid(), time.epf);
+	if (!server)
+		printf("%d: connect:   %10.2lf usec, poll_cnt=%d\n", 
+		       getpid(), time.conn, conn_poll_count);
 	printf("%d: TOTAL:     %10.2lf usec\n", getpid(), time.total);
 
 #if defined(_WIN32) || defined(_WIN64)
@@ -843,6 +848,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 	/* setup receive rdma buffer to initial string to be overwritten */
 	strcpy((char *)rbuf, "blah, blah, blah\n");
 
+	/* clear event structure */
+	memset(&event, 0, sizeof(DAT_EVENT));
+
 	if (server) {		/* SERVER */
 
 		/* create the service point for server listen */
@@ -962,6 +970,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 			pdata[i] = i + 1;
 
 		LOGPRINTF("%d Connecting to server\n", getpid());
+        	start = get_time();
 		ret = dat_ep_connect(h_ep,
 				     &remote_addr,
 				     conn_id,
@@ -979,14 +988,18 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 
 	printf("%d Waiting for connect response\n", getpid());
 
-	ret = dat_evd_wait(h_conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore);
-	if (ret != DAT_SUCCESS) {
-		fprintf(stderr, "%d Error dat_evd_wait: %s\n",
-			getpid(), DT_RetToString(ret));
-		return (ret);
-	} else
-		LOGPRINTF("%d dat_evd_wait for h_conn_evd completed\n",
-			  getpid());
+	if (polling) 
+		while (DAT_GET_TYPE(dat_evd_dequeue(h_conn_evd, &event)) == 
+		       DAT_QUEUE_EMPTY)
+			conn_poll_count++;
+	else 
+		ret = dat_evd_wait(h_conn_evd, DAT_TIMEOUT_INFINITE, 
+				   1, &event, &nmore);
+
+	if (!server) {
+        	stop = get_time();
+        	time.conn += ((stop - start) * 1.0e6);
+	}
 
 #ifdef TEST_REJECT_WITH_PRIVATE_DATA
 	if (event.event_number != DAT_CONNECTION_EVENT_PEER_REJECTED) {
@@ -1012,8 +1025,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id)
 #endif
 
 	if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) {
-		fprintf(stderr, "%d Error unexpected conn event : %s\n",
-			getpid(), DT_EventToSTr(event.event_number));
+		fprintf(stderr, "%d Error unexpected conn event : 0x%x %s\n",
+			getpid(), event.event_number,
+			DT_EventToSTr(event.event_number));
 		return (DAT_ABORT);
 	}
 
-- 
1.5.2.5


From arlin.r.davis at intel.com  Mon May 18 12:08:24 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 18 May 2009 12:08:24 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) scm: multi-hca CM processing
 broken. Need cr thread wakeup mechanism per HCA.
Message-ID: <E3280858FA94444CA49D2BA02341C983521A9F8D@orsmsx506.amr.corp.intel.com>


Currently there is only one pipe across all
device opens. This results in some posted CR work
getting delayed or not processed at all. Provide a
pipe for each device open, with the cr thread created
and managed on a per-device level.

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 dapl/openib_scm/dapl_ib_cm.c   |   23 ++++++------
 dapl/openib_scm/dapl_ib_util.c |   74 +++++++++++++++++++++++++---------------
 dapl/openib_scm/dapl_ib_util.h |    1 +
 3 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c
index a2b02eb..9cad5be 100644
--- a/dapl/openib_scm/dapl_ib_cm.c
+++ b/dapl/openib_scm/dapl_ib_cm.c
@@ -54,8 +54,6 @@
 #include "dapl_ib_util.h"
 #include "dapl_osd.h"
 
-extern DAPL_SOCKET g_scm[2];
-
 #if defined(_WIN32) || defined(_WIN64)
 enum DAPL_FD_EVENTS {
 	DAPL_FD_READ = 0x1,
@@ -282,7 +280,7 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr)
 	dapl_os_unlock(&cm_ptr->lock);
 
 	/* wakeup work thread */
-	if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+	if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
 		dapl_log(DAPL_DBG_TYPE_CM,
 			 " cm_destroy: thread wakeup error = %s\n",
 			 strerror(errno));
@@ -299,7 +297,7 @@ static void dapli_cm_queue(struct ib_cm_handle *cm_ptr)
 	dapl_os_unlock(&cm_ptr->hca->ib_trans.lock);
 
 	/* wakeup CM work thread */
-	if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+	if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
 		dapl_log(DAPL_DBG_TYPE_CM,
 			 " cm_queue: thread wakeup error = %s\n",
 			 strerror(errno));
@@ -1210,7 +1208,8 @@ dapls_ib_remove_conn_listener(IN DAPL_IA * ia_ptr, IN DAPL_SP * sp_ptr)
 		/* cr_thread will free */
 		cm_ptr->state = SCM_DESTROY;
 		sp_ptr->cm_srvc_handle = NULL;
-		if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+		if (send(cm_ptr->hca->ib_trans.scm[1], 
+			 "w", sizeof "w", 0) == -1)
 			dapl_log(DAPL_DBG_TYPE_CM,
 				 " cm_destroy: thread wakeup error = %s\n",
 				 strerror(errno));
@@ -1312,7 +1311,7 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr,
 
 	/* cr_thread will destroy CR */
 	cm_ptr->state = SCM_REJECTED;
-	if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+	if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
 		dapl_log(DAPL_DBG_TYPE_CM,
 			 " cm_destroy: thread wakeup error = %s\n",
 			 strerror(errno));
@@ -1552,7 +1551,7 @@ void cr_thread(void *arg)
 
 	while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) {
 		dapl_fd_zero(set);
-		dapl_fd_set(g_scm[0], set, DAPL_FD_READ);
+		dapl_fd_set(hca_ptr->ib_trans.scm[0], set, DAPL_FD_READ);
 
 		if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
 			next_cr = dapl_llist_peek_head(&hca_ptr->ib_trans.list);
@@ -1652,9 +1651,8 @@ void cr_thread(void *arg)
 						    &cr->dst.ia_address)->
 						   sin_addr));
 
-				/* POLLUP, NVAL, or poll error, issue event if connected */
-				if (cr->state == SCM_CONNECTED)
-					dapli_socket_disconnect(cr);
+				/* POLLUP, NVAL, or poll error. - DISC */
+				dapli_socket_disconnect(cr);
 			}
 
 			dapl_os_lock(&hca_ptr->ib_trans.lock);
@@ -1664,8 +1662,9 @@ void cr_thread(void *arg)
 		dapl_select(set);
 
 		/* if pipe used to wakeup, consume */
-		while (dapl_poll(g_scm[0], DAPL_FD_READ) == DAPL_FD_READ) {
-			if (recv(g_scm[0], rbuf, 2, 0) == -1)
+		while (dapl_poll(hca_ptr->ib_trans.scm[0], 
+				 DAPL_FD_READ) == DAPL_FD_READ) {
+			if (recv(hca_ptr->ib_trans.scm[0], rbuf, 2, 0) == -1)
 				dapl_log(DAPL_DBG_TYPE_CM,
 					 " cr_thread: read pipe error = %s\n",
 					 strerror(errno));
diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c
index c95b0c2..30c71fa 100644
--- a/dapl/openib_scm/dapl_ib_util.c
+++ b/dapl/openib_scm/dapl_ib_util.c
@@ -58,7 +58,6 @@ static const char rcsid[] = "$Id:  $";
 #include <stdlib.h>
 
 int g_dapl_loopback_connection = 0;
-DAPL_SOCKET g_scm[2];
 
 enum ibv_mtu dapl_ib_mtu(int mtu)
 {
@@ -138,22 +137,7 @@ static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR * addr, int addr_len)
 	return ret;
 }
 
-/*
- * dapls_ib_init, dapls_ib_release
- *
- * Initialize Verb related items for device open
- *
- * Input:
- * 	none
- *
- * Output:
- *	none
- *
- * Returns:
- * 	0 success, -1 error
- *
- */
-int32_t dapls_ib_init(void)
+static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr)
 {
 	DAPL_SOCKET listen_socket;
 	struct sockaddr_in addr;
@@ -179,32 +163,58 @@ int32_t dapls_ib_init(void)
 	if (ret)
 		goto err1;
 
-	g_scm[1] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
-	if (g_scm[1] == DAPL_INVALID_SOCKET)
+	hca_ptr->ib_trans.scm[1] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	if (hca_ptr->ib_trans.scm[1] == DAPL_INVALID_SOCKET)
 		goto err1;
 
-	ret = connect(g_scm[1], (struct sockaddr *)&addr, sizeof(addr));
+	ret = connect(hca_ptr->ib_trans.scm[1], 
+		      (struct sockaddr *)&addr, sizeof(addr));
 	if (ret)
 		goto err2;
 
-	g_scm[0] = accept(listen_socket, NULL, NULL);
-	if (g_scm[0] == DAPL_INVALID_SOCKET)
+	hca_ptr->ib_trans.scm[0] = accept(listen_socket, NULL, NULL);
+	if (hca_ptr->ib_trans.scm[0] == DAPL_INVALID_SOCKET)
 		goto err2;
 
 	closesocket(listen_socket);
 	return 0;
 
       err2:
-	closesocket(g_scm[1]);
+	closesocket(hca_ptr->ib_trans.scm[1]);
       err1:
 	closesocket(listen_socket);
 	return 1;
 }
 
+static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr)
+{
+	closesocket(hca_ptr->ib_trans.scm[0]);
+	closesocket(hca_ptr->ib_trans.scm[1]);
+}
+
+
+/*
+ * dapls_ib_init, dapls_ib_release
+ *
+ * Initialize Verb related items for device open
+ *
+ * Input:
+ * 	none
+ *
+ * Output:
+ *	none
+ *
+ * Returns:
+ * 	0 success, -1 error
+ *
+ */
+int32_t dapls_ib_init(void)
+{
+	return 0;
+}
+
 int32_t dapls_ib_release(void)
 {
-	closesocket(g_scm[0]);
-	closesocket(g_scm[1]);
 	return 0;
 }
 
@@ -382,6 +392,14 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr)
 	/* initialize CM list for listens on this HCA */
 	dapl_llist_init_head(&hca_ptr->ib_trans.list);
 
+	/* initialize pipe, user level wakeup on select */
+	if (create_cr_pipe(hca_ptr)) {
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " open_hca: failed to init cr pipe - %s\n",
+			 strerror(errno));
+		goto bail;
+	}
+
 	/* create thread to process inbound connect request */
 	hca_ptr->ib_trans.cr_state = IB_THREAD_INIT;
 	dat_status = dapl_os_thread_create(cr_thread,
@@ -455,21 +473,21 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr)
 
 	/* destroy cr_thread and lock */
 	hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL;
-	if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+	if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
 		dapl_log(DAPL_DBG_TYPE_UTIL,
 			 " thread_destroy: thread wakeup err = %s\n",
 			 strerror(errno));
 	while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) {
 		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
 			     " close_hca: waiting for cr_thread\n");
-		if (send(g_scm[1], "w", sizeof "w", 0) == -1)
+		if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1)
 			dapl_log(DAPL_DBG_TYPE_UTIL,
 				 " thread_destroy: thread wakeup err = %s\n",
 				 strerror(errno));
 		dapl_os_sleep_usec(2000);
 	}
 	dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
-
+	destroy_cr_pipe(hca_ptr); /* no longer need pipe */
 	return (DAT_SUCCESS);
 }
 
diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h
index 5493312..e924572 100644
--- a/dapl/openib_scm/dapl_ib_util.h
+++ b/dapl/openib_scm/dapl_ib_util.h
@@ -304,6 +304,7 @@ typedef struct _ib_hca_transport
 	uint8_t			tclass;
 	uint8_t			mtu;
 	DAT_NAMED_ATTR		named_attr;
+	DAPL_SOCKET		scm[2];
 } ib_hca_transport_t;
 
 /* provider specfic fields for shared memory support */
-- 
1.5.2.5


From arlin.r.davis at intel.com  Mon May 18 12:08:41 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 18 May 2009 12:08:41 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) windows: add build files for
 openib_scm, remove /Wp64 build option.
Message-ID: <E3280858FA94444CA49D2BA02341C983521A9F8F@orsmsx506.amr.corp.intel.com>


Add build files for windows socket cm and change build
option on windows providers. The new Win7 WDK issues a
deprecated compiler option warning for /Wp64
(Enable 64-bit porting warnings).

Signed-off-by: Arlin Davis <arlin.r.davis at intel.com>
---
 dapl/openib_cma/SOURCES  |    2 +-
 dapl/openib_scm/SOURCES  |   53 ++++++++++++++++++++++++++++++++++++++++++++++
 dapl/openib_scm/udapl.rc |   48 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletions(-)
 create mode 100644 dapl/openib_scm/SOURCES
 create mode 100644 dapl/openib_scm/udapl.rc

diff --git a/dapl/openib_cma/SOURCES b/dapl/openib_cma/SOURCES
index 29e836e..e59ef35 100644
--- a/dapl/openib_cma/SOURCES
+++ b/dapl/openib_cma/SOURCES
@@ -53,4 +53,4 @@ TARGETLIBS= \
 	$(TARGETPATH)\*\librdmacmd.lib
 !endif
 
-MSC_WARNING_LEVEL = /W1 /wd4113 /Wp64
+MSC_WARNING_LEVEL = /W1 /wd4113
diff --git a/dapl/openib_scm/SOURCES b/dapl/openib_scm/SOURCES
new file mode 100644
index 0000000..f9204d9
--- /dev/null
+++ b/dapl/openib_scm/SOURCES
@@ -0,0 +1,53 @@
+!if $(FREEBUILD)
+TARGETNAME=dapl2-ofa-scm
+!else
+TARGETNAME=dapl2-ofa-scmd
+!endif
+
+TARGETPATH = ..\..\..\..\bin\user\obj$(BUILD_ALT_DIR)
+TARGETTYPE = DYNLINK
+DLLENTRY = _DllMainCRTStartup
+
+!if $(_NT_TOOLS_VERSION) == 0x700
+DLLDEF=$O\udapl_ofa_scm_exports.def
+!else
+DLLDEF=$(OBJ_PATH)\$O\udapl_ofa_scm_exports.def
+!endif
+
+USE_MSVCRT = 1
+
+SOURCES = \
+	udapl.rc \
+	..\dapl_common_src.c	\
+	..\dapl_udapl_src.c		\
+	dapl_ib_cq.c			\
+	dapl_ib_extensions.c	\
+	dapl_ib_mem.c			\
+	dapl_ib_qp.c			\
+	dapl_ib_util.c			\
+	dapl_ib_cm.c
+
+INCLUDES = ..\include;..\common;windows;..\..\dat\include;\
+		   ..\..\dat\udat\windows;..\udapl\windows;\
+		   ..\..\..\..\inc;..\..\..\..\inc\user;..\..\..\libibverbs\include
+
+DAPL_OPTS = -DEXPORT_DAPL_SYMBOLS -DDAT_EXTENSIONS -DSOCK_CM -DOPENIB -DCQ_WAIT_OBJECT
+
+USER_C_FLAGS = $(USER_C_FLAGS) $(DAPL_OPTS)
+
+!if !$(FREEBUILD)
+USER_C_FLAGS = $(USER_C_FLAGS) -DDAPL_DBG
+!endif
+
+TARGETLIBS= \
+	$(SDK_LIB_PATH)\kernel32.lib \
+	$(SDK_LIB_PATH)\ws2_32.lib \
+!if $(FREEBUILD)
+	$(TARGETPATH)\*\dat2.lib \
+	$(TARGETPATH)\*\libibverbs.lib
+!else
+	$(TARGETPATH)\*\dat2d.lib \
+	$(TARGETPATH)\*\libibverbsd.lib
+!endif
+
+MSC_WARNING_LEVEL = /W1 /wd4113
diff --git a/dapl/openib_scm/udapl.rc b/dapl/openib_scm/udapl.rc
new file mode 100644
index 0000000..8550256
--- /dev/null
+++ b/dapl/openib_scm/udapl.rc
@@ -0,0 +1,48 @@
+/*
+ * Copyright (c) 2007, 2009 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under the OpenIB.org BSD license
+ * below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+
+#include <oib_ver.h>
+
+#define VER_FILETYPE			VFT_DLL
+#define VER_FILESUBTYPE			VFT2_UNKNOWN
+
+#if DBG
+#define VER_FILEDESCRIPTION_STR		"Direct Access Provider Library v2.0 (OFA socket-cm) (Debug)"
+#define VER_INTERNALNAME_STR		"dapl2-ofa-scmd.dll"
+#define VER_ORIGINALFILENAME_STR	"dapl2-ofa-scmd.dll"
+#else
+#define VER_FILEDESCRIPTION_STR		"Direct Access Provider Library v2.0 (OFA socket-cm)"
+#define VER_INTERNALNAME_STR		"dapl2-ofa-scm.dll"
+#define VER_ORIGINALFILENAME_STR	"dapl2-ofa-scm.dll"
+#endif
+
+#include <common.ver>
-- 
1.5.2.5


From sean.hefty at intel.com  Mon May 18 12:15:57 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 18 May 2009 12:15:57 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use
	pthread_self	instead of getpid for debug messages
In-Reply-To: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>
References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>
Message-ID: <C658993954914F0ABA901B6E824EE800@amr.corp.intel.com>

please copy the ofw mail list on dapl changes

>diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
>index 1c098c5..0378a70 100644
>--- a/dapl/udapl/linux/dapl_osd.h
>+++ b/dapl/udapl/linux/dapl_osd.h
>@@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
> #define dapl_os_vprintf(fmt,args)	vprintf(fmt,args)
> #define dapl_os_syslog(fmt,args)	vsyslog(LOG_USER|LOG_WARNING,fmt,args)
>
>-#define dapl_os_getpid getpid
>-
>+#define dapl_os_getpid (long int)pthread_self

Maybe add a new call, dapl_os_get_thread_id or something similar, to avoid
confusion with the name and what the call returns.
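
A rough sketch of what keeping the two calls separate could look like.
The names example_os_getpid and example_os_gettid are hypothetical, and
the cast assumes pthread_t is an integer or pointer type (true on
Linux/glibc, not guaranteed by POSIX).

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Process id: the same value in every thread of the process. */
static long example_os_getpid(void)
{
	return (long)getpid();
}

/*
 * Thread id for debug output. Assumption: pthread_t can be cast to an
 * integer, which holds on Linux but is not portable in general.
 */
static unsigned long example_os_gettid(void)
{
	return (unsigned long)pthread_self();
}

/* Thread body that records its own id through the pointer it is given. */
static void *example_worker(void *arg)
{
	*(unsigned long *)arg = example_os_gettid();
	return NULL;
}
```

A debug prefix would then print example_os_gettid() with "%lx", and the
process id stays available under its own name for callers that really
want it.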


From bart.vanassche at gmail.com  Mon May 18 12:18:22 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Mon, 18 May 2009 21:18:22 +0200
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
	<f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>
Message-ID: <e2e108260905181218i4f561647gc2c0794fd44d301a@mail.gmail.com>

On Mon, May 18, 2009 at 8:40 PM, Chris Worley <worleys at gmail.com> wrote:
> Any other ideas of things to try?

Depends on the workload that is running on the initiators. Are the
initiators performing linear I/O or block I/O ? Which I/O scheduler is
being used by the initiator systems, and how has it been configured ?
Which I/O scheduler has been configured on the target, and with which
parameters ? As you probably know, you can find these parameters under
/sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or
scst_vdisk ? And what is the kernel version of the target system ? By
the way, an important I/O performance regression has been fixed in
kernel 2.6.29 (see also http://lwn.net/Articles/325307/).

Bart.
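
For anyone collecting the scheduler setting from many hosts, a small
sketch of parsing the contents of a queue/scheduler file, where the
kernel marks the active scheduler with square brackets. The function
name example_active_scheduler is hypothetical, and reading the file
itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from a line such as
 * "noop anticipatory deadline [cfq]" into out.
 * Returns 0 on success, -1 if no bracketed name is found or out is
 * too small to hold it.
 */
static int example_active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!rb)
		return -1;
	len = (size_t)(rb - lb - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

The input line would typically come from fgets() on the scheduler file
under the queue directory mentioned above.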


From worleys at gmail.com  Mon May 18 12:41:52 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 13:41:52 -0600
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <e2e108260905181218i4f561647gc2c0794fd44d301a@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
	<f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>
	<e2e108260905181218i4f561647gc2c0794fd44d301a@mail.gmail.com>
Message-ID: <f3177b9e0905181241y3554ee1fif47d1fd033c1238c@mail.gmail.com>

On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche
<bart.vanassche at gmail.com> wrote:
> On Mon, May 18, 2009 at 8:40 PM, Chris Worley <worleys at gmail.com> wrote:
>> Any other ideas of things to try?
>
> Depends on the workload that is running on the initiators. Are the
> initiators performing linear I/O or block I/O ?

I'm not sure what "linear I/O" is.  It is block I/O at 56KB chunks;
using direct I/O.

> Which I/O scheduler is
> being used by the initiator systems, and how has it been configured ?

The noop scheduler is being used on the targets and initiators.  All
the standard schedulers performed worse.

> Which I/O scheduler has been configured on the target, and with which
> parameters ? As you probably know, you can find these parameters under
> /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or
> scst_vdisk ?

scst_vdisk.

> And what is the kernel version of the target system ?

Ubuntu 8.10 with a 2.6.27 kernel, if my memory serves me correctly.

> By
> the way, an important I/O performance regression has been fixed in
> kernel 2.6.29 (see also http://lwn.net/Articles/325307/).

Thanks,  I'll try that.

Chris
>
> Bart.
>


From arlin.r.davis at intel.com  Mon May 18 12:47:56 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 18 May 2009 12:47:56 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self
	instead of getpid for debug messages
In-Reply-To: <C658993954914F0ABA901B6E824EE800@amr.corp.intel.com>
References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>
	<C658993954914F0ABA901B6E824EE800@amr.corp.intel.com>
Message-ID: <E3280858FA94444CA49D2BA02341C983521AA034@orsmsx506.amr.corp.intel.com>


>please copy the ofw mail list on dapl changes

ok

>Maybe add a new call, dapl_os_get_thread_id or something 
>similar, to avoid
>confusion with the name and what the call returns.

What about this...

diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 20ee405..6c6eeb5 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...)
                if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
                        va_start(args, fmt);
                        fprintf(stdout, "%s:%lx: ", _ptr_host_,
-                               dapl_os_getpid());
+                               dapl_os_gettid());
                        dapl_os_vprintf(fmt, args);
                        va_end(args);
                }
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 0378a70..e0e30bf 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args)      vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args)       vsyslog(LOG_USER|LOG_WARNING,fmt,args)

-#define dapl_os_getpid (long int)pthread_self
+#define dapl_os_getpid (int)getpid
+#define dapl_os_gettid (long int)pthread_self

 #endif /*  _DAPL_OSD_H_ */


From bart.vanassche at gmail.com  Mon May 18 12:55:39 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Mon, 18 May 2009 21:55:39 +0200
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <f3177b9e0905181241y3554ee1fif47d1fd033c1238c@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
	<f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>
	<e2e108260905181218i4f561647gc2c0794fd44d301a@mail.gmail.com>
	<f3177b9e0905181241y3554ee1fif47d1fd033c1238c@mail.gmail.com>
Message-ID: <e2e108260905181255m6d75ef81o7a1fcc1f037bb37@mail.gmail.com>

On Mon, May 18, 2009 at 9:41 PM, Chris Worley <worleys at gmail.com> wrote:
> On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche
> <bart.vanassche at gmail.com> wrote:
>> Which I/O scheduler has been configured on the target, and with which
>> parameters ? As you probably know, you can find these parameters under
>> /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or
>> scst_vdisk ?
>
> scst_vdisk.

The default number of kernel threads for scst_vdisk is five (kernel
module parameter num_threads). It might be interesting to experiment
with this parameter.

Bart.


From sean.hefty at intel.com  Mon May 18 13:14:19 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 18 May 2009 13:14:19 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self
	instead of getpid for debug messages
In-Reply-To: <E3280858FA94444CA49D2BA02341C983521AA034@orsmsx506.amr.corp.intel.com>
References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>
	<C658993954914F0ABA901B6E824EE800@amr.corp.intel.com>
	<E3280858FA94444CA49D2BA02341C983521AA034@orsmsx506.amr.corp.intel.com>
Message-ID: <5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com>

>diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
>index 20ee405..6c6eeb5 100644
>--- a/dapl/common/dapl_debug.c
>+++ b/dapl/common/dapl_debug.c
>@@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...)
>                if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
>                        va_start(args, fmt);
>                        fprintf(stdout, "%s:%lx: ", _ptr_host_,
>-                               dapl_os_getpid());
>+                               dapl_os_gettid());
>                        dapl_os_vprintf(fmt, args);
>                        va_end(args);
>                }
>diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
>index 0378a70..e0e30bf 100644
>--- a/dapl/udapl/linux/dapl_osd.h
>+++ b/dapl/udapl/linux/dapl_osd.h
>@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
> #define dapl_os_vprintf(fmt,args)      vprintf(fmt,args)
> #define dapl_os_syslog(fmt,args)       vsyslog(LOG_USER|LOG_WARNING,fmt,args)
>
>-#define dapl_os_getpid (long int)pthread_self
>+#define dapl_os_getpid (int)getpid
>+#define dapl_os_gettid (long int)pthread_self

That's fine - what about Windows?  :)


From worleys at gmail.com  Mon May 18 13:46:55 2009
From: worleys at gmail.com (Chris Worley)
Date: Mon, 18 May 2009 14:46:55 -0600
Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing
	as threads increase
In-Reply-To: <e2e108260905181255m6d75ef81o7a1fcc1f037bb37@mail.gmail.com>
References: <f3177b9e0905180921r60601120l8ad8363e99621e31@mail.gmail.com>
	<C2F174F99918D54CA2A96E57C5079B6F015040D7@sbc-exmsg2.sbcounty.gov>
	<e2e108260905181022n7b444435g83755d7cfb97d0f2@mail.gmail.com>
	<f3177b9e0905181140h74bfa7b8vebab4db0dbfb5da3@mail.gmail.com>
	<e2e108260905181218i4f561647gc2c0794fd44d301a@mail.gmail.com>
	<f3177b9e0905181241y3554ee1fif47d1fd033c1238c@mail.gmail.com>
	<e2e108260905181255m6d75ef81o7a1fcc1f037bb37@mail.gmail.com>
Message-ID: <f3177b9e0905181346q676298e5uf1d6354ab697727e@mail.gmail.com>

On Mon, May 18, 2009 at 1:55 PM, Bart Van Assche
<bart.vanassche at gmail.com> wrote:
> On Mon, May 18, 2009 at 9:41 PM, Chris Worley <worleys at gmail.com> wrote:
>> On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche
>> <bart.vanassche at gmail.com> wrote:
>>> Which I/O scheduler has been configured on the target, and with which
>>> parameters ? As you probably know, you can find these parameters under
>>> /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or
>>> scst_vdisk ?
>>
>> scst_vdisk.
>
> The default number of kernel threads for scst_vdisk is five (kernel
> module parameter num_threads). It might be interesting to experiment
> with this parameter.

Thanks!  I'll try that too.

Chris


From rdreier at cisco.com  Mon May 18 14:15:11 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 18 May 2009 14:15:11 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com> (Jeff Squyres's
	message of "Mon, 18 May 2009 14:24:33 -0400")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com>
	<adafxfgshgl.fsf@cisco.com>
	<730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com>
	<469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com>
	<167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com>
Message-ID: <aday6sudsf4.fsf@cisco.com>


 > When our memory hooks tell us that memory is about to be removed from
 > the process, we unregister all pages in the relevant region and remove
 > those entries from the cache.  So the next time you look in the cache
 > for 0x3000-0x3fff, it won't be there -- it'll be treated as
 > cache-cold.

So you want the registration cache to be reference counted per-page?
Seems like potentially a lot of overhead -- if someone registers a
million pages, then to check for a cache hit, you have to potentially
check millions of reference counts.

 > > How does 0x1000 to 0x3fff get registered as a single Memory Region?
 > > If it is legitimate to free() 0x3000..0x3fff then how can there ever
 > > be a legitimate reference to 0x1000..0x3fff? If there is no such
 > > single reference, I don't see how a Memory Region is ever created
 > > covering that range.
 > >
 > > If the user creates the Memory Region, then they are responsible for
 > > not free()ing a portion of it.
 > >
 > 
 > Agreed.  If an application does that, it deserves what it gets.

Hang on.  The whole point of MR caching is exactly that you don't
unregister a memory region, even after you're done using the memory it
covers, in the hope that you'll want to reuse that registration.  And
the whole point of this thread is that an application can then free()
some of the memory that is still registered in the cache.

 > Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
 > potentially a multiple of pages.  So yes, you could end up having a
 > single registration that spans the buffers used in multiple, distinct
 > MPI sends.  We reference count by page to ensure that deregistrations
 > do not occur prematurely.

Hmm, I'm worried that the exact semantics of the memory cache seem to be
tied into how the MPI implementation is registering memory.  Open MPI
happens to work in small chunks (I guess) and so your cache is tailored
for that use case.  I know the original proposal was an attempt to come
up with something that all the MPIs can agree on, but it didn't cover
the full semantics, at least not for cases like the overlapping
sub-registrations that we're discussing here.  Is there still one set of
semantics everyone can agree on?

 - R.
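
To make the per-page bookkeeping in this thread concrete, here is a toy
sketch. It is not Open MPI's cache: the ex_* names, the 4 KB page size
and the fixed-size table are all assumptions for illustration, and no
real registration calls are made. Lengths are assumed to be non-zero and
ranges are assumed to fit in the toy table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SHIFT	12	/* assumed 4 KB pages */
#define EX_NPAGES	1024	/* toy address space of 4 MB */

static unsigned ex_refcnt[EX_NPAGES];	/* one counter per page */

static size_t ex_first(uintptr_t addr)
{
	return (size_t)(addr >> EX_PAGE_SHIFT);
}

static size_t ex_last(uintptr_t addr, size_t len)
{
	return (size_t)((addr + len - 1) >> EX_PAGE_SHIFT);
}

/* A registration covering [addr, addr+len) takes a reference per page. */
static void ex_ref(uintptr_t addr, size_t len)
{
	size_t p;

	for (p = ex_first(addr); p <= ex_last(addr, len); p++)
		ex_refcnt[p]++;
}

/* Memory hook: the range is leaving the process, so drop its pages. */
static void ex_invalidate(uintptr_t addr, size_t len)
{
	size_t p;

	for (p = ex_first(addr); p <= ex_last(addr, len); p++)
		ex_refcnt[p] = 0;
}

/* Cache hit only if every page in the range is still referenced. */
static int ex_cache_hit(uintptr_t addr, size_t len)
{
	size_t p;

	for (p = ex_first(addr); p <= ex_last(addr, len); p++)
		if (ex_refcnt[p] == 0)
			return 0;
	return 1;
}
```

Note that ex_cache_hit() walks one counter per page, which is the
overhead concern raised above for very large registrations.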


From arlin.r.davis at intel.com  Mon May 18 14:33:08 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 18 May 2009 14:33:08 -0700
Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self
	instead of getpid for debug messages
In-Reply-To: <5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com>
References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com>
	<C658993954914F0ABA901B6E824EE800@amr.corp.intel.com>
	<E3280858FA94444CA49D2BA02341C983521AA034@orsmsx506.amr.corp.intel.com>
	<5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com>
Message-ID: <E3280858FA94444CA49D2BA02341C983521AA1F4@orsmsx506.amr.corp.intel.com>

 
>
>That's fine - what about Windows?  :)
>
>

Yes, I do windows. Please verify the following patch:

diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c
index 20ee405..6723217 100644
--- a/dapl/common/dapl_debug.c
+++ b/dapl/common/dapl_debug.c
@@ -49,8 +49,8 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...)
 	if (type & g_dapl_dbg_type) {
 		if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) {
 			va_start(args, fmt);
-			fprintf(stdout, "%s:%lx: ", _ptr_host_,
-				dapl_os_getpid());
+			fprintf(stdout, "%s:%x: ", _ptr_host_,
+				dapl_os_gettid());
 			dapl_os_vprintf(fmt, args);
 			va_end(args);
 		}
diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h
index 0378a70..cb61cae 100644
--- a/dapl/udapl/linux/dapl_osd.h
+++ b/dapl/udapl/linux/dapl_osd.h
@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
 #define dapl_os_vprintf(fmt,args)	vprintf(fmt,args)
 #define dapl_os_syslog(fmt,args)	vsyslog(LOG_USER|LOG_WARNING,fmt,args)
 
-#define dapl_os_getpid (long int)pthread_self 
+#define dapl_os_getpid (DAT_UINT32)getpid 
+#define dapl_os_gettid (DAT_UINT32)pthread_self 
 
 #endif /*  _DAPL_OSD_H_ */
 
diff --git a/dapl/udapl/windows/dapl_osd.h b/dapl/udapl/windows/dapl_osd.h
index cdfeb24..416a24b 100644
--- a/dapl/udapl/windows/dapl_osd.h
+++ b/dapl/udapl/windows/dapl_osd.h
@@ -501,11 +501,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base)
     return strtol(nptr, endptr, base);
 }
 
-STATIC __inline int
-dapl_os_getpid(void)
-{
-	return (int)GetCurrentProcessId();
-}
+#define dapl_os_getpid	(DAT_UINT32)GetCurrentProcessId()
+#define dapl_os_gettid	(DAT_UINT32)GetCurrentThreadId()
 
 /*
  *  Debug Helper Functions


From abenjamin at sgi.com  Mon May 18 16:44:25 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Mon, 18 May 2009 16:44:25 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <adad4aaf3dk.fsf@cisco.com>
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
Message-ID: <4A11F2D9.6080107@sgi.com>

Roland Dreier wrote:
> why is the address you want at the position IB_DEVICE_NAME_MAX instead
> of at index 0?  
It should be 0. Thanks for pointing that out.
>  In general I don't like seeing strcpy()/strcat() instead of
> strlcpy()/strlcat().
>
>
>  - R.
>   
I'll modify the code to use snprintf().

Thank you for your help.

Regards,
Benjamin
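
A minimal userspace sketch of the snprintf() approach mentioned above.
The function name ex_format_irq_name and the "name@pci:address" layout
are hypothetical and are not taken from the patch under review.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a per-device interrupt name at index 0 of buf.
 * Returns 0 on success, -1 if the result was truncated or an
 * output error occurred. buf is always NUL-terminated when len > 0.
 */
static int ex_format_irq_name(char *buf, size_t len,
			      const char *drv, const char *pci)
{
	int n = snprintf(buf, len, "%s@pci:%s", drv, pci);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Unlike strcpy()/strcat(), the length bound is passed in one place and
truncation is reported through the return value.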


From Zhen.Liang at Sun.COM  Mon May 18 21:37:32 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Tue, 19 May 2009 12:37:32 +0800
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
In-Reply-To: <1233654242.1364.39.camel@pyren.uio.no>
References: <1233654242.1364.39.camel@pyren.uio.no>
Message-ID: <4A12378C.8030101@sun.com>

Hi Ole,
Have you found a solution for this? I think we have exactly the same
problem on a 4600 with ofed-1.4.1-rc4:
lspci output:
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe
2.0 2.5GT/s] (rev a0)

and error messages from dmesg:

mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1,
reducing to 1.
mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
mlx4_core: probe of 0000:03:00.0 failed with error -5

Thanks
Liang


Ole Widar Saastad wrote:
> I have problems using the OFED 1.4 software on the Sun x4600 nodes.
> Need help to get this to work. We plan to run GPFS over IB on these
> nodes in addition to MPI.
>
> Sun 4600 nodes with 8 quad core cpus,
> Quad-Core AMD Opteron(tm) Processor 8380
>
> OS is Rocks release 4.
> centos-release-4-4.2/x86_64/
>
> Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8
> 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Needless to say our 300+ nodes (SUN x2200 with quad core) run fine with
> OFED 1.4 (and 1.3); they have almost the same kernel:
> Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> Same except  ELsmp and not ELlargesmp.
>
> More information:
>
> dmesg prints out the following error message :
>
> Losing some ticks... checking if CPU frequency changed.
> modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp 0000007fbfffcfd8 error 6
> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
> mlx4_core: Initializing 0000:02:00.0
> ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193
> PCI: Setting latency timer of device 0000:02:00.0 to 64
> mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1, reducing to 1.
> MSI INIT SUCCESS
> mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1
> mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5)
> mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting.
> mlx4_core: probe of 0000:02:00.0 failed with error -5
>
> The following software is installed:
>
> Select Option [1-5]:3
> kernel-ib
> libibverbs
> libibverbs-devel
> libibverbs-utils
> libmthca
> libmlx4
> libcxgb3
> libnes
> libipathverbs
> libibcommon
> libibcommon-devel
> libibumad
> libibumad-devel
> ofed-docs
> ofed-scripts
> ibvexdmtools
> qlgc_vnic_daemon
>
>
> Just to be sure the card is present :
> lspci returns :
> 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)
>
>
>   


From dorfman.eli at gmail.com  Tue May 19 01:56:36 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 19 May 2009 11:56:36 +0300
Subject: [ofa-general] Re: [PATCH ] opensm: MFT tables are not set after non
 full member re-join
In-Reply-To: <4A1019F6.5060900@gmail.com>
References: <4A1019F6.5060900@gmail.com>
Message-ID: <4A127444.8080707@gmail.com>

Eli Dorfman (Voltaire) wrote:
> MFT tables are not set after non full member re-join
> 
> In case of non full member re-join MFT tables are not set.
> No need to set or check non full member reference to mlid (port->mcm_list).
> This list should be used only for full members for cleanup when port goes down.
> 
> A simple scenario to reproduce this:
> 1. Full member creates group
> 2. Non-member join - MFT sent
> 3. Full member leave
>         a. group is deleted but the non-member port still has a reference to the MLID
> 4. Full member re-creates the group
> 5. Non member re-joins - MFT *NOT* sent to switches
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/include/opensm/osm_sm.h         |    3 ++-
>  opensm/opensm/osm_sa_mcmember_record.c |    6 +++---
>  opensm/opensm/osm_sm.c                 |   22 +++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
> index cc8321d..1a8a577 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -539,7 +539,8 @@ osm_resp_send(IN osm_sm_t * sm,
>  ib_api_status_t
>  osm_sm_mcgrp_join(IN osm_sm_t * const p_sm,
>  		  IN const ib_net16_t mlid,
> -		  IN const ib_net64_t port_guid);
> +		  IN const ib_net64_t port_guid,
> +				  IN uint8_t scope_state);
>  /*
>  * PARAMETERS
>  *	p_sm
> diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
> index 5543221..fe29dd6 100644
> --- a/opensm/opensm/osm_sa_mcmember_record.c
> +++ b/opensm/opensm/osm_sa_mcmember_record.c
> @@ -1039,7 +1039,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
>  	if (!p_mgrp) {
>  		char gid_str[INET6_ADDRSTRLEN];
>  		CL_PLOCK_RELEASE(sa->p_lock);
> -		OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
> +		OSM_LOG(sa->p_log, OSM_LOG_INFO,
>  			"Failed since multicast group %s not present\n",
>  			inet_ntop(AF_INET6, p_recvd_mcmember_rec->mgid.raw,
>  				  gid_str, sizeof gid_str));
> @@ -1309,8 +1309,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
>  
>  	/* do the actual routing (actually schedule the update) */
>  	status = osm_sm_mcgrp_join(sa->sm, mlid,
> -				   p_recvd_mcmember_rec->port_gid.unicast.
> -				   interface_id);
> +							   p_recvd_mcmember_rec->port_gid.unicast.interface_id,
> +							   p_recvd_mcmember_rec->scope_state);
>  
>  	if (status != IB_SUCCESS) {
>  		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: "
> diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> index daa60ff..b334d39 100644
> --- a/opensm/opensm/osm_sm.c
> +++ b/opensm/opensm/osm_sm.c
> @@ -468,7 +468,7 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
>  /**********************************************************************
>   **********************************************************************/
>  ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
> -				  IN const ib_net64_t port_guid)
> +				  IN const ib_net64_t port_guid, IN uint8_t scope_state)
>  {
>  	osm_mgrp_t *p_mgrp;
>  	osm_port_t *p_port;
> @@ -515,6 +515,25 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
>  		goto Exit;
>  	}
>  
> +	/* if there was no change from the last time
> +	 * we processed the group we can skip doing anything
> +	 */
> +	if (p_mgrp->last_change_id == p_mgrp->last_tree_id) {
> +		OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE,
> +			"Skip processing mgrp with lid:0x%X last change id:%u\n",
> +			cl_ntoh16(mlid), p_mgrp->last_change_id);
> +		goto Exit;
> +	} else {
> +		OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> +			"processing mgrp with lid:0x%X port: 0x%016" PRIx64 " last change id:%u tree id:%u\n",
> +			cl_ntoh16(mlid), cl_ntoh64(port_guid), 
> +			p_mgrp->last_change_id, p_mgrp->last_tree_id);
> +	}
> +
> +	/* add mgrp only to FULL member port. used for cleanup when port goes down */
> +	if (!(scope_state & IB_JOIN_STATE_FULL))
> +		goto MgrpProcess;
> +
>  	/*
>  	 * Check if the object (according to mlid) already exists on this port.
>  	 * If it does - then no need to update it again, and no need to
> @@ -543,6 +562,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
>  		goto Exit;
>  	}
>  
> +MgrpProcess:
>  	status = sm_mgrp_process(p_sm, p_mgrp);
>  	CL_PLOCK_RELEASE(p_sm->p_lock);
>  

The following fixes a bug in the above patch that leaves the opensm lock
held (and so hangs opensm) when the multicast group has not changed.

diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
index b334d39..28cd76f 100644
--- a/opensm/opensm/osm_sm.c
+++ b/opensm/opensm/osm_sm.c
@@ -519,6 +519,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid,
 	 * we processed the group we can skip doing anything
 	 */
 	if (p_mgrp->last_change_id == p_mgrp->last_tree_id) {
+		CL_PLOCK_RELEASE(p_sm->p_lock);
 		OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE,
 			"Skip processing mgrp with lid:0x%X last change id:%u\n",
 			cl_ntoh16(mlid), p_mgrp->last_change_id);
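
The pattern behind this one-line fix, shown in isolation: every exit from
the locked region, including the early "nothing changed" exit, has to drop
the lock itself. What follows is only a userspace sketch using a pthread
mutex, not opensm code. The names struct group, group_lock and group_join()
are hypothetical, and it assumes (as the fix implies) that the common exit
path does not release the lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a multicast group's change tracking. */
struct group {
	unsigned int last_change_id;
	unsigned int last_tree_id;
	unsigned int times_processed;
};

static pthread_mutex_t group_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if the group was processed, 0 if it was skipped. */
static int group_join(struct group *g)
{
	assert(g != NULL);
	pthread_mutex_lock(&group_lock);

	if (g->last_change_id == g->last_tree_id) {
		/* Early exit: release here, nothing later will do it. */
		pthread_mutex_unlock(&group_lock);
		return 0;
	}

	/* Stand-in for the real processing of the group. */
	g->last_tree_id = g->last_change_id;
	g->times_processed++;

	pthread_mutex_unlock(&group_lock);
	return 1;
}
```

Without the unlock in the early-exit branch, the second call for an
unchanged group would block forever on the mutex.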


From vlad at lists.openfabrics.org  Tue May 19 03:25:55 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 19 May 2009 03:25:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090519-0200 daily build status
Message-ID: <20090519102555.601EBE615C4@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From sebastien.dugue at bull.net  Tue May 19 05:09:21 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Tue, 19 May 2009 14:09:21 +0200
Subject: [ofa-general] [PATCH] perftest - Fix proc_get_cpu_mhz() on IA64
Message-ID: <20090519140921.3ea442b6@frecb007965>


  On IA64, proc_get_cpu_mhz() must use the ITC frequency rather than
the CPU frequency.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>

---
 get_clock.c |   18 +++++++++++++-----
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/get_clock.c b/get_clock.c
index 0acb074..cc86452 100755
--- a/get_clock.c
+++ b/get_clock.c
@@ -144,12 +144,20 @@ static double proc_get_cpu_mhz(int no_cpu_freq_fail)
 	while(fgets(buf, sizeof(buf), f)) {
 		double m;
 		int rc;
+
+#if defined (__ia64__)
+		/* Use the ITC frequency on IA64 */
+		rc = sscanf(buf, "itc MHz : %lf", &m);
+#elif defined (__PPC__) || defined (__PPC64__)
+		/* PPC has a different format as well */
+		rc = sscanf(buf, "clock : %lf", &m);
+#else
 		rc = sscanf(buf, "cpu MHz : %lf", &m);
-		if (rc != 1) {	/* PPC has a different format */
-			rc = sscanf(buf, "clock : %lf", &m);
-			if (rc != 1)
-				continue;
-		}
+#endif
+
+		if (rc != 1)
+			continue;
+
 		if (mhz == 0.0) {
 			mhz = m;
 			continue;
-- 
1.6.3.rc3.12.gb7937
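
For readers who want to try the parsing outside of perftest, here is a
minimal standalone sketch of the same compile-time key selection. The
helper name parse_mhz_line() is hypothetical and not part of perftest; the
three format strings are the ones used in the patch above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse one /proc/cpuinfo line.
 * Returns 1 and stores the frequency in *mhz if the line carries the
 * key used on this architecture, 0 otherwise.
 */
static int parse_mhz_line(const char *line, double *mhz)
{
	assert(line != NULL && mhz != NULL);

#if defined(__ia64__)
	/* IA64: the cycle counter runs at the ITC frequency */
	return sscanf(line, "itc MHz : %lf", mhz) == 1;
#elif defined(__PPC__) || defined(__PPC64__)
	/* PPC reports the frequency under "clock" */
	return sscanf(line, "clock : %lf", mhz) == 1;
#else
	return sscanf(line, "cpu MHz : %lf", mhz) == 1;
#endif
}
```

Whitespace in a scanf format matches any run of whitespace, so the tabs
that /proc/cpuinfo puts before the colon are accepted.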


From tziporet at dev.mellanox.co.il  Tue May 19 07:28:16 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 19 May 2009 17:28:16 +0300
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
In-Reply-To: <4A12378C.8030101@sun.com>
References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com>
Message-ID: <4A12C200.4000708@mellanox.co.il>

Liang Zhen wrote:
> Hi Ole,
> Have you got a solution for this? I think we have exactly the same problem on
> 4600 with ofed-1.4.1-rc4:
> lspci output:
> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe
> 2.0 2.5GT/s] (rev a0)
>
> and error messages from dmesg:
>
> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
> mlx4_core: Initializing 0000:03:00.0
> mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1,
> reducing to 1.
> mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
> mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
> mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
> mlx4_core: probe of 0000:03:00.0 failed with error -5
>
>   
Can you send me the FW version and board type?
Since the driver is not loading, you can use mstflint to get this data.
Please use:

The devices can be accessed by their PCI ID as displayed by lspci
(bus:dev.fn).
Example:
# List all Mellanox devices
> /sbin/lspci -d 15b3:
02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368
(rev a0)

# Use mstflint tool to query the firmware on this device
> mstflint -d 02:00.0 q

Tziporet


From jsquyres at cisco.com  Tue May 19 07:57:30 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 19 May 2009 10:57:30 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <aday6sudsf4.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><D6421CE6-78D4-4C91-8803-9482E1C60566@cisco.com><adafxfgshgl.fsf@cisco.com><730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com><469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com><167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com>
	<aday6sudsf4.fsf@cisco.com>
Message-ID: <2639C2E6-BFD7-463C-AF1D-382F360F6236@cisco.com>

On May 18, 2009, at 5:15 PM, Roland Dreier (rdreier) wrote:

> So you want the registration cache to be reference counted per-page?
> Seems like potentially a lot of overhead -- if someone registers a
> million pages, then to check for a cache hit, you have to potentially
> check millions of reference counts.
>

Our caches are hash tables of balanced red-black trees.  So in  
practice, we won't be trolling through anywhere near a million  
reference counts to find a hit.

> Hang on.  The whole point of MR caching is exactly that you don't
> unregister a memory region, even after you're done using the memory it
> covers, in the hope that you'll want to reuse that registration.  And
> the whole point of this thread is that an application can then free()
> some of the memory that is still registered in the cache.
>

Sorry -- the implication that I took from Caitlyn's text was that the  
memory was *used* after it was freed.  That is clearly erroneous.

What OMPI does (and apparently other MPIs do) is simply invalidate
any registration for free'd memory.  Additionally, we won't unregister  
memory while there is at least one use of it outstanding (that MPI  
knows about, such as a pending non-blocking communication).  We lazily  
unregister just for exactly the case you're talking about (might want  
to use it for verbs communication again later).
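
To make the per-page reference counting concrete, here is a minimal sketch.
It is not Open MPI's code: struct reg_chunk, CHUNK_PAGES and the chunk_*()
helpers are hypothetical names, the chunk size is fixed for brevity, and
the hash table / red-black tree lookup is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_PAGES 8	/* assumed chunk size, for illustration only */

/* One cached registration covering CHUNK_PAGES pages. */
struct reg_chunk {
	unsigned int page_refs[CHUNK_PAGES];	/* outstanding uses per page */
	int registered;
};

/* A communication starts using pages [first, first + count). */
static void chunk_get(struct reg_chunk *c, size_t first, size_t count)
{
	size_t i;

	assert(first + count <= CHUNK_PAGES);
	for (i = first; i < first + count; i++)
		c->page_refs[i]++;
}

/* The communication on pages [first, first + count) has completed. */
static void chunk_put(struct reg_chunk *c, size_t first, size_t count)
{
	size_t i;

	assert(first + count <= CHUNK_PAGES);
	for (i = first; i < first + count; i++) {
		assert(c->page_refs[i] > 0);
		c->page_refs[i]--;
	}
}

/*
 * Drop the registration only if no page is still in use.
 * Returns 1 if it was dropped, 0 if it has to stay for now.
 */
static int chunk_try_unregister(struct reg_chunk *c)
{
	size_t i;

	for (i = 0; i < CHUNK_PAGES; i++)
		if (c->page_refs[i] != 0)
			return 0;
	c->registered = 0;	/* a real cache would call ibv_dereg_mr() here */
	return 1;
}
```

Two overlapping sends on the same chunk keep the registration alive until
the later one completes, which is the "no premature deregistration"
property described above.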

>  > Per my prior mail, Open MPI registers chunks at a time.  Each chunk is
>  > potentially a multiple of pages.  So yes, you could end up having a
>  > single registration that spans the buffers used in multiple, distinct
>  > MPI sends.  We reference count by page to ensure that deregistrations
>  > do not occur prematurely.
>
> Hmm, I'm worried that the exact semantics of the memory cache seem to be
> tied into how the MPI implementation is registering memory.  Open MPI
> happens to work in small chunks (I guess) and so your cache is tailored
> for that use case.  I know the original proposal was an attempt to come
> up with something that all the MPIs can agree on, but it didn't cover
> the full semantics, at least not for cases like the overlapping
> sub-registrations that we're discussing here.  Is there still one set of
> semantics everyone can agree on?
>


So just to be clear -- let's separate the two issues that are evolving  
from this thread:

1. fix the hole where memory returned to the OS cannot be guaranteed
to be caught by userspace (and therefore may still stay registered
and/or invalidate userspace registration cache entries)

2. have libibverbs include some form of memory registration caching  
(potentially using the solution to #1 to know when to invalidate reg.  
cache entries)

Personally, I would prioritize the issues in this order.  Did
a solution for #1 get agreed upon?  I admit that I got lost in the
kernel discussion of issues between you, Jason, etc.

Agreeing on registration caching semantics may take a little more  
discussion (although, as someone pointed out earlier, if libibverbs'  
reg caching is optional, then the verbs-based app can choose to use it  
or their own scheme).

-- 
Jeff Squyres
Cisco Systems


From akepner at sgi.com  Tue May 19 14:55:05 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 19 May 2009 14:55:05 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
Message-ID: <20090519215505.GN6837@sgi.com>


We've seen a few instances of a crash in ipoib_neigh_cleanup() due to 
the use of a stale pointer:


848         neigh = *to_ipoib_neigh(n); <- read neigh (no locking)
.....
858         spin_lock_irqsave(&priv->lock, flags);
859
860         if (neigh->ah) <--- at this point neigh may be stale
861                 ah = neigh->ah;
862         list_del(&neigh->list);
863         ipoib_neigh_free(n->dev, neigh);
864
865         spin_unlock_irqrestore(&priv->lock, flags);


(Mentioned this here:
http://lists.openfabrics.org/pipermail/ewg/2008-April/006459.html)

We've been using a patch which re-reads neigh after the spinlock is 
taken. It's been effective in practice, but there's still a window 
where it's possible to use the stale pointer.

I've been looking into a proper fix for this, and I'd like to solicit 
any ideas. 

First thought was to use RCU, e.g., instead of to_ipoib_neigh(), use:

static inline struct ipoib_neigh* ipoib_neigh_retrieve(struct neighbour *n)
{
	struct ipoib_neigh **np;
	np = (void*) n + ALIGN(offsetof(struct neighbour, ha) +
			       INFINIBAND_ALEN, sizeof(void *));
	return rcu_dereference(*np);
}

static inline void ipoib_neigh_assign(struct neighbour *n,
                                     struct ipoib_neigh *in)
{
	struct ipoib_neigh **np;
	np = (void*) n + ALIGN(offsetof(struct neighbour, ha) +
			       INFINIBAND_ALEN, sizeof(void *));
	rcu_assign_pointer(*np, in);
}

where ipoib_neigh_retrieve() is done under rcu_read_lock() and 
ipoib_neigh_assign() under some spinlock (ipoib_dev_priv's lock 
might be repurposed for that use).

But that approach gets more complicated than seems warranted 
(partly because there's a need to promote readers to writers in 
a few places...).


Second thought was to use new locks to serialize access to the 
ipoib_neigh pointer stashed away in struct neighbour. Something 
like:

struct ipoib_neigh_lock
{
	spinlock_t sl;
}__attribute__((__aligned__(SMP_CACHE_BYTES)));

#define IPOIB_LOCK_SHIFT        6
#define IPOIB_LOCK_SIZE         (1 << IPOIB_LOCK_SHIFT)
#define IPOIB_LOCK_MASK         (IPOIB_LOCK_SIZE -1)

static struct ipoib_neigh_lock
ipoib_neigh_locks[IPOIB_LOCK_SIZE] __cacheline_aligned;

static inline void lock_ipoib_neigh(unsigned int hval)
{
	spin_lock(&ipoib_neigh_locks[hval & IPOIB_LOCK_MASK].sl);
}

static inline void unlock_ipoib_neigh(unsigned int hval)
{
	spin_unlock(&ipoib_neigh_locks[hval & IPOIB_LOCK_MASK].sl);
}

unsigned int ipoib_neigh_hval(struct neighbour *n);

....

static void ipoib_neigh_cleanup(struct neighbour *n)
{
	.....
        unsigned int hval = ipoib_neigh_hval(n);

        lock_ipoib_neigh(hval);

        neigh = *to_ipoib_neigh(n);
        if (neigh)
                priv = netdev_priv(neigh->dev);
        else {
                unlock_ipoib_neigh(hval);
                return;
        }
	....
        spin_lock_irqsave(&priv->lock, flags);

        if (neigh->ah)
                ah = neigh->ah;
        list_del(&neigh->list);
        ipoib_neigh_free(n->dev, neigh);

        spin_unlock_irqrestore(&priv->lock, flags);

        unlock_ipoib_neigh(hval);
....


This seems much simpler, but maybe there are better approaches?
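
For anyone who wants to experiment with the hashed-lock idea outside the
kernel, here is a userspace sketch using pthread mutexes in place of
spinlocks. Everything in it is an assumption made for illustration: struct
owner stands in for struct neighbour, ptr_hval() is a made-up pointer hash
(ipoib_neigh_hval() is only declared above, not defined), and
detach_private() is a hypothetical helper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define LOCK_SHIFT	6
#define LOCK_SIZE	(1u << LOCK_SHIFT)
#define LOCK_MASK	(LOCK_SIZE - 1)

static pthread_mutex_t stripe_locks[LOCK_SIZE];

/* Must be called once before any lock is taken. */
static void stripe_locks_init(void)
{
	unsigned int i;

	for (i = 0; i < LOCK_SIZE; i++)
		pthread_mutex_init(&stripe_locks[i], NULL);
}

/* Made-up hash of an object address; any stable hash would do. */
static unsigned int ptr_hval(const void *p)
{
	uintptr_t v = (uintptr_t)p;

	return (unsigned int)((v >> 4) ^ (v >> 12));
}

static void stripe_lock(unsigned int hval)
{
	pthread_mutex_lock(&stripe_locks[hval & LOCK_MASK]);
}

static void stripe_unlock(unsigned int hval)
{
	pthread_mutex_unlock(&stripe_locks[hval & LOCK_MASK]);
}

/* Stand-in for the object that carries a stashed private pointer. */
struct owner {
	void *priv;
};

/*
 * Read and clear the stashed pointer under the stripe lock, so that
 * exactly one caller gets to clean it up. Returns NULL if another
 * caller already detached it.
 */
static void *detach_private(struct owner *o)
{
	unsigned int hval;
	void *p;

	assert(o != NULL);
	hval = ptr_hval(o);

	stripe_lock(hval);
	p = o->priv;
	o->priv = NULL;
	stripe_unlock(hval);	/* released on every path */
	return p;
}
```

The point of the striping is that unrelated objects rarely contend on the
same lock, while all accesses to one object's stashed pointer always hash
to the same lock.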

-- 
Arthur


From rdreier at cisco.com  Tue May 19 15:01:13 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 19 May 2009 15:01:13 -0700
Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation
	of QP buffers with 64K pages
In-Reply-To: <20090518095516.6a803492@frecb007965> (sebastien dugue's message
	of "Mon, 18 May 2009 09:55:16 +0200")
References: <20090518095156.7f9c39e6@frecb007965>
	<20090518095516.6a803492@frecb007965>
Message-ID: <adaiqjwda6u.fsf@cisco.com>

 >   QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers
 > size to the page size and then allocates page aligned memory using
 > posix_memalign().
 > 
 >   However, this allocation is quite wasteful on architectures using 64K pages
 > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
 > parameter and chunks are allocated using mmap. thus we end up allocating:
 > 
 > (requested size rounded to the page size) + (page size) + (malloc overhead)
 > 
 > rounded internally to the page size.
 > 
 >   So for example, if we request a buffer of page_size bytes, we end up
 > consuming 3 pages. In short, for each QP buffer we allocate, there is an
 > overhead of 2 pages. This is quite visible on large clusters especially where
 > the number of QP can reach several thousands.
 > 
 >   This patch creates a new function mlx4_alloc_page() for use by
 > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
 > the page size is 64K.
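
The arithmetic in the description above can be checked with a small
sketch. The function names are hypothetical, and the malloc overhead is
passed in as a parameter because its exact size depends on the glibc
build; 16 bytes is used below purely as an assumed example value.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align. */
static size_t round_up(size_t size, size_t align)
{
	assert(align > 0);
	return (size + align - 1) / align * align;
}

/*
 * Pages consumed when a page-aligned request is served from an
 * mmap()ed malloc chunk, following the formula quoted above:
 * (request rounded to page size) + (page size) + (malloc overhead),
 * rounded up to the page size.
 */
static size_t memalign_mmap_pages(size_t request, size_t page_size,
				  size_t malloc_overhead)
{
	size_t total = round_up(request, page_size) + page_size +
		       malloc_overhead;

	return round_up(total, page_size) / page_size;
}
```

With 64K pages and a one-page request, 65536 + 65536 + 16 rounds up to
196608 bytes, that is 3 pages for 1 page of payload, matching the
2 pages of overhead per QP buffer mentioned above.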

makes sense I guess.  It would be nice if glibc() were smart enough to
know that mmap(MAP_ANONYMOUS) is going to give something page-aligned
anyway, but it seems that malloc overhead (required to make the memory
from posix_memalign() work with free()) is going to cost at least one
extra page, which as you point out is pretty bad with 64KB pages.  (Of
course 64KB pages are a disaster for any workload that deals with small
objects of any kind, but that's another story)

However I wonder why we want to make this optimization only for 64KB
pages.  It seems the code would be simpler if we just had our own
page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it
unconditionally everywhere.  Or is it not actually better even on
sane-sized (ie 4KB) page systems?  It seems we still have the malloc
overhead which is going to cost us another page?

 - R.
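
A rough sketch of what such an unconditional mmap()-based page allocator
could look like. This is not the libmlx4 code: struct page_buf,
page_buf_alloc() and page_buf_free() are hypothetical names, and the
length is kept in the structure because munmap() needs it.

```c
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical buffer descriptor; the length is needed for munmap(). */
struct page_buf {
	void   *buf;
	size_t  length;
};

/* Round size up to the next multiple of page_size. */
static size_t round_up_to_page(size_t size, size_t page_size)
{
	assert(page_size > 0);
	return (size + page_size - 1) / page_size * page_size;
}

/* Allocate zero-filled, page-aligned memory. Returns 0 on success. */
static int page_buf_alloc(struct page_buf *pb, size_t size)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

	assert(pb != NULL && size > 0);
	pb->length = round_up_to_page(size, page_size);
	pb->buf = mmap(NULL, pb->length, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (pb->buf == MAP_FAILED) {
		pb->buf = NULL;
		pb->length = 0;
		return -1;
	}
	return 0;
}

/* Return the pages to the kernel; safe to call on an empty descriptor. */
static void page_buf_free(struct page_buf *pb)
{
	if (pb->buf)
		munmap(pb->buf, pb->length);
	pb->buf = NULL;
	pb->length = 0;
}
```

Each allocation costs exactly the rounded-up size with no malloc header,
at the price of one mmap() and one munmap() syscall per buffer.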


From abenjamin at sgi.com  Tue May 19 19:41:52 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Tue, 19 May 2009 19:41:52 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <adad4aaf3dk.fsf@cisco.com>
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
Message-ID: <4A136DF0.7000402@sgi.com>

Roland Dreier wrote:
> So I would suggest reworking this into a series of patches:
>
> 1. Add a function ib_alloc_device_set_name() that does what your
>    ib_init_device() function does.  (By the way, there is a problem with
>    your implementation, since alloc_name() just checks the list of
>    registered devices for a collision -- so devices that are allocated
>    but not registered could be assigned the same name, if the kernel
>    ever moves to parallelizing PCI probing or something like that -- so
>    you should probably fix alloc_name() to check a list of all allocated
>    devices or something like that)
>   
The current implementation of the IB core module doesn't maintain a list of
allocated IB devices. Are you suggesting that we create a separate list of
allocated but not registered devices in addition to the existing list of
registered devices? Please clarify.

Alternatively, we can use the existing registered devices list named
'device_list' in the IB core module to keep track of both allocated and
registered devices. Currently, the ib_device can be in one of three states
(IB_DEV_UNINITIALIZED, IB_DEV_REGISTERED, IB_DEV_UNREGISTERED). We can
enhance this to include an 'INITIALIZED' state and add the ib_device to
'device_list' with this new state at ib_alloc_device_set_name() time. In
this case, there will be no changes to alloc_name() as it is already
checking for device name collisions in a single list irrespective of the
state of the device.


> 2. For each RDMA driver (ie each of drivers/infiniband/hw/xxx), convert
>    to using ib_init_device_alloc_name() -- one patch per driver.
>   

I wanted to point out that the proposed patch will not fix the
/proc/interrupts reporting issue for ConnectX IB devices because
request_irq() is done by mlx4_core and not by mlx4_ib. Also, mlx4_core
doesn't plug into the IB core module.


> 3. Remove the old ib_alloc_device() and rename
>    ib_alloc_device_set_name() back to ib_alloc_device().     
>
>  - R.
>   
I assume that there will be a transition period to allow deprecation of
ib_alloc_device_set_name() before we can apply this patch. Is my
assumption correct?

Regards,
Benjamin


From rdreier at cisco.com  Tue May 19 21:04:39 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 19 May 2009 21:04:39 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <4A136DF0.7000402@sgi.com> (Arputham Benjamin's message of "Tue, 
	19 May 2009 19:41:52 -0700")
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com>
Message-ID: <adaab58ctd4.fsf@cisco.com>

[please hit "enter" every 70-80 characters or so, it makes email easier
to read and quote]

 > The current implementation of IB core module doesn't maintain a list
 > of allocated IB devices. Are you suggesting that we create a separate
 > list of allocated but not registered devices in addition to the
 > existing list of registered devices. Please clarify.

 > Alternatively, we can use the existing registered devices list named
 > 'device_list' in the IB core module to keep track of both allocated
 > and registered devices. Currently, the ib_device can be in one of
 > three states(IB_DEV_UNINITIALIZED, IB_DEV_REGISTERED,
 > IB_DEV_UNREGISTERED). We can enhance this to include 'INITIALIZED'
 > state and add the ib_device to 'device_list' with this new state at
 > ib_alloc_device_set_name() time. In this case, there will be no
 > changes to alloc_name() as it is already checking for device name
 > collision in a single list irrespective of the state of the device.

The second solution (adding an INITIALIZED state) seems simpler.  In
fact we could get rid of the UNINITIALIZED state after the patch series
since there wouldn't be a way to allocate an uninitialized structure.
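
Roughly, the idea is that a device goes onto the one list as soon as it is
named, so the name search sees allocated-but-unregistered devices too. The
following is only a userspace sketch of that idea, not the IB core code:
struct dev, the DEV_* states, name_in_use() and dev_alloc_set_name() are
hypothetical names, and the locking a real implementation would need
around the list is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEV_NAME_MAX	64
#define DEV_INDEX_MAX	255

enum dev_state {
	DEV_INITIALIZED,	/* allocated and named, not yet registered */
	DEV_REGISTERED,
	DEV_UNREGISTERED
};

struct dev {
	char name[DEV_NAME_MAX];
	enum dev_state state;
	struct dev *next;
};

/* Single list holding devices in every state. */
static struct dev *device_list;

/* Collision check ignores the state: any listed device owns its name. */
static int name_in_use(const char *name)
{
	const struct dev *d;

	for (d = device_list; d; d = d->next)
		if (strcmp(d->name, name) == 0)
			return 1;
	return 0;
}

/*
 * Pick the first free "<prefix><index>" name, then put the device on
 * the list in INITIALIZED state. Returns 0 on success, -1 if no index
 * is free.
 */
static int dev_alloc_set_name(struct dev *d, const char *prefix)
{
	char candidate[DEV_NAME_MAX];
	int i;

	assert(d != NULL && prefix != NULL);
	for (i = 0; i <= DEV_INDEX_MAX; i++) {
		snprintf(candidate, sizeof(candidate), "%s%d", prefix, i);
		if (name_in_use(candidate))
			continue;
		strcpy(d->name, candidate);
		d->state = DEV_INITIALIZED;
		d->next = device_list;
		device_list = d;
		return 0;
	}
	return -1;
}
```

Because naming and list insertion happen in one step, two devices that
are allocated back to back get distinct names even if neither has been
registered yet.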

 > I wanted to point out that the proposed patch will not fix the
 > /proc/interrupts reporting issue for ConnectX IB devices because
 > request_irq() is done by mlx4_core and not by mlx4_ib. Also,
 > mlx4_core doesn't plug into IB core module.

Good point.  So I guess we should try to come up with a more general way
that works for mlx4 as well.  Perhaps enhance the PCI core so that all
MSI-X vectors for a device are reported in the /sys hierarchy (analogous
to the existing irq file that is under /sys/devices), which would work
for all possible devices, rather than having an RDMA-specific method?

 > I assume that there will be a transition period to allow deprecation
 > of ib_alloc_device_set_name() before we can apply this patch. Is my
 > assumption correct?

No, once all the drivers in the kernel are converted to the new API,
there's no longer any point in keeping the old API (especially given how
rare new RDMA drivers are).

 - R.


From zafargilani at gmail.com  Tue May 19 21:25:53 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 09:25:53 +0500
Subject: [ofa-general] Executing IB Verbs/RDMA client/server code via JNI
Message-ID: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com>

I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
implement a communication device over InfiniBand fabric. I have executed
client/server code (most part from Roland Dreier, CISCO) and it works
absolutely fine. However when I try to call the same thing via JNI, the code
gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
"cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve
the error.

The JVM actually crashes, stating that the error is at frame "ibv_alloc_pd" and
that the crash happened in native code. That much is understandable, but the
error itself is not, since the same code works when built and run directly
with the C compiler but gives trouble with JNI.

Compilers:
   java version "1.6.0_07"
   Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
   Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

   gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Environment:
   Red Hat 4.12
   2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz

I have attached the compressed file that contains all the files (.java, .h,
.c and .log). I was hoping that someone could maybe point me in the right
direction.

Any help will be greatly appreciated.

Regards,
-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar <http://hpc.niit.edu.pk/%7Ezafar>

From zafargilani at gmail.com  Tue May 19 21:27:23 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 09:27:23 +0500
Subject: [ofa-general] Re: Executing IB Verbs/RDMA client/server code via JNI
In-Reply-To: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com>
References: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com>
Message-ID: <7d4423d30905192127h47379d3ch61070741d9292f88@mail.gmail.com>

Sorry forgot to attach the tarball.

On Wed, May 20, 2009 at 9:25 AM, Zafar Gilani <zafargilani at gmail.com> wrote:

> I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
> implement a communication device over InfiniBand fabric. I have executed
> client/server code (most part from Roland Dreier, CISCO) and it works
> absolutely fine. However when I try to call the same thing via JNI, the code
> gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
> "cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve
> the error.
>
> The JVM actually crashes, stating that the error is at frame "ibv_alloc_pd" and
> that the crash happened in native code. That much is understandable, but the
> error itself is not, since the same code works when built and run directly
> with the C compiler but gives trouble with JNI.
>
> Compilers:
>    java version "1.6.0_07"
>    Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
>    Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
>
>    gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>
> Environment:
>    Red Hat 4.12
>    2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz
>
> I have attached the compressed file that contains all the files (.java, .h,
> .c and .log). I was hoping that someone could maybe point me in the right
> direction.
>
> Any help will be greatly appreciated.
>
> Regards,
> --
> Syed Zafar ul Hussan Gilani | BIT-7
> Research Student | CHPSC
> MSP 2008-09
> NUST SEECS | http://hpc.niit.edu.pk/~zafar<http://hpc.niit.edu.pk/%7Ezafar>
>


-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jni.tar
Type: application/x-tar
Size: 49664 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090520/643a8b33/attachment.tar>

From sebastien.dugue at bull.net  Tue May 19 23:00:47 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Wed, 20 May 2009 08:00:47 +0200
Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation
 of QP buffers with 64K pages
In-Reply-To: <adaiqjwda6u.fsf@cisco.com>
References: <20090518095156.7f9c39e6@frecb007965>
	<20090518095516.6a803492@frecb007965> <adaiqjwda6u.fsf@cisco.com>
Message-ID: <20090520080047.3d20cce7@frecb007965>


  Hi Roland,

On Tue, 19 May 2009 15:01:13 -0700
Roland Dreier <rdreier at cisco.com> wrote:

>  >   QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers
>  > size to the page size and then allocates page aligned memory using
>  > posix_memalign().
>  > 
>  >   However, this allocation is quite wasteful on architectures using 64K pages
>  > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
>  > parameter and chunks are allocated using mmap. thus we end up allocating:
>  > 
>  > (requested size rounded to the page size) + (page size) + (malloc overhead)
>  > 
>  > rounded internally to the page size.
>  > 
>  >   So for example, if we request a buffer of page_size bytes, we end up
>  > consuming 3 pages. In short, for each QP buffer we allocate, there is an
>  > overhead of 2 pages. This is quite visible on large clusters especially where
>  > the number of QP can reach several thousands.
>  > 
>  >   This patch creates a new function mlx4_alloc_page() for use by
>  > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
>  > the page size is 64K.
> 
> makes sense I guess.  It would be nice if glibc() were smart enough to
> know that mmap(MAP_ANONYMOUS) is going to give something page-aligned
> anyway,

  If you mean in the posix_memalign() path, then yes it'd be really nice.

> but it seems that malloc overhead (required to make the memory
> from posix_memalign() work with free()) is going to cost at least one
> extra page, which as you point out is pretty bad with 64KB pages.  (Of
> course 64KB pages are a disaster for any workload that deals with small
> objects of any kind, but that's another story)

  Yep, agreed.

> 
> However I wonder why we want to make this optimization only for 64KB
> pages.  It seems the code would be simpler if we just had our own
> page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it
> unconditionally everywhere.  Or is it not actually better even on
> sane-sized (ie 4KB) page systems?  It seems we still have the malloc
> overhead which is going to cost us another page?

  Well not really, because if we stay below MMAP_THRESHOLD, as we do
with 4K pages, the only overhead is malloc's chaining structure. The
extra space used to align the buffer is released before posix_memalign()
returns, but that does increase fragmentation of malloc's chunks.
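
  To put numbers on the 64K case from the patch description, here is a
minimal sketch of that estimate. The helper names and the 16-byte malloc
overhead are assumptions for illustration only, not values taken from
glibc:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of align (align must be a power of two). */
static size_t round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/*
 * Estimated number of pages consumed when posix_memalign() ends up on the
 * mmap() path, following the formula from the patch description:
 *   (request rounded to page size) + (page size) + (malloc overhead),
 * itself rounded up to the page size.
 * malloc_overhead is an assumed per-chunk bookkeeping size.
 */
size_t mmapped_pages(size_t request, size_t page_size, size_t malloc_overhead)
{
    size_t total = round_up(request, page_size) + page_size + malloc_overhead;

    return round_up(total, page_size) / page_size;
}
```

  With a 64K page size, mmapped_pages(65536, 65536, 16) gives 3, i.e. the
two pages of overhead per QP buffer mentioned in the patch description.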

  Also, for 4K pages, mmap() systematically results in a syscall whereas
posix_memalign() does not necessarily, but as we're not on a fast path
I'm not sure what would be best. I don't mind converting all QP buffer
allocations to mmap(), but I'd like to hear what people think.

  Thanks Roland,

  Sebastien.


From ogerlitz at Voltaire.com  Wed May 20 00:14:34 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 20 May 2009 10:14:34 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090519215505.GN6837@sgi.com>
References: <20090519215505.GN6837@sgi.com>
Message-ID: <4A13ADDA.5040908@Voltaire.com>

akepner at sgi.com wrote:
> We've seen a few instances of a crash in ipoib_neigh_cleanup() due to the use of a stale pointer:
> 848         neigh = *to_ipoib_neigh(n); <- read neigh (no locking)
> .....
> 858         spin_lock_irqsave(&priv->lock, flags);
> 860         if (neigh->ah) <--- at this point neigh may be stale
[...]
> I've been looking into a proper fix for this, and I'd like to solicit any ideas. 

Before going into possible solutions, could you say what kernel you are using?

With this or related problems having been around for a couple of years, I have always wanted to understand (A) why access to a neighbour that is, from the kernel's point of view, about to be destructed needs to be protected, and (B) how it can become stale. Before 2.6.17-20 or so this could have happened since the ipoib neighbour destructor could have been called for non-ipoib neighbours, which to my understanding isn't the case any more in modern kernels; see commit ecbb416939da77c0d107409976499724baddce7b "[NET]: Fix neighbour destructor handling".

Or.


From vlad at lists.openfabrics.org  Wed May 20 03:22:31 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 20 May 2009 03:22:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090520-0200 daily build status
Message-ID: <20090520102231.5EEB4E61401@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From eli at dev.mellanox.co.il  Wed May 20 04:10:36 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 20 May 2009 14:10:36 +0300
Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory
	allocation of QP buffers with 64K pages
In-Reply-To: <20090520080047.3d20cce7@frecb007965>
References: <20090518095156.7f9c39e6@frecb007965>
	<20090518095516.6a803492@frecb007965> <adaiqjwda6u.fsf@cisco.com>
	<20090520080047.3d20cce7@frecb007965>
Message-ID: <20090520111036.GA13831@mtls03>

On Wed, May 20, 2009 at 08:00:47AM +0200, sebastien dugue wrote:
> 
>   Well not really, because if we stay below MMAP_THRESHOLD, as we do
> with 4K pages, the only overhead is malloc's chaining structure. The
> extra space used to align the buffer is released before posix_memalign()
> returns, but that does increase fragmentation of mallocs chunks.
> 
>   Also, for 4K pages, mmap() systematically results in a syscall whereas
> posix_memalign() does not necessarily, but as we're not on a fast path
> I'm not sure what would be best. I don't mind converting all QP buffers
> allocation to mmap(), but I'd like to hear what people think.
> 

If the only reasoning behind using a MMAP_THRESHOLD is to avoid the
system call for smaller allocations, then I think we'd better use a
uniform allocation scheme -- mmap -- as you proposed and not
distinguish between the two cases.


From sebastien.dugue at bull.net  Wed May 20 04:39:06 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Wed, 20 May 2009 13:39:06 +0200
Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory
	allocation of QP buffers with 64K pages
In-Reply-To: <20090520111036.GA13831@mtls03>
References: <20090518095156.7f9c39e6@frecb007965>
	<20090518095516.6a803492@frecb007965> <adaiqjwda6u.fsf@cisco.com>
	<20090520080047.3d20cce7@frecb007965>
	<20090520111036.GA13831@mtls03>
Message-ID: <20090520133906.552fee01@frecb007965>

On Wed, 20 May 2009 14:10:36 +0300
Eli Cohen <eli at dev.mellanox.co.il> wrote:

> On Wed, May 20, 2009 at 08:00:47AM +0200, sebastien dugue wrote:
> > 
> >   Well not really, because if we stay below MMAP_THRESHOLD, as we do
> > with 4K pages, the only overhead is malloc's chaining structure. The
> > extra space used to align the buffer is released before posix_memalign()
> > returns, but that does increase fragmentation of mallocs chunks.
> > 
> >   Also, for 4K pages, mmap() systematically results in a syscall whereas
> > posix_memalign() does not necessarily, but as we're not on a fast path
> > I'm not sure what would be best. I don't mind converting all QP buffers
> > allocation to mmap(), but I'd like to hear what people think.
> > 
> 
> If the only reasoning behind using a MMAP_THRESHOLD is to avoid the
> system call for smaller allocations,

  Well, that's not the only reason. From what I understand, for small
allocations, glibc's malloc can recycle freed heap chunks much more easily
than mmapped chunks. Also the mmapped chunk must be zeroed by the kernel
before being handed to the user, which does not come for free.


> then I think we'd better use a
> uniform allocation scheme -- mmap -- as you proposed and not
> distinguish between the two cases.
> 

  I will respin those patches early next week if nobody disagrees with
this route.

  Thanks,

  Sebastien.


From zafargilani at gmail.com  Tue May 19 21:14:02 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 09:14:02 +0500
Subject: [ofa-general] Executing client/server code via JNI
In-Reply-To: <7d4423d30905182216q5db826c1m48e3989d6e0df35f@mail.gmail.com>
References: <7d4423d30905182216q5db826c1m48e3989d6e0df35f@mail.gmail.com>
Message-ID: <7d4423d30905192114k4624f962p27f62886d7cc66f1@mail.gmail.com>

I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
implement a communication device over InfiniBand fabric. I have executed
client/server code (most part from Roland Dreier, CISCO) and it works
absolutely fine. However when I try to call the same thing via JNI, the code
gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
"cm_id->verbs" and the protection domain "ibv_pd", these seem to work fine.

The JVM actually crashes, reporting that the error is at frame
"ibv_alloc_pd" and that the crash happened in native code. That much is
understandable, but the error itself is not, since the same code works when
compiled and run directly as a C program but fails when called via JNI.

Compilers:
   java version "1.6.0_07"
   Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
   Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

   gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Environment:
   Red Hat 4.12
   2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz

I have attached the compressed file that contains all the files (.java, .h,
.c and .log). I was hoping that somebody could give me any ideas about the
solution.

Any help will be greatly appreciated.

Regards,
-- 
Zafar Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jni.tar
Type: application/x-tar
Size: 49664 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090520/e8e6ee91/attachment.tar>

From zafargilani at gmail.com  Wed May 20 09:07:25 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 21:07:25 +0500
Subject: [ofa-general] Executing IB Verbs/RDMA client/server code via JNI
Message-ID: <7d4423d30905200907ob633063jf0aac806d4260f3e@mail.gmail.com>

This is my second message on the list. It is exactly the same as the first
one, which did not receive any replies. I would be thankful if anyone
could point out the problem in the attached code files. The problem is
explained below:

I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
implement a communication device over InfiniBand fabric. I have executed
client/server code (most part from Roland Dreier, CISCO) and it works
absolutely fine. However when I try to call the same thing via JNI, the code
gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
"cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve
the error.

The JVM actually crashes, reporting that the error is at frame
"ibv_alloc_pd" and that the crash happened in native code. That much is
understandable, but the error itself is not, since the same code works when
compiled and run directly as a C program but fails when called via JNI.

Compilers:
   java version "1.6.0_07"
   Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
   Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

   gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Environment:
   Red Hat 4.12
   2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz

I have attached the compressed file (jni.tar) that contains all the files
(.java, .h, .c and .log). I was hoping that someone could maybe point me in
the right direction.

Any help will be greatly appreciated.

Thanks,
-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jni.tar
Type: application/x-tar
Size: 49664 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090520/ced9f9bf/attachment.tar>

From zafargilani at gmail.com  Wed May 20 09:31:12 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 21:31:12 +0500
Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI
Message-ID: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com>

This is my second message on the list. It is exactly the same as the first
one, which did not receive any replies. I would be thankful if anyone
could point out the problem in the code files. The problem is explained below:

I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
implement a communication device over InfiniBand fabric. I have executed
client/server code (most part from Roland Dreier, CISCO) and it works
absolutely fine. However when I try to call the same thing via JNI, the code
gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
"cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve
the error.

The JVM actually crashes, reporting that the error is at frame
"ibv_alloc_pd" and that the crash happened in native code. That much is
understandable, but the error itself is not, since the same code works when
compiled and run directly as a C program but fails when called via JNI.

Compilers:
   java version "1.6.0_07"
   Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
   Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

   gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Environment:
   Red Hat 4.12
   2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz

The compressed file (jni.tar) that contains all the files (.java, .h, .c and
.log) is available at [http://hpc.niit.edu.pk/~zafar/work/ib/jni.tar]. I was
hoping that someone could maybe point me in the right direction.

Any help will be greatly appreciated.

Thanks,
-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar

From rdreier at cisco.com  Wed May 20 10:28:38 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 20 May 2009 10:28:38 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090519215505.GN6837@sgi.com> (akepner@sgi.com's message of
	"Tue, 19 May 2009 14:55:05 -0700")
References: <20090519215505.GN6837@sgi.com>
Message-ID: <adaws8bbs55.fsf@cisco.com>


 > We've seen a few instances of a crash in ipoib_neigh_cleanup() due to 
 > the use of a stale pointer:
 > 
 > 
 > 848         neigh = *to_ipoib_neigh(n); <- read neigh (no locking)
 > .....
 > 858         spin_lock_irqsave(&priv->lock, flags);
 > 859
 > 860         if (neigh->ah) <--- at this point neigh may be stale
 > 861                 ah = neigh->ah;
 > 862         list_del(&neigh->list);
 > 863         ipoib_neigh_free(n->dev, neigh);
 > 864
 > 865         spin_unlock_irqrestore(&priv->lock, flags);

I'd like to understand the bug first -- how is the neighbour being
destroyed out from under us in ipoib_neigh_cleanup()?  I would have
thought the cleanup function would run when no references to the struct
remain but before it's freed.

 - R.


From sokar6012 at hotmail.com  Wed May 20 12:58:31 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Wed, 20 May 2009 19:58:31 +0000
Subject: [ofa-general] Infiniband with Xen
Message-ID: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>


Hi,
I'm currently working on the latest version of Xen with Debian (lenny), and I have set up two PCI passthrough devices. The first one is eth1 and I have no problem with it, but the second one is an InfiniBand adapter (MT23108 Cougar revision A1, latest firmware 3.5) and it is not working.
I got this message with dmesg on my DomU:

[    4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[    4.111047] ib_mthca: Initializing 0000:00:00.0
[    4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
[    4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
[   15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
[   15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.

Do you know if there is a solution, like previously with Xen smartio or Xen-IB (which is no longer developed), to do high-performance VMM-bypass I/O in virtual machines?

From abenjamin at sgi.com  Wed May 20 13:24:34 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 20 May 2009 15:24:34 -0500
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>

> > I wanted to point out that the proposed patch will not fix the
> > /proc/interrupts reporting issue for ConnectX IB devices because
> > request_irq() is done by mlx4_core and not by mlx4_ib. Also,
> > mlx4_core doesn't plug into IB core module.

> Good point.  So I guess we should try to come up with a more general way
> that works for mlx4 as well.  Perhaps enhance the PCI core so that all
> MSI-X vectors for a device are reported in the /sys hierarchy (analogous
> to the existing irq file that is under /sys/devices), which would work
> for all possible devices, rather than having an RDMA-specific method?

> - R.

Can I proceed with the ib_alloc_device_set_name() IB core API changes
and the mthca driver changes we agreed on? After we test and apply these
patches, we can take a look at how we can fix mlx4 as well.

Please confirm.

Regards,
Benjamin


From rdreier at cisco.com  Wed May 20 13:54:07 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 20 May 2009 13:54:07 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	(Arputham Benjamin's message of "Wed, 20 May 2009 15:24:34 -0500")
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
Message-ID: <adaskizbimo.fsf@cisco.com>

 > Can I proceed with the ib_alloc_device_set_name()IB core API changes,
 > and mthca driver changes we agreed? After we test and apply these
 > patches, we can take a look at how we can fix mlx4 as well.

I think it would be much better to come up with a way to handle mlx4 as
well.  There's not much point in making core changes if they don't fix
the issue for all drivers.

 - R.


From rdreier at cisco.com  Wed May 20 13:55:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 20 May 2009 13:55:49 -0700
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl> (anthony garnier's
	message of "Wed, 20 May 2009 19:58:31 +0000")
References: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>
Message-ID: <adaoctnbiju.fsf@cisco.com>

 > I'm currently working on the latest version of xen with debian (lenny), and I have done 2 PCI passthrough, the fisrt one is with eth1 and i got no probleme with this one, but the second one is with a infiniband adapter ( MT23108 Cougar revision A1,latest firmware 3.5) => It's not working
 > I got this message with dmesg on my DomU :
 > 
 > [    4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
 > [    4.111047] ib_mthca: Initializing 0000:00:00.0
 > [    4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
 > [    4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
 > [   15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
 > [   15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.
 > 
 > Do you know if there is a solution like previously with XEN smartio or Xen-IB  (wich is no more developped) to do  High Performance VMM-Bypass I/O in Virtual Machines.

I'm not sure about smartio or Xen-IB, but you could try assigning both
HCA PCI devices to your domU and see if it works better (the HCA should
appear in lspci as both a PCI bridge and an actual HCA device, and the
driver expects to find both)

 - R.


From akepner at sgi.com  Wed May 20 14:37:03 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 20 May 2009 14:37:03 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <adaws8bbs55.fsf@cisco.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
Message-ID: <20090520213703.GT6837@sgi.com>

On Wed, May 20, 2009 at 10:28:38AM -0700, Roland Dreier wrote:
> 
>  > We've seen a few instances of a crash in ipoib_neigh_cleanup() due to 
>  > the use of a stale pointer:
>  > 
>  > 
>  > 848         neigh = *to_ipoib_neigh(n); <- read neigh (no locking)
>  > .....
>  > 858         spin_lock_irqsave(&priv->lock, flags);
>  > 859
>  > 860         if (neigh->ah) <--- at this point neigh may be stale
>  > 861                 ah = neigh->ah;
>  > 862         list_del(&neigh->list);
>  > 863         ipoib_neigh_free(n->dev, neigh);
>  > 864
>  > 865         spin_unlock_irqrestore(&priv->lock, flags);
> 
> I'd like to understand the bug first -- how is the neighbour being
> destroyed out from under us in ipoib_neigh_cleanup()?  I would have
> thought the cleanup function would run when no references to the struct
> remain but before it's freed.
> 

I should have been more specific - it's not the ipoib_neigh structure 
pointer itself, but the list inside the structure where we've found a 
problem. The specific crash we are trying to fix is when someone does 
list_del(&neigh->list) just before we acquire the lock at line 858. 
(But all the callers of list_del(&neigh->list) subsequently call 
ipoib_neigh_free(), too, so the neigh pointer is bad.)

The signature of the crash is like this:

Unable to handle kernel paging request at 0000000000100108 RIP:
                                          ^^^^^^^^^^^^^^^^
                                          LIST_POISON1+0x8
<ffffffff882a3794>{:ib_ipoib:ipoib_neigh_cleanup+368}
PGD 4152b3067 PUD 413ee4067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /class/infiniband/mthca1/node_type
CPU 7
Modules linked in: sg sd_mod crc32c libcrc32c rdma_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi loop numatools xpmem i2c_i801 libata i2c_core scsi_mod shpchp pci_hotplug nfs lockd nfs_acl af_packet sunrpc e1000
Pid: 0, comm: swapper Tainted: G     U 2.6.16.54-0.2.5-smp #1
RIP: 0010:[<ffffffff882a3794>] <ffffffff882a3794>{:ib_ipoib:ipoib_neigh_cleanup+368}
RSP: 0018:ffff81042088bea8  EFLAGS: 00010082
RAX: 0000000000200200 RBX: ffff8104162fdd40 RCX: ffff8104162fdd98
RDX: 0000000000100100 RSI: ffff8104162fdd40 RDI: ffff81041b2f8500
RBP: ffff8103c7600480 R08: ffff81041e7b10f0 R09: 0000000000000000
R10: ffff810420885e48 R11: 0000000000003a98 R12: ffff81041aa39480
R13: ffff81041b2f8500 R14: 0000000000000246 R15: ffffffff803d3ff0
FS:  0000000000000000(0000) GS:ffff810420fdc2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100108 CR3: 0000000417469000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810420884000, task ffff81042083f100)
Stack: ffff8103c7600480 0000000000000000 ffff8103c7600480 0000000108a8ec53
       ffffffff803c2700 ffffffff80284ebd ffff8103c7600480 ffffffff803862a0
       ffff81041610c380 ffffffff802871ca
Call Trace: <IRQ> <ffffffff80284ebd>{neigh_destroy+197}
       <ffffffff802871ca>{neigh_periodic_timer+249} <ffffffff802870d1>{neigh_periodic_timer+0}
       <ffffffff8013ba84>{run_timer_softirq+348} <ffffffff801377d1>{__do_softirq+85}
       <ffffffff8010c11e>{call_softirq+30} <ffffffff8010d07c>{do_softirq+44}
       <ffffffff80109e3a>{mwait_idle+0} <ffffffff8010ba78>{apic_timer_interrupt+132} <EOI>
       <ffffffff80109e3a>{mwait_idle+0} <ffffffff80109e70>{mwait_idle+54}
       <ffffffff80109e17>{cpu_idle+151} <ffffffff80119036>{start_secondary+1240}
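
The faulting address is what a second list_del() on an already deleted
entry would produce. A minimal userspace sketch of the arithmetic,
assuming the kernel's LIST_POISON1/LIST_POISON2 values of 0x00100100 and
0x00200200 and the usual two-pointer layout of struct list_head (the
helper name is made up for illustration; nothing is dereferenced):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Values list_del() leaves in a deleted entry (assumed, 2.6.16 era). */
#define LIST_POISON1 ((uintptr_t)0x00100100)    /* stored in entry->next */
#define LIST_POISON2 ((uintptr_t)0x00200200)    /* stored in entry->prev */

/*
 * A second list_del() on the same entry would execute next->prev = prev
 * with next == LIST_POISON1, i.e. a store of LIST_POISON2 to the address
 * computed here.
 */
uintptr_t second_list_del_fault_addr(void)
{
    /* next is the first member, so prev sits one pointer further in. */
    assert(offsetof(struct list_head, next) == 0);
    return LIST_POISON1 + offsetof(struct list_head, prev);
}
```

On x86_64 the offset of prev is 8, which gives 0x100108 and matches CR2
above; RDX and RAX hold the two poison values, and the 0002 error code
indicates a write.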

-- 
Arthur


From abenjamin at sgi.com  Wed May 20 16:54:05 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 20 May 2009 18:54:05 -0500
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>

>> Can I proceed with the ib_alloc_device_set_name()IB core API changes,
>> and mthca driver changes we agreed? After we test and apply these
>> patches, we can take a look at how we can fix mlx4 as well.

> I think it would be much better to come up with a way to handle mlx4 as
> well.  There's not much point in making core changes if they don't fix
> the issue for all drivers.

> - R.

I wanted to add some clarification.

We have two types of IB devices:
1) Devices that can operate as an InfiniBand adapter only
2) Devices that can operate as an InfiniBand adapter or as an Ethernet NIC

In the current implementation of the OFED stack, the driver architecture
of #2 is very different from #1 because it needs to make sure the InfiniBand
and Ethernet functions can share the device without interfering with
each other.

I was thinking that we can fix the /proc/interrupts issue for case #1 first
and worry about #2 later, because the design to fix /proc/interrupts
for the mlx4 case is going to be different and independent, just as the
driver design is different and independent for the two cases today.

We don't have a common kernel module in the OFED stack that plugs into
both types of IB devices as far as interrupt resource allocation
is concerned. I think creating such a module would be a fundamental
S/W arch change and would require a lot of changes to adapt to it.

Please let me know if you still think we need a common solution for
both cases mentioned above. Any suggestions at a high level for such
a common solution?

Thank you for your help.

Regards,
Benjamin

From ahomike at us.ibm.com  Wed May 20 17:23:01 2009
From: ahomike at us.ibm.com (Mike Aho)
Date: Wed, 20 May 2009 19:23:01 -0500
Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X
Message-ID: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>

I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped for 
an RNIC (Ethernet RDMA card).  Is there doc somewhere to explain how the 
fields are addressed differently than an IB HCA?  Is there a set of 
conventions? 

Mike  Aho


From swise at opengridcomputing.com  Wed May 20 18:47:12 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 May 2009 20:47:12 -0500
Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X
In-Reply-To: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>
References: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>
Message-ID: <4A14B2A0.8000704@opengridcomputing.com>

Mike Aho wrote:

> I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped 
> for an RNIC (Ethernet RDMA card).  Is there doc somewhere to explain 
> how the fields are addressed differently than an IB HCA?  Is there a 
> set of conventions?  
>
> Mike  Aho

Hey Mike,

Unfortunately, no doc.  What fields are not clear?

Steve.


From jgunthorpe at obsidianresearch.com  Wed May 20 20:28:51 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 20 May 2009 21:28:51 -0600
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
Message-ID: <20090521032851.GC11690@obsidianresearch.com>

On Wed, May 20, 2009 at 06:54:05PM -0500, Arputham Benjamin wrote:

> I was thinking that we can fix /proc/interrupts issue for case#1 first
> and worry about #2 later because the design to fix /proc/interrupts
> for mlx4 case is going to be different and independent just as the
> driver design is different and independent for the two cases today.

I notice on my system with 2.6.28 some drivers are appending the PCI
ID:

 17:  101272606   IO-APIC-fasteoi   ATI IXP, radeon at pci:0000:01:05.0

This is a lot simpler than trying to create small monotonic numbers.
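
A minimal sketch of that naming scheme in plain C. The helper name and
the "@pci:" separator are assumptions for illustration; in a driver the
resulting string would be what gets passed as the name argument of
request_irq(), and it has to stay valid for as long as the IRQ is
registered:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an interrupt name of the form "<driver>@pci:<address>".
 * Returns 0 on success, -1 if an input is empty or the buffer is too small.
 */
int format_irq_name(char *buf, size_t len, const char *drv,
                    const char *pci_addr)
{
    int n;

    assert(buf != NULL && drv != NULL && pci_addr != NULL);
    if (strlen(drv) == 0 || strlen(pci_addr) == 0)
        return -1;

    n = snprintf(buf, len, "%s@pci:%s", drv, pci_addr);
    if (n < 0 || (size_t)n >= len)
        return -1;      /* output error or truncated */
    return 0;
}
```

For example, format_irq_name(name, sizeof(name), "ib_mthca",
"0000:01:05.0") would yield "ib_mthca@pci:0000:01:05.0", so two cards of
the same type show up as distinct lines in /proc/interrupts.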

Jason


From ogerlitz at voltaire.com  Wed May 20 22:32:22 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 21 May 2009 08:32:22 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090520213703.GT6837@sgi.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com>
Message-ID: <4A14E766.1010005@voltaire.com>

akepner at sgi.com wrote:
> I should have been more specific [...] {:ib_ipoib:ipoib_neigh_cleanup+368}

yes, being more specific here helps because

> Pid: 0, comm: swapper Tainted: G     U 2.6.16.54-0.2.5-smp 
it's very likely that the problem you face in 2.6.16 was fixed by the 
commit I pointed to in my previous reply on this thread.

Or.


From ogerlitz at voltaire.com  Wed May 20 22:41:56 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 21 May 2009 08:41:56 +0300
Subject: [ofa-general] ibv_devinfo -v conventions for RNIC 
In-Reply-To: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>
References: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>
Message-ID: <4A14E9A4.9090600@voltaire.com>

Mike Aho wrote:
> Is there doc somewhere to explain how the fields are addressed 
> differently than an IB HCA?  Is there a set of conventions? 
try 
https://wiki.openfabrics.org/tiki-index.php?page=Verbs%3A+Infiniband+vs+iWARP

Or.


From vlad at lists.openfabrics.org  Thu May 21 03:23:31 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 21 May 2009 03:23:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090521-0200 daily build status
Message-ID: <20090521102332.4133CE60E8C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From amirv at mellanox.co.il  Thu May 21 04:58:06 2009
From: amirv at mellanox.co.il (Amir Vadai)
Date: Thu, 21 May 2009 14:58:06 +0300
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
Message-ID: <4A1541CE.9090509@mellanox.co.il>

Sean Hi,

I need to know how the LocalAckTimeout for a new QP is calculated 
(for an SDP QP).
Is there a way to change it through a module parameter?
If not, what is the right way to change it?

Thanks,
Amir


From Zhen.Liang at Sun.COM  Thu May 21 05:03:12 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Thu, 21 May 2009 20:03:12 +0800
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
In-Reply-To: <4A12C200.4000708@mellanox.co.il>
References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com>
	<4A12C200.4000708@mellanox.co.il>
Message-ID: <4A154300.6020100@sun.com>

Tziporet,

I have two x4600 machines that I believe are identical, but on the one
that failed to start up, running mstflint gives:
mstflint -d 03:00.0 q
Warning: memory access to device 03:00.0 failed: Input/output error.
Warning: Fallback on IO: much slower, and unsafe if device in use.
*** ERROR *** Can not open 03:00.0: Not a directory MFE_CR_ERROR

On the other one (which loads the driver without error):

mstflint -d 03:00.0 q
Image type: Failsafe
I.S. Version: 1
Chip Revision: A0
Description: Node Port1 Port2 Sys image
GUIDs: 00066a0098006abd 00066a00a0006abd 00066a01a0006abd 00066a0098006abd
Board ID: j (MT_00A0000001)
VSD: j
PSID: MT_00A0000001

mstflint -d 03:00.0 v

Failsafe image:

Invariant /0x00000028-0x0000095f (0x000938)/ (BOOT2) - OK

Primary Image /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK
/0x00030028-0x000308af (0x000888)/ (BOOT2) - OK
/0x000308b0-0x00034feb (0x00473c)/ (BOOT2) - OK
/0x00034fec-0x00035edb (0x000ef0)/ (Configuration) - OK
/0x00035edc-0x00035f0f (0x000034)/ (GUID) - OK
/0x00035f10-0x0003ed63 (0x008e54)/ (DDR) - OK
/0x0003ed64-0x0004d63b (0x00e8d8)/ (DDR) - OK
/0x0004d63c-0x00050573 (0x002f38)/ (DDR) - OK
/0x00050574-0x0005204f (0x001adc)/ (DDR) - OK
/0x00052050-0x0006accf (0x018c80)/ (DDR) - OK
/0x0006acd0-0x0007f23f (0x014570)/ (DDR) - OK
/0x0007f240-0x0007f253 (0x000014)/ (Configuration) - OK
/0x0007f254-0x0007f297 (0x000044)/ (Jump addresses) - OK
/0x0007f298-0x0007f33f (0x0000a8)/ (FW Configuration) - OK

Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
/0x00080028-0x000808af (0x000888)/ (BOOT2) - OK
/0x000808b0-0x00084feb (0x00473c)/ (BOOT2) - OK
/0x00084fec-0x00085edb (0x000ef0)/ (Configuration) - OK
/0x00085edc-0x00085f0f (0x000034)/ (GUID) - OK
/0x00085f10-0x0008ed63 (0x008e54)/ (DDR) - OK
/0x0008ed64-0x0009d63b (0x00e8d8)/ (DDR) - OK
/0x0009d63c-0x000a0573 (0x002f38)/ (DDR) - OK
/0x000a0574-0x000a204f (0x001adc)/ (DDR) - OK
/0x000a2050-0x000baccf (0x018c80)/ (DDR) - OK
/0x000bacd0-0x000cf23f (0x014570)/ (DDR) - OK
/0x000cf240-0x000cf253 (0x000014)/ (Configuration) - OK
/0x000cf254-0x000cf297 (0x000044)/ (Jump addresses) - OK
/0x000cf298-0x000cf33f (0x0000a8)/ (FW Configuration) - OK

FW image verification succeeded. Image is bootable.


Thanks
Liang

Tziporet Koren wrote:
> Liang Zhen wrote:
>   
>> Hi Ole,
>> Have you got solution for this? I think we got exactly same problem on
>> 4600 with ofed-1.4.1-rc4:
>> lspci output:
>> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe
>> 2.0 2.5GT/s] (rev a0)
>>
>> and error messages from dmesg:
>>
>> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
>> mlx4_core: Initializing 0000:03:00.0
>> mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1,
>> reducing to 1.
>> mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1
>> mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5)
>> mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting.
>> mlx4_core: probe of 0000:03:00.0 failed with error -5
>>
>>   
>>     
> Can you send me the FW version and board type
> Since the driver is not loading you can use mstflint to get this data
> Please use:
>
> The devices can be accessed by their PCI ID as displayed by lspci
> (bus:dev.fn).
> Example:
> # List all Mellanox devices
>   
>> /sbin/lspci -d 15b3:
>>     
> 02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368
> (rev a0)
>
> # Use mstflint tool to query the firmware on this device
>   
>> mstflint -d 02:00.0 q
>>     
>
> Tziporet
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From sokar6012 at hotmail.com  Thu May 21 05:05:13 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Thu, 21 May 2009 12:05:13 +0000
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: <adaoctnbiju.fsf@cisco.com>
References: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>
	<adaoctnbiju.fsf@cisco.com>
Message-ID: <BAY139-W216D23BBED1EB689F563D5AE590@phx.gbl>


Hi,
I have also tried passing through the PCI bridge, but it doesn't work. I got this from dmesg on my Dom0:

[ 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0     ( this is  HCA)
[ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1    ( This is eth1)
[ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2     ( this is the HCA bridge)
[ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2. This may be harmless, but if you have problems with your device:
[ 1298.655747] 1) see permissive attribute in sysfs
[ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.


> From: rdreier at cisco.com
> To: sokar6012 at hotmail.com
> CC: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Infiniband with Xen
> Date: Wed, 20 May 2009 13:55:49 -0700
> 
>  > I'm currently working on the latest version of Xen with Debian (lenny), and I have set up two PCI passthroughs. The first one is with eth1 and I have no problem with it, but the second one, with an InfiniBand adapter (MT23108 Cougar revision A1, latest firmware 3.5), is not working.
>  > I got this message with dmesg on my DomU :
>  > 
>  > [    4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>  > [    4.111047] ib_mthca: Initializing 0000:00:00.0
>  > [    4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
>  > [    4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
>  > [   15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
>  > [   15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.
>  > 
>  > Do you know if there is a solution, like the earlier Xen smartio or Xen-IB (which is no longer developed), for doing high-performance VMM-bypass I/O in virtual machines?
> 
> I'm not sure about smartio or Xen-IB, but you could try assigning both
> HCA PCI devices to your domU and see if it works better (the HCA should
> appear in lspci as both a PCI bridge and an actual HCA device, and the
> driver expects to find both)
> 
>  - R.


From ahomike at us.ibm.com  Thu May 21 05:42:16 2009
From: ahomike at us.ibm.com (Mike Aho)
Date: Thu, 21 May 2009 07:42:16 -0500
Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X
In-Reply-To: <4A14B2A0.8000704@opengridcomputing.com>
References: <OFB2DAEF73.FC5596D3-ON862575BD.0001D36B-862575BD.00021BB7@us.ibm.com>
	<4A14B2A0.8000704@opengridcomputing.com>
Message-ID: <OFD68C720F.EC57CD22-ON862575BD.004356E6-862575BD.0045CA65@us.ibm.com>

Steve,  Below is the verbose version I got as an example for a Chelsio 
card on a Dell machine.  I REALLY like that an RNIC can show its settings 
via ibv_devinfo.  But an RNIC is different from an IB card and perhaps 
ibv_devinfo should have an indicator [by port?] that differentiates an 
RNIC from an IB HCA and do the output differently.  I doubt 1.4.1 can 
change this short term but 1.5 could incorporate a meaningful change. 
Developing a "readme" to cover it would require separate maintenance and 
be challenging to keep updated; RNIC-meaningful output would be more 
useful.

It appears that the node_guid and sys_image_guid incorporate the MAC 
addresses into the format.  Are these fields really useful as GUIDs?  This 
also assumes a single port RNIC and I could see a two-port RNIC coming 
along or an IO card with an IB port and Ethernet (RNIC) port.
I could see MAC address(es) being shown in the port subsection based on a 
port indicator enum of IB HCA, Ethernet, etc.

The max_raw_ipv6_qp and max_raw_ethy_qp fields seem superfluous on a plain 
RNIC but perhaps make sense on a combined RNIC and HCA adapter.  So I think 
they can stay.

Under the port subsection, the use of the enum values for mtu size are 
good for an IB HCA but seem not applicable to an RNIC.  Perhaps these 
should move from an enum to a range of values for RNIC.
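
As a rough illustration of the enum encoding in question: in the output
below, max_mtu is printed as "4096 (5)" and active_mtu as "2048 (4)", which
fits the IB convention where enum value n stands for 128 << n bytes. The
sketch assumes that convention; mtu_enum_to_bytes is a hypothetical helper
name, not a libibverbs call.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of libibverbs): decode an IB-style MTU
 * enum value into bytes. Values 1..5 map to 256, 512, 1024, 2048, 4096.
 * Returns 0 for anything outside that range.
 */
static unsigned int mtu_enum_to_bytes(int mtu_enum)
{
	unsigned int bytes;

	if (mtu_enum < 1 || mtu_enum > 5)
		return 0;

	bytes = 128u << mtu_enum;
	assert(bytes >= 256 && bytes <= 4096);	/* sanity check on the mapping */
	return bytes;
}
```

An Ethernet MTU such as 1500 or 9000 has no slot in this table, which is
why a plain numeric range would suit an RNIC better.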

I can live with some of the other IB artifacts under the port subsection 
such as pkey and the table lengths but these could be suppressed on an 
RNIC port indicator in the future.  The active width, speed, and physical 
state need to change to address an RNIC port.

Mike Aho

hca_id: cxgb3_0
        fw_ver:                         7.0.0
        node_guid:                      0007:4305:6009:0000
        sys_image_guid:                 0007:4305:6009:0000
        vendor_id:                      0x1425
        vendor_part_id:                 48
        hw_ver:                         0x1
        board_id:                       1425.30
        phys_port_cnt:                  1
        max_mr_size:                    0x100000000
        page_size_cap:                  0xffff000
        max_qp:                         32736
        max_qp_wr:                      1023
        device_cap_flags:               0x00228000
        max_sge:                        4
        max_sge_rd:                     1
        max_cq:                         32767
        max_cqe:                        8192
        max_mr:                         32768
        max_pd:                         32767
        max_qp_rd_atom:                 8
        max_ee_rd_atom:                 0
        max_res_rd_atom:                0
        max_qp_init_rd_atom:            8
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_NONE (0)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        0
        max_pkeys:                      0
        local_ca_ack_delay:             0
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        max_msg_sz:             0xffffffff
                        port_cap_flags:         0x009f0000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            1
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           5.0 Gbps (2)
                        phys_state:             invalid physical state (0)

Mike  Aho
 

From: Steve Wise <swise at opengridcomputing.com>
To: Mike Aho/Rochester/IBM at IBMUS
Cc: general at lists.openfabrics.org
Date: 05/20/2009 08:47 PM
Subject: Re: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X


Mike Aho wrote:

> I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped 
> for an RNIC (Ethernet RDMA card).  Is there doc somewhere to explain 
> how the fields are addressed differently than an IB HCA?  Is there a 
> set of conventions? 
>
> Mike  Aho

Hey Mike,

Unfortunately, no doc.  What fields are not clear?

Steve.



From tziporet at dev.mellanox.co.il  Thu May 21 06:01:38 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 21 May 2009 16:01:38 +0300
Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes
In-Reply-To: <4A154300.6020100@sun.com>
References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com>
	<4A12C200.4000708@mellanox.co.il> <4A154300.6020100@sun.com>
Message-ID: <4A1550B2.4090506@mellanox.co.il>

Liang Zhen wrote:
> Tziporet,
>
> I get two x4600 and think they are same, but on the one failed to
> startup when I run mstflint:
> mstflint -d 03:00.0 q
> Warning: memory access to device 03:00.0 failed: Input/output error.
> Warning: Fallback on IO: much slower, and unsafe if device in use.
> *** ERROR *** Can not open 03:00.0: Not a directory MFE_CR_ERROR
>   
So you have some HW error. From mails with Ole, I understand his issue
was also a HW issue.

In his case it was an issue of having both the FC card and the IB card on the same south
bridge.

Please approach your HW vendor for resolution.

Tziporet


From hal.rosenstock at gmail.com  Thu May 21 06:20:00 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 21 May 2009 09:20:00 -0400
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
In-Reply-To: <4A1541CE.9090509@mellanox.co.il>
References: <4A1541CE.9090509@mellanox.co.il>
Message-ID: <f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>

On Thu, May 21, 2009 at 7:58 AM, Amir Vadai <amirv at mellanox.co.il> wrote:
> Sean Hi,
>
> I need to know how is the LocalAckTimeout for a new QP is calculated (for an
> SDP QP).

It's a function of the packet life time on the path from source to
destination and the local CA ack delay.
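
A minimal sketch of that relationship, assuming the usual IB encoding in
which a timeout value n means 4.096 us * 2^n, and assuming the target is
roughly twice the packet life time plus the local CA ack delay. The function
name and the round-up policy are illustrative only; this is not the CM's
actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: pick the smallest 5-bit timeout exponent n such that
 * 4.096 us * 2^n covers a round trip (2 * packet life time) plus the
 * CA ack delay. All three quantities use the same exponent encoding,
 * so the arithmetic is done in units of 4.096 us.
 */
static unsigned int local_ack_timeout(unsigned int packet_life_time,
				      unsigned int ca_ack_delay)
{
	uint64_t need;
	unsigned int n = 0;

	assert(packet_life_time < 32 && ca_ack_delay < 32);

	need = 2 * (UINT64_C(1) << packet_life_time) +
	       (UINT64_C(1) << ca_ack_delay);

	/* Round up to the next power of two, capped at the 5-bit maximum. */
	while (n < 31 && (UINT64_C(1) << n) < need)
		n++;

	return n;
}
```

With packet_life_time = 18 and ca_ack_delay = 15 this gives 20, i.e. about
4.3 seconds.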

> Is there a way to change it through a module parameter?
> If not, what is the right way to change it?

It depends on the SM being used. OpenSM has a way to change the path
PLT returned/used.

-- Hal

> Thanks,
> Amir


From tziporet at mellanox.co.il  Thu May 21 08:33:24 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 21 May 2009 18:33:24 +0300
Subject: [ofa-general] OFED 1.4.1-rc6  is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
	<5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02C12251@mtlexch01.mtl.com>

 
Hi,

OFED-1.4.1-rc6  release is available on

http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc6.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/  for
OFED 1.4.1

Vladimir & Tziporet

========================================================================


Release information:
------------------------------
Linux Operating Systems:
      - RedHat EL4 up4:  2.6.9-42.ELsmp      *
      - RedHat EL4 up5:  2.6.9-55.ELsmp
      - RedHat EL4 up6:  2.6.9-67.ELsmp
      - RedHat EL4 up7:  2.6.9-78.ELsmp
      - RedHat EL5:        2.6.18-8.el5
      - RedHat EL5 up1:  2.6.18-53.el5
      - RedHat EL5 up2:  2.6.18-92.el5
      - RedHat EL5 up3:  2.6.18-128.el5
      - OEL 4.5:              2.6.9-55.ELsmp
      - OEL 5.2:              2.6.18-92.el5
      - CentOS 5.2:         2.6.18-92.el5
      - Fedora C9:           2.6.25-14.fc9          *
      - SLES10:              2.6.16.21-0.8-smp
      - SLES10 SP1:       2.6.16.46-0.12-smp
      - SLES10 SP1 up1: 2.6.16.53-0.16-smp
      - SLES10 SP2:       2.6.16.60-0.21-smp
      - SLES11 GA:         2.6.27.13-1-default
      - OpenSuSE 10.3:   2.6.22.5-31             *
      - kernel.org:             2.6.26 and 2.6.27

    * Minimal QA for these versions

Systems:
      * x86_64
      * x86
      * ia64
      * ppc64


Main Changes from OFED-1.4.1-rc4
==========================
- Fixed all backport issues with NFS/RDMA
- mlx4_en: Updated driver to version 1.4.1 that was released by Mellanox
- Added an error in case of mlx4 library mismatch with kernel (due to
XRC support)
- 7 bugs fixed (see attachment)
- Updated bonding package: ib-bonding-0.9.0-40
- Updated MVAPICH package: 1.1.0-3355
- Updated documentation

Tasks that should be completed for GA (May 27):
====================================
1. Critical bug fixes - see list below
2. Complete documentation update

Open bugs:
========
bug_id	bug_severity	op_sys	assigned_to	description
1630	cri	RHEL	amirv at mellanox.co.il	sdp module fails to compile with gcc 3.4 on i386
(We already have a patch for 1630, but did not want to risk it in rc6.)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc6-fixed-bugs.csv
Type: application/octet-stream
Size: 814 bytes
Desc: ofed-1.4.1-rc6-fixed-bugs.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090521/a8732c87/attachment.obj>

From arlin.r.davis at intel.com  Thu May 21 09:16:34 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 21 May 2009 09:16:34 -0700
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
In-Reply-To: <f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>
References: <4A1541CE.9090509@mellanox.co.il>
	<f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>
Message-ID: <E3280858FA94444CA49D2BA02341C98352280848@orsmsx506.amr.corp.intel.com>

 
>> Sean Hi,
>>
>> I need to know how is the LocalAckTimeout for a new QP is 
>calculated (for an
>> SDP QP).
>
>It's a function of the packet life time on the path from source to
>destination and the local CA ack delay.
>
>> Is there a way to change it through a module parameter?
>> If not, what is the right way to change it?
>
>It depends on the SM being used. OpenSM has a way to change the path
>PLT returned/used.

You can actually modify the path record (rdma_cm_id.route->path_rec) 
before the rdma_connect, after the RDMA_CM_EVENT_ROUTE_RESOLVED event, 
and rdma_cm will pick up the change for the internal QP modify. 
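
A rough sketch of that sequence. The struct definitions here are simplified
stand-ins so the fragment compiles on its own; real code would use the
rdma_cm headers, where the path records hang off the id's route member. The
helper name and the value passed in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the rdma_cm types (the real ones have more fields). */
struct path_rec {
	unsigned char packet_life_time;	/* 6-bit exponent: 4.096 us * 2^n */
};

struct route {
	struct path_rec *path_rec;
	int num_paths;
};

struct cm_id {
	struct route route;
};

/*
 * Hypothetical helper: meant to be called after the
 * RDMA_CM_EVENT_ROUTE_RESOLVED event and before rdma_connect(), so that
 * connection setup sees the adjusted value.
 * Returns the number of path records updated.
 */
static int override_packet_life_time(struct cm_id *id, unsigned char plt)
{
	int i;

	assert(plt < 64);	/* the field is 6 bits wide */

	if (!id || !id->route.path_rec)
		return 0;

	for (i = 0; i < id->route.num_paths; i++)
		id->route.path_rec[i].packet_life_time = plt;

	return i;
}
```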

-arlin


From akepner at sgi.com  Thu May 21 12:39:10 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 21 May 2009 12:39:10 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A14E766.1010005@voltaire.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
Message-ID: <20090521193910.GX6837@sgi.com>

On Thu, May 21, 2009 at 08:32:22AM +0300, Or Gerlitz wrote:
> ....
> >Pid: 0, comm: swapper Tainted: G     U 2.6.16.54-0.2.5-smp 
> its very likely that the problem you face in 2.6.16 was fixed by the 
> commit I pointed on in my previous reply on this thread.
> 

Hmmm, it's not obvious to me that that commit 
(ecbb416939da77c0d107409976499724baddce7b) would be relevant 
to the bug that I mentioned earlier. 

-- 
Arthur


From amirv at mellanox.co.il  Thu May 21 12:54:04 2009
From: amirv at mellanox.co.il (Amir Vadai)
Date: Thu, 21 May 2009 22:54:04 +0300
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
References: <4A1541CE.9090509@mellanox.co.il>
	<f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com>

Hal Hi,
 
Assuming I am using OpenSM, how can I tell it to do that?
 
Thanks,
Amir

________________________________

From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
Sent: Thu 21-May-09 4:20 PM
To: Amir Vadai
Cc: Sean Hefty; Nimrod Gindi; OpenIB
Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP


On Thu, May 21, 2009 at 7:58 AM, Amir Vadai <amirv at mellanox.co.il> wrote:
> Sean Hi,
>
> I need to know how is the LocalAckTimeout for a new QP is calculated (for an
> SDP QP).

It's a function of the packet life time on the path from source to
destination and the local CA ack delay.

> Is there a way to change it through a module parameter?
> If not, what is the right way to change it?

It depends on the SM being used. OpenSM has a way to change the path
PLT returned/used.

-- Hal

> Thanks,
> Amir



From amirv at mellanox.co.il  Thu May 21 12:57:00 2009
From: amirv at mellanox.co.il (Amir Vadai)
Date: Thu, 21 May 2009 22:57:00 +0300
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
References: <4A1541CE.9090509@mellanox.co.il>
	<f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>
	<E3280858FA94444CA49D2BA02341C98352280848@orsmsx506.amr.corp.intel.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E809@mtlexch01.mtl.com>

For that I will need to access structures that are private to CMA.
I was hoping there is a way to do it in SDP code only or, preferably, from the environment.
 
- Amir

________________________________

From: Davis, Arlin R [mailto:arlin.r.davis at intel.com]
Sent: Thu 21-May-09 7:16 PM
To: Hal Rosenstock; Amir Vadai
Cc: Nimrod Gindi; OpenIB
Subject: RE: [ofa-general] Default value of LocalAckTimeout for a new QP


>> Sean Hi,
>>
>> I need to know how is the LocalAckTimeout for a new QP is
>calculated (for an
>> SDP QP).
>
>It's a function of the packet life time on the path from source to
>destination and the local CA ack delay.
>
>> Is there a way to change it through a module parameter?
>> If not, what is the right way to change it?
>
>It depends on the SM being used. OpenSM has a way to change the path
>PLT returned/used.

You can actually modify the path record (rdma_cm_id.route->path_rec)
before the rdma_connect, after the RDMA_CM_EVENT_ROUTE_RESOLVED event,
and rdma_cm will pick up the change for the internal QP modify.

-arlin



From hal.rosenstock at gmail.com  Thu May 21 13:01:36 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 21 May 2009 16:01:36 -0400
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com>
References: <4A1541CE.9090509@mellanox.co.il>
	<f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com>
	<5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com>
Message-ID: <f0e08f230905211301w7c1f7e13jd4af3bef1c055b84@mail.gmail.com>

On Thu, May 21, 2009 at 3:54 PM, Amir Vadai <amirv at mellanox.co.il> wrote:
> Hal Hi,
>
> Assuming I  am using OpenSM - how can I tell  it to do  it?

Assuming this is a non QoS configuration, you need to configure the
subnet_timeout.

-- Hal

>
> Thanks,
> Amir
> ________________________________
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Thu 21-May-09 4:20 PM
> To: Amir Vadai
> Cc: Sean Hefty; Nimrod Gindi; OpenIB
> Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP
>
> On Thu, May 21, 2009 at 7:58 AM, Amir Vadai <amirv at mellanox.co.il> wrote:
>> Sean Hi,
>>
>> I need to know how is the LocalAckTimeout for a new QP is calculated (for
>> an
>> SDP QP).
>
> It's a function of the packet life time on the path from source to
> destination and the local CA ack delay.
>
>> Is there a way to change it through a module parameter?
>> If not, what is the right way to change it?
>
> It depends on the SM being used. OpenSM has a way to change the path
> PLT returned/used.
>
> -- Hal
>
>> Thanks,
>> Amir
>


From amirv at mellanox.co.il  Thu May 21 13:04:32 2009
From: amirv at mellanox.co.il (Amir Vadai)
Date: Thu, 21 May 2009 23:04:32 +0300
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
References: <4A1541CE.9090509@mellanox.co.il><f0e08f230905210620x5485287btdd84aa0118aa592d@mail.gmail.com><5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com>
	<f0e08f230905211301w7c1f7e13jd4af3bef1c055b84@mail.gmail.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E80A@mtlexch01.mtl.com>

Thanks,
 
Will try it.
 
- Amir

________________________________

From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
Sent: Thu 21-May-09 11:01 PM
To: Amir Vadai
Cc: Sean Hefty; Nimrod Gindi; OpenIB
Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP


On Thu, May 21, 2009 at 3:54 PM, Amir Vadai <amirv at mellanox.co.il> wrote:
> Hal Hi,
>
> Assuming I  am using OpenSM - how can I tell  it to do  it?

Assuming this is a non QoS configuration, you need to configure the
subnet_timeout.

-- Hal

>
> Thanks,
> Amir
> ________________________________
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Thu 21-May-09 4:20 PM
> To: Amir Vadai
> Cc: Sean Hefty; Nimrod Gindi; OpenIB
> Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP
>
> On Thu, May 21, 2009 at 7:58 AM, Amir Vadai <amirv at mellanox.co.il> wrote:
>> Sean Hi,
>>
>> I need to know how is the LocalAckTimeout for a new QP is calculated (for
>> an
>> SDP QP).
>
> It's a function of the packet life time on the path from source to
> destination and the local CA ack delay.
>
>> Is there a way to change it through a module parameter?
>> If not, what is the right way to change it?
>
> It depends on the SM being used. OpenSM has a way to change the path
> PLT returned/used.
>
> -- Hal
>
>> Thanks,
>> Amir
>



From dave at thedillows.org  Thu May 21 13:34:15 2009
From: dave at thedillows.org (David Dillow)
Date: Thu, 21 May 2009 16:34:15 -0400
Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI
In-Reply-To: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com>
References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com>
Message-ID: <1242938055.4422.7.camel@obelisk.thedillows.org>

On Wed, 2009-05-20 at 21:31 +0500, Zafar Gilani wrote:
> This is my second message on the list. This one is exactly same as
> first one, my previous did not receive any replies. I will be thankful
> if anyone could point out the problem in the code files. Problem is
> explained below:

You sent several almost identical messages to this list in a very short
period of time, with a question that while not exactly "could you do my
homework for me?" is not too far removed. You then complain that you did
not get a response within 12 hours, 8+ hours of which the US contingent
of this list were likely asleep or otherwise away from their computers.
The rest of the list was likely trying to solve their own problems at
their workplace. This is a mostly volunteer effort; I don't think many
people, if any, get paid to monitor this list.

While people here are generally interested in helping people out, I
think your expectations may need to be adjusted a bit closer to reality.


From akepner at sgi.com  Thu May 21 14:00:49 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 21 May 2009 14:00:49 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A13ADDA.5040908@Voltaire.com>
References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com>
Message-ID: <20090521210049.GY6837@sgi.com>


As a recap to this thrilling thread, I'm tracking down a panic with 
a backtrace like this:

     ib_ipoib:ipoib_neigh_cleanup+368
     ....
     neigh_periodic_timer+0
     run_timer_softirq+348
     __do_softirq+85
     call_softirq+30
     do_softirq+44
     .....

And the following helpful hint:
Unable to handle kernel paging request at 0000000000100108
                                          ^^^^^^^^^^^^^^^^
                                          LIST_POISON1+0x8

So, we're in ipoib_neigh_cleanup(), doing the list_del():

static void ipoib_neigh_cleanup(struct neighbour *n)
{
	.......
	neigh = *to_ipoib_neigh(n);
	.....
	spin_lock_irqsave(&priv->lock, flags);
	if (neigh->ah)
		ah = neigh->ah;
	list_del(&neigh->list);
	ipoib_neigh_free(n->dev, neigh);

	spin_unlock_irqrestore(&priv->lock, flags);
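
To spell out the arithmetic behind the LIST_POISON1+0x8 hint, here is a small
user-space sketch. It assumes the poison values from the 2.6-era
include/linux/poison.h (0x00100100 and 0x00200200) and uses a mimic of
struct list_head; poisoned_write_target() is a made-up helper name, not
kernel code. list_del() leaves entry->next set to LIST_POISON1, so a second
list_del() on the same entry begins with a store to next->prev, that is,
LIST_POISON1 plus the offset of the prev member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mimic of the kernel's struct list_head: two pointers, next then prev. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/* Assumed values, as defined in 2.6-era include/linux/poison.h. */
#define LIST_POISON1 ((uintptr_t)0x00100100)
#define LIST_POISON2 ((uintptr_t)0x00200200)

/*
 * Hypothetical helper: the address hit by the first store of a second
 * list_del() on an already deleted entry.  That store is
 * "next->prev = prev" with next == LIST_POISON1.
 */
static uintptr_t poisoned_write_target(void)
{
	/* The layout assumption the arithmetic relies on. */
	assert(offsetof(struct list_head, next) == 0);

	return LIST_POISON1 + offsetof(struct list_head, prev);
}
```

With 8-byte pointers the result is 0x100108, which matches the faulting
address in the oops and is consistent with list_del() running twice on the
same ipoib_neigh.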


This has been practically impossible to reproduce (and I don't 
even have the original crashdump available any longer). 

What would prevent a race between a tx completion (with an 
error) and the cleanup of a neighbour? In that case the tx 
completion handler could do:

ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
	........
        if (wc->status != IB_WC_SUCCESS &&
            wc->status != IB_WC_WR_FLUSH_ERR) {
		.....
		spin_lock_irqsave(&priv->lock, flags);
		neigh = tx->neigh;

		if (neigh) {
 			neigh->cm = NULL;
			list_del(&neigh->list);
		.......
		spin_unlock_irqrestore(&priv->lock, flags);

While ipoib_neigh_cleanup() could grab the (now stale) neigh, and 
crash like above.

(I've tried simulating tx completion failures to trigger this 
behavior, but haven't gotten lucky yet....)

-- 
Arthur


From rdreier at cisco.com  Thu May 21 15:22:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 21 May 2009 15:22:48 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
	(Arputham Benjamin's message of "Wed, 20 May 2009 18:54:05 -0500")
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
Message-ID: <ada8wkqayfb.fsf@cisco.com>

 > I wanted to add some clarification.
 > 
 > We have two types of IB devices:
 > 1)Devices that can operate as an InfiniBand adapter only
 > 2)Devices that can operate as an InfiniBand adapter or as an Ethernet NIC
 > 
 > As per the current implementation of OFED stack, the driver architecture
 > of #2 is very different from #1 because it needs to make sure InfiniBand
 > and Ethernet functions can share the device without interfering with
 > each other.
 > 
 > I was thinking that we can fix /proc/interrupts issue for case#1 first
 > and worry about #2 later because the design to fix /proc/interrupts 
 > for mlx4 case is going to be different and independent just as the
 > driver design is different and independent for the two cases today.

I disagree.  A verbs consumer of mlx4 doesn't have to worry about the
internal design of the driver being different from mthca, and I would
hope that carries over to identifying interrupts.  It's much better for
users if we can just come up with a solution that handles both of your
cases at once, rather than an ad hoc solution for a subset of drivers.

 > Please let me know if you still think we need a common solution for
 > both cases mentioned above. Any suggestions at a high level for such
 > a common solution?

I already suggested adding MSI-X vector information to
/sys/devices/... to match the existing "irq" file there.  That would
allow userspace to figure out which interrupt belonged where.  Jason's
idea of adding the PCI device name to the interrupt name seems viable to
me as well.

 - R.


From rdreier at cisco.com  Thu May 21 15:33:19 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 21 May 2009 15:33:19 -0700
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: <BAY139-W216D23BBED1EB689F563D5AE590@phx.gbl> (anthony garnier's
	message of "Thu, 21 May 2009 12:05:13 +0000")
References: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>
	<adaoctnbiju.fsf@cisco.com>
	<BAY139-W216D23BBED1EB689F563D5AE590@phx.gbl>
Message-ID: <ada4oveaxxs.fsf@cisco.com>

 > I have also tried to pass through the PCI bridge, but it doesn't work. I got this from dmesg on my Dom0:
 > 
 > [ 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0     ( this is  HCA)
 > [ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1    ( This is eth1)
 > [ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2     ( this is the HCA bridge)
 > [ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2. This may be harmless, but if you have problems with your device:
 > [ 1298.655747] 1) see permissive attribute in sysfs
 > [ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.

Yes, the driver does set the PCI bridge back up after resetting the HCA
(to restore all the configuration values that are lost during reset).
So you need to set up Xen so that the driver is allowed to restore the
HCA and PCI bridge config space.

Also it's not clear to me from these messages whether the HCA is put
into the domU PCI topology as being under the PCI bridge -- that is
probably required for the driver to work.

 - R.


From abenjamin at sgi.com  Thu May 21 16:23:17 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Thu, 21 May 2009 18:23:17 -0500
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards
	in /proc/interrupts
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
	<ada8wkqayfb.fsf@cisco.com>
Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>

> I disagree.  A verbs consumer of mlx4 doesn't have to worry about the
> internal design of the driver being different from mthca, and I would
> hope that carries over to identifying interrupts.  It's much better for
> users if we can just come up with a solution that handles both of your
> cases at once, rather than an ad hoc solution for a subset of drivers.

I was not suggesting that we change the interface to the verbs consumer/user
or how we present the interrupt info to the user between mlx4 and mthca.
I agree that it's much better if we can just come up with a solution
that handles both.

Any plan to merge the functionality of ib_core and mlx4_core into something
like 'ofa_core' that will control resource allocation for both Infiniband 
and Ethernet functions? A single core will help in any similar resource issues.

> I already suggested adding MSI-X vector information to
> /sys/devices/... to match the existing "irq" file there.  That would
> allow userspace to figure out which interrupt belonged where.  Jason's
> idea of adding the PCI device name to the interrupt name seems viable to
> me as well.

>  - R.

Don't we need both /sys/devices/... and /proc/interrupts?

Regards,
Benjamin



From rdreier at cisco.com  Thu May 21 17:05:05 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 21 May 2009 17:05:05 -0700
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
	(Arputham Benjamin's message of "Thu, 21 May 2009 18:23:17 -0500")
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
	<ada8wkqayfb.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
Message-ID: <adazld69f4e.fsf@cisco.com>

 > Any plan to merge the functionality of ib_core and mlx4_core into something
 > like 'ofa_core' that will control resource allocation for both Infiniband 
 > and Ethernet functions? A single core will help in any similar resource issues.

No, since they do pretty different things.

 > Don't we need both /sys/devices/... and /proc/interrupts?

Not sure what you mean.  If we put msi-x info under /sys, then you can
figure out which interrupts belong to a given HCA by following the
device link from /sys/class/infiniband.  Similarly if /proc/interrupts
gives the PCI device, then you have the same ability.  So either way
works as far as I can tell.
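
As a rough illustration of the sysfs route, the sketch below builds the path
to the existing "irq" file by following the device link under
/sys/class/infiniband and reads the number from it. The helper names
(hca_irq_path, read_int_attr) and the example HCA name are made up for
illustration. Only the legacy "irq" attribute is read here; the MSI-X
listing discussed above is a proposal in this thread, so nothing is assumed
about its layout.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "<class_root>/<hca>/device/irq" into buf.
 * class_root would normally be "/sys/class/infiniband"; "device" is the
 * link from the class device to its PCI device directory.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int hca_irq_path(char *buf, size_t len,
			const char *class_root, const char *hca)
{
	int n;

	assert(buf && class_root && hca);

	n = snprintf(buf, len, "%s/%s/device/irq", class_root, hca);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: read one decimal integer from a sysfs-style
 * attribute file.  Returns 0 on success, -1 on any failure.
 */
static int read_int_attr(const char *path, int *val)
{
	FILE *f;
	int ok;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would format the path for something like "mthca0", pass it to
read_int_attr(), and then look that number up in /proc/interrupts.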


From zafargilani at gmail.com  Thu May 21 21:36:00 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Fri, 22 May 2009 09:36:00 +0500
Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI
In-Reply-To: <1242938055.4422.7.camel@obelisk.thedillows.org>
References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com>
	<1242938055.4422.7.camel@obelisk.thedillows.org>
Message-ID: <7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com>

> You sent several almost identical messages to this list in a very short
> period of time,

First of all I apologize for sending more than one message, but the reason
is that I did not see any list of discussions on the lists.openfabrics.org,
just a digest sent daily, so I wasn't sure whether anybody will have a look
at it or not. Apart from that I think you have misunderstood my email. When
I said "This is my second message on the list. This one is exactly same as
first one, my previous did not receive any replies.", it did not imply that
you people are supposed to reply; it was to ensure that anybody who had
previously seen my first message would not waste his/her time on this one.
Another reason was to see my message in the daily general digest, since it
is easy to miss a message in such a list. It would be my mistake if somebody
could have helped me but did not see my problem.

> with a question that while not exactly "could you do my
> homework for me?" is not too far removed.

Secondly, I am not asking you or anybody to do my homework. I have explained
in the email what I have already tried, and asked for help so that
somebody could give me ideas or pointers in the right direction, since
there must be highly experienced people on the list. The reason for giving
the code was to provide better context on the problem to anyone willing to
spend 10-15 minutes on it.

> You then complain that you did
> not get a response within 12 hours, 8+ hours of which the US contingent
> of this list were likely asleep or otherwise away from their computers.

Could you point out the part in my email where I am complaining explicitly
or implicitly? Secondly is everyone from the US on this list? Last time I
remember internet did not have territorial jurisdiction.

Regards,
Zafar
<http://hpc.niit.edu.pk/%7Ezafar>

From jgunthorpe at obsidianresearch.com  Thu May 21 21:46:53 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 21 May 2009 22:46:53 -0600
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
	<ada8wkqayfb.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
Message-ID: <20090522044653.GF11690@obsidianresearch.com>

On Thu, May 21, 2009 at 06:23:17PM -0500, Arputham Benjamin wrote:

> > I already suggested adding MSI-X vector information to
> > /sys/devices/... to match the existing "irq" file there.  That would
> > allow userspace to figure out which interrupt belonged where.  Jason's
> > idea of adding the PCI device name to the interrupt name seems viable to
> > me as well.

> Don't we need both /sys/devices/... and /proc/interrupts?

You don't need the device name in /proc/interrupts, that is just for
easy use of cat.

FWIW, this problem is not really an IB problem, but more a Linux
problem: there should be a better interface for matching MSI vectors
to the PCI device and to the counters. Fixing up the MSI vector routines
in PCI core to note the vector numbers in sysfs would help everyone.

Jason


From dave at thedillows.org  Thu May 21 23:05:50 2009
From: dave at thedillows.org (David Dillow)
Date: Fri, 22 May 2009 02:05:50 -0400
Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI
In-Reply-To: <7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com>
References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com>
	<1242938055.4422.7.camel@obelisk.thedillows.org>
	<7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com>
Message-ID: <1242972350.4422.38.camel@obelisk.thedillows.org>

On Fri, 2009-05-22 at 09:36 +0500, Zafar Gilani wrote:
> First of all I apologize for sending more than one message, but the
> reason is that I did not see any list of discussions on the
> lists.openfabrics.org, just a digest sent daily, so I wasn't sure
> whether anybody will have a look at it or not. Apart from that I think
> you have misunderstood my email.

Nice backpedaling, but
1) The list is archived at lists.openfabrics.org, and it is hard to miss
the link.
2) The archive has a copy of your message to me, within 90 minutes of
you sending it.
3) This is not a high-volume list, and "my message might be missed"
really doesn't hold water.

> > You then complain that you did
> > not get a response within 12 hours, 8+ hours of which the US
> contingent
> > of this list were likely asleep or otherwise away from their
> computers.
> 
> Could you point out the part in my email where I am complaining
> explicitly or implicitly?

"No one answered my email" in 12 hours is generally considered a whine.

> Secondly is everyone from the US on this list? Last time I remember
> internet did not have territorial jurisdiction.

No, only a very small percentage of the US population is on this list.
I'd be mildly surprised if less than half of the list is based in the
same timezones as the US, though.

I pointed out that a good part of the list membership was asleep, and the
rest were working on their day jobs. You seem to have cut that part.

In any event, if the code works natively and not inside the JVM, I would
suggest investigating what is different in the broken environment. You
seem to have forgotten the error log in the tarball, nor did you state
which side crashes -- client, server, or both. You also don't specify
which OFED version you are running. This makes it hard for the people
that actually want to help you.


From vlad at lists.openfabrics.org  Fri May 22 03:22:20 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 22 May 2009 03:22:20 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090522-0200 daily build status
Message-ID: <20090522102220.BA12BE613EC@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From yossi.openib at gmail.com  Fri May 22 03:24:17 2009
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Fri, 22 May 2009 13:24:17 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090521193910.GX6837@sgi.com>
References: <20090519215505.GN6837@sgi.com>
	<adaws8bbs55.fsf@cisco.com>	<20090520213703.GT6837@sgi.com>
	<4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com>
Message-ID: <4A167D51.1060607@gmail.com>

akepner at sgi.com wrote:
> On Thu, May 21, 2009 at 08:32:22AM +0300, Or Gerlitz wrote:
>> ....
>>> Pid: 0, comm: swapper Tainted: G     U 2.6.16.54-0.2.5-smp 
>> it's very likely that the problem you face in 2.6.16 was fixed by the 
>> commit I pointed to in my previous reply on this thread.
>>
> 
> Hmmm, it's not obvious to me that that commit 
> (ecbb416939da77c0d107409976499724baddce7b) would be relevant 
> to the bug that I mentioned earlier. 
> 

So, ipoib tries to list_del(neigh) twice because the second time
the condition (neigh != NULL) is not protected with a lock.
How about this:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index ab2c192..993b5a7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
 
+	spin_lock_irqsave(&priv->lock, flags);
+
 	neigh = *to_ipoib_neigh(n);
 	if (neigh)
 		priv = netdev_priv(neigh->dev);
 	else
-		return;
+		goto out;
 	ipoib_dbg(priv,
 		  "neigh_cleanup for %06x %pI6\n",
 		  IPOIB_QPN(n->ha),
 		  n->ha + 4);
 
-	spin_lock_irqsave(&priv->lock, flags);
-
 	if (neigh->ah)
 		ah = neigh->ah;
 	list_del(&neigh->list);
 	ipoib_neigh_free(n->dev, neigh);
 
+out:
 	spin_unlock_irqrestore(&priv->lock, flags);
 
 	if (ah)


From sokar6012 at hotmail.com  Fri May 22 04:34:29 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Fri, 22 May 2009 11:34:29 +0000
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: <ada4oveaxxs.fsf@cisco.com>
References: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>
	<adaoctnbiju.fsf@cisco.com>
	<BAY139-W216D23BBED1EB689F563D5AE590@phx.gbl> 
	<ada4oveaxxs.fsf@cisco.com>
Message-ID: <BAY139-W11E3D9D2B56A1BD0519E0CAE560@phx.gbl>


Hi,
 You told me " you need to set up Xen so that the driver is allowed to restore the HCA and PCI bridge config space."
But how can I set up xen to allow the driver to restore the HCA and pci config space?

> From: rdreier at cisco.com
> To: sokar6012 at hotmail.com
> CC: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Infiniband with Xen
> Date: Thu, 21 May 2009 15:33:19 -0700
> 
>  > I have also tried to pass through the PCI bridge, but it doesn't work. I got this from dmesg on my Dom0:
>  > 
>  > [ 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0     ( this is  HCA)
>  > [ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1    ( This is eth1)
>  > [ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2     ( this is the HCA bridge)
>  > [ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2. This may be harmless, but if you have problems with your device:
>  > [ 1298.655747] 1) see permissive attribute in sysfs
>  > [ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.
> 
> Yes, the driver does set the PCI bridge back up after resetting the HCA
> (to restore all the configuration values that are lost during reset).
> So you need to set up Xen so that the driver is allowed to restore the
> HCA and PCI bridge config space.
> 
> Also it's not clear to me from these messages whether the HCA is put
> into the domU PCI topology as being under the PCI bridge -- that is
> probably required for the driver to work.
> 
>  - R.


From hnrose at comcast.net  Fri May 22 04:42:34 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 22 May 2009 07:42:34 -0400
Subject: [ofa-general] [PATCH 1/2] [TRIVIAL] opensm/osm_ucast_lash.c: Fix
	commentary typo
Message-ID: <20090522114234.GB29953@comcast.net>


Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index fa8e7e9..e034d6f 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -94,7 +94,7 @@ static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1)
 		if (sw1 == sw2)
 			return;
 
-		/* see if we are alredy linked to sw2 */
+		/* see if we are already linked to sw2 */
 		for (i = 0; i < num; i++) {
 			l = node->links[i];
 

From hnrose at comcast.net  Fri May 22 04:41:10 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 22 May 2009 07:41:10 -0400
Subject: [ofa-general] [PATCH] opensm/osm_mesh.c: Use define rather than hard
	coded constant
Message-ID: <20090522114110.GA29953@comcast.net>


Add LARGE define and use it

Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 263d29e..1867876 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2008      System Fabric Works, Inc.
+ * Copyright (c) 2008,2009      System Fabric Works, Inc.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -50,6 +50,7 @@
 
 #define MAX_DEGREE	(8)
 #define MAX_DIMENSION	(8)
+#define LARGE		(0x7fffffff)
 
 /*
  * characteristic polynomials for selected 1d through 8d tori
@@ -594,7 +595,7 @@ static int get_switch_metric(lash_t *p_lash, int sw)
 
 			/* make all distances big except s1 to itself */
 			for (sw2 = 0; sw2 < p_lash->num_switches; sw2++)
-				p_lash->switches[sw2]->node->temp = 0x7fffffff;
+				p_lash->switches[sw2]->node->temp = LARGE;
 
 			s1->node->temp = 0;
 
@@ -603,7 +604,7 @@ static int get_switch_metric(lash_t *p_lash, int sw)
 
 				for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) {
 					s2 = p_lash->switches[sw2];
-					if (s2->node->temp == 0x7fffffff)
+					if (s2->node->temp == LARGE)
 						continue;
 					for (j = 0; j < s2->node->num_links; j++) {
 						sw3 = s2->node->links[j]->switch_id;
@@ -1120,7 +1121,7 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed)
 
 		s->node->coord = calloc(dimension, sizeof(int));
 		for (i = 0; i < dimension; i++)
-			s->node->coord[i] = (sw == seed)? 0 : 0x7fffffff;
+			s->node->coord[i] = (sw == seed) ? 0 : LARGE;
 
 		for (i = 0; i < s->node->num_links; i++)
 			if (s->node->axes[i] == 0)
@@ -1137,7 +1138,7 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed)
 		for (sw = 0; sw < num_switches; sw++) {
 			s = p_lash->switches[sw];
 
-			if (s->node->coord[0] == 0x7fffffff)
+			if (s->node->coord[0] == LARGE)
 				continue;
 
 			for (j = 0; j < s->node->num_links; j++) {
@@ -1172,15 +1173,15 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed)
 	mesh->size = calloc(dimension, sizeof(int));
 
 	for (i = 0; i < dimension; i++) {
-		max[i] = -0x7fffffff;
-		min[i] = 0x7fffffff;
+		max[i] = -LARGE;
+		min[i] = LARGE;
 	}
 
 	for (sw = 0; sw < num_switches; sw++) {
 		s = p_lash->switches[sw];
 
 		for (i = 0; i < dimension; i++) {
-			if (s->node->coord[i] == 0x7fffffff)
+			if (s->node->coord[i] == LARGE)
 				continue;
 			if (s->node->coord[i] > max[i])
 				max[i] = s->node->coord[i];


From hnrose at comcast.net  Fri May 22 04:43:46 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 22 May 2009 07:43:46 -0400
Subject: [ofa-general] [PATCH 2/2] opensm/osm_ucast_lash.c: Use calloc rather
	than malloc/memset
Message-ID: <20090522114346.GC29953@comcast.net>


Signed-off-by: Robert Pearson <rpearson at systemfabricworks.com>
Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index e034d6f..a987eb3 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -62,9 +62,9 @@ typedef struct _reachable_dest {
 
 static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
 {
-	cdg_vertex_t *v = malloc(sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
+	cdg_vertex_t *v;
 
-	memset(v, 0, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
+	v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0]));
 
 	return v;
 }
@@ -838,13 +838,12 @@ static int lash_core(lash_t * p_lash)
 		}
 	}
 
-	switch_bitmap = malloc(num_switches * num_switches * sizeof(int));
+	switch_bitmap = calloc(num_switches * num_switches, sizeof(int));
 	if (!switch_bitmap) {
 		OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D04: "
 			"Failed allocating switch_bitmap - out of memory\n");
 		goto Exit;
 	}
-	memset(switch_bitmap, 0, num_switches * num_switches * sizeof(int));
 
 	for (i = 0; i < num_switches; i++) {
 		for (dest_switch = 0; dest_switch < num_switches; dest_switch++)
@@ -1145,10 +1144,9 @@ static int discover_network_properties(lash_t * p_lash)
 
 	p_lash->num_switches = cl_qmap_count(&p_subn->sw_guid_tbl);
 
-	p_lash->switches = malloc(p_lash->num_switches * sizeof(switch_t *));
+	p_lash->switches = calloc(p_lash->num_switches, sizeof(switch_t *));
 	if (!p_lash->switches)
 		return -1;
-	memset(p_lash->switches, 0, p_lash->num_switches * sizeof(switch_t *));
 
 	vl_min = 5;		/* set to a high value */
 
@@ -1251,11 +1249,10 @@ static lash_t *lash_create(osm_opensm_t * p_osm)
 {
 	lash_t *p_lash;
 
-	p_lash = malloc(sizeof(lash_t));
+	p_lash = calloc(1, sizeof(lash_t));
 	if (!p_lash)
 		return NULL;
 
-	memset(p_lash, 0, sizeof(lash_t));
 	p_lash->p_osm = p_osm;
 
 	return (p_lash);


From zafargilani at gmail.com  Fri May 22 08:04:01 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Fri, 22 May 2009 20:04:01 +0500
Subject: [ofa-general] Problem in client/server code with JNI
Message-ID: <7d4423d30905220804p596a139aq351d8fc956aa32c4@mail.gmail.com>

I am using IB Verbs and RDMA CM to implement a communication device over
an InfiniBand fabric. I have run client/server code (mostly from Roland
Dreier, Cisco) and it works absolutely fine. The server listens for
requests, the client sends two integers to the server, and the server
returns their sum.

When I try to call the same thing via JNI, the code gets stuck at method
"ibv_alloc_pd()" (line 170) in the client code (nativeclient.c). I have
checked the rdma_cm_id "cm_id", the ibv_context "cm_id->verbs" and the
protection domain "ibv_pd" but I am unable to resolve the error.

The JVM actually crashes, reporting that the error is at frame "ibv_alloc_pd"
and that the crash happened in native code. That much is understandable, but
the error itself is not, since the same code works when built and run
directly as a C program but fails under JNI.

Compilers:
   java version "1.6.0_07"
   Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
   Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

   gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Environment:
   Red Hat 4.12
   2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz

   OFED version 1.4

A compressed file (jni.tar) that contains all the files (.java, .h, .c and
.log) is available at
http://hpc.niit.edu.pk/~zafar/work/ib/jni.tar
for better understanding. I was hoping that someone could maybe give me
some pointers/suggestions in the right direction.

Any help will be greatly appreciated.

P.S.: Structure of the code:
nativeclient.c [native code for client]
nativeserver.c [native code for server]
RdmaOpsServer.java [Server code calling native server code]
RdmaOpsClient.java [Client code calling native client code]

Thanks,
-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar

From akepner at sgi.com  Fri May 22 08:52:08 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 22 May 2009 08:52:08 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A167D51.1060607@gmail.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
Message-ID: <20090522155208.GB6837@sgi.com>

On Fri, May 22, 2009 at 01:24:17PM +0300, Yossi Etigin wrote:
> ...
> So, ipoib tries to list_del(neigh) twice because the second time
> the condition (neigh != NULL) is not protected with a lock.
> How about this:
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index ab2c192..993b5a7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
>  	unsigned long flags;
>  	struct ipoib_ah *ah = NULL;
>  
> +	spin_lock_irqsave(&priv->lock, flags); <-- deadlock here
> +
>  	neigh = *to_ipoib_neigh(n);
>  	if (neigh)
>  		priv = netdev_priv(neigh->dev);
>  	else
> -		return;
> +		goto out;
>  	ipoib_dbg(priv,
>  		  "neigh_cleanup for %06x %pI6\n",
>  		  IPOIB_QPN(n->ha),
>  		  n->ha + 4);
>  
> -	spin_lock_irqsave(&priv->lock, flags);
> -
>  	if (neigh->ah)
>  		ah = neigh->ah;
>  	list_del(&neigh->list);
>  	ipoib_neigh_free(n->dev, neigh);
>  
> +out:
>  	spin_unlock_irqrestore(&priv->lock, flags);
>  
>  	if (ah)
> 

This is essentially what I did first time around, but a deadlock on 
the line marked above was quickly found. 

Instead what we've been doing is:


--- e/ofa_kernel-1.3.1/drivers/infiniband/ulp/ipoib/ipoib_main.c        2008-06-06 12:04:20.791744390 -0700
+++ f/ofa_kernel-1.3.1/drivers/infiniband/ulp/ipoib/ipoib_main.c        2008-06-06 12:10:14.129143660 -0700
@@ -835,11 +835,14 @@ static void ipoib_neigh_cleanup(struct n
		IPOIB_GID_RAW_ARG(n->ha + 4));

	spin_lock_irqsave(&priv->lock, flags);
-
-	if (neigh->ah)
-		ah = neigh->ah;
-	list_del(&neigh->list);
-	ipoib_neigh_free(n->dev, neigh);
+
+	neigh = *to_ipoib_neigh(n);
+	if (neigh) {
+		if (neigh->ah)
+			ah = neigh->ah;
+		list_del(&neigh->list);
+		ipoib_neigh_free(n->dev, neigh);
+	}

	spin_unlock_irqrestore(&priv->lock, flags);


This has worked in practice, but it obviously leaves a small hole 
open.

-- 
Arthur


From rdreier at cisco.com  Fri May 22 10:25:21 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 22 May 2009 10:25:21 -0700
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: <BAY139-W11E3D9D2B56A1BD0519E0CAE560@phx.gbl> (anthony garnier's
	message of "Fri, 22 May 2009 11:34:29 +0000")
References: <BAY139-W21E7902C852FC8E202B60EAE580@phx.gbl>
	<adaoctnbiju.fsf@cisco.com>
	<BAY139-W216D23BBED1EB689F563D5AE590@phx.gbl>
	<ada4oveaxxs.fsf@cisco.com>
	<BAY139-W11E3D9D2B56A1BD0519E0CAE560@phx.gbl>
Message-ID: <adar5yh9hj2.fsf@cisco.com>


 >  You told me " you need to set up Xen so that the driver is allowed to restore the HCA and PCI bridge config space."
 > But how can I set up xen to allow the driver to restore the HCA and pci config space?

No idea -- I've never used xen pci-passthrough.  But this line in your
log might be a clue:

>  > [ 1298.655747] 1) see permissive attribute in sysfs


From yossi.openib at gmail.com  Fri May 22 10:34:57 2009
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Fri, 22 May 2009 20:34:57 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090522155208.GB6837@sgi.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com>
Message-ID: <4A16E241.1050604@gmail.com>

akepner at sgi.com wrote:
> On Fri, May 22, 2009 at 01:24:17PM +0300, Yossi Etigin wrote:
>> ...
>> So, ipoib tries to list_del(neigh) twice because the second time
>> the condition (neigh != NULL) is not protected with a lock.
>> How about this:
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> index ab2c192..993b5a7 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> @@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
>>  	unsigned long flags;
>>  	struct ipoib_ah *ah = NULL;
>>  
>> +	spin_lock_irqsave(&priv->lock, flags); <-- deadlock here
>> +
>>  	neigh = *to_ipoib_neigh(n);
>>  	if (neigh)
>>  		priv = netdev_priv(neigh->dev);
>>  	else
>> -		return;
>> +		goto out;
>>  	ipoib_dbg(priv,
>>  		  "neigh_cleanup for %06x %pI6\n",
>>  		  IPOIB_QPN(n->ha),
>>  		  n->ha + 4);
>>  
>> -	spin_lock_irqsave(&priv->lock, flags);
>> -
>>  	if (neigh->ah)
>>  		ah = neigh->ah;
>>  	list_del(&neigh->list);
>>  	ipoib_neigh_free(n->dev, neigh);
>>  
>> +out:
>>  	spin_unlock_irqrestore(&priv->lock, flags);
>>  
>>  	if (ah)
>>
> 
> This is essentially what I did first time around, but a deadlock on 
> the line marked above was quickly found. 
> 

Interesting... what does it deadlock with?
And what is the hole your fix leaves? If the (neigh!=NULL) check passes
with the spinlock held, shouldn't it be OK to list_del() it?

--Yossi


From valdes at anl.gov  Fri May 22 11:40:06 2009
From: valdes at anl.gov (John Valdes)
Date: Fri, 22 May 2009 13:40:06 -0500
Subject: [ofa-general] SRP on RHEL 5.3/OFED 1.3 vs RHEL 5.1/OFED 1.2?
Message-ID: <20090522184006.GE26282@starfish.mcs.anl.gov>

Hi all,

We have a storage array (a DDN 9550) attached to 8 servers via IB.
This setup has been running fine for the last 1.5 years or so, with
the servers running RHEL 5.1 and the OFED (OpenIB) 1.2 stack that's
included with RHEL 5.1.

Recently, we tried to upgrade to new servers running RHEL 5.3 with
its bundled OFED 1.3 stack, but now we're seeing frequent timeouts
resulting in LUN resets and SCSI command aborts between the servers
and the DDN.  As far as we can tell, our IB setup on the servers under
5.3 is identical to the setup under 5.1, so we don't know why we're
seeing the timeouts and resets.  

Is anyone aware of any changes when using IB SRP w/ RHEL 5.3 and OFED
1.3 vs RHEL 5.1/OFED 1.2 which might be causing this?

For reference, here are some of the details of our setup:

OLD CONFIGURATION
-----------------
* SuperMicro P4DP6 motherboard, w/ dual Xeon CPUs (x86, single core
  "Prestonia"), all circa 2002 hardware
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X,
  128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN)
* RHEL 5.1 w/ bundled OFED/OpenIB 1.2
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as:
  "srp_daemon -coe" with options "max_sect=8192,max_cmd_per_lun=5"
  given in /etc/srp_daemon.conf (Note that due to a bug in the OFED
  1.2 srp_daemon, the "max_sect=8192" option is ignored, which is OK
  since we weren't taking advantage of that option).
* 7 DDN LUNs are accessed by all 8 servers as clustered logical
  volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN
  per server) as a plain disk holding an ext3 filesystem.

NEW CONFIGURATION
-----------------
* SuperMicro H8DME-2 motherboard, w/ dual quad-core AMD Opteron 2342, x86_64
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X,
  128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN)
  --same card as in old configuration, physically moved to new servers
* RHEL 5.3 w/ bundled OFED/OpenIB 1.3
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as:
  "srp_daemon -coe -f /etc/ofed/srp_daemon.conf" with options
  "max_sect=8192,max_cmd_per_lun=5" given in srp_daemon.conf
* 7 DDN LUNs are accessed by all 8 servers as clustered logical
  volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN
  per server) as a plain disk holding an ext3 filesystem.


With the new configuration, timeouts/resets have frequently occurred
when starting up CLVM on the servers (eg, when the servers scan the
LUNs looking for the Linux (clustered) LVM data) as well as when doing
I/O to the mounted filesystems.  Just to make sure the CLVM/GFS setup
wasn't causing problems, we tested the plain ext3 filesystem on the
non-shared LUN from one of the new servers, and when doing a simple
"dd" to the LUN, we were still seeing timeouts and LUN resets.

Does any of this sound familiar to anyone?  Do you have a recommended
IB/SRP setup for RHEL 5.3?

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory


From akepner at sgi.com  Fri May 22 11:44:03 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 22 May 2009 11:44:03 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A16E241.1050604@gmail.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com>
Message-ID: <20090522184403.GF6837@sgi.com>

On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote:
> ...
> Interesting... what does it deadlock with?
> And what is the hole your fix leaves? If the (neigh!=NULL) check passes
> with the spinlock held, shouldn't it be OK to list_del() it?
> 

Unfortunately, I don't have enough information to answer that 
question any longer (it's an old, closed bug). But the crash dump 
showed a hang like this:

PID: 8643   TASK: ffff810130f060c0  CPU: 3   COMMAND: "sshd"
.......
 #3 [ffff81013b3d7ea0] .text.lock.spinlock at ffffffff802ea2df (via _spin_lock_irqsave)
 #4 [ffff81013b3d7ea0] ipoib_neigh_cleanup at ffffffff883f8972
 #5 [ffff81013b3d7ed0] neigh_destroy at ffffffff8029011c

crash> dis ipoib_neigh_cleanup
0xffffffff883f8952 <ipoib_neigh_cleanup>:       push   %r13
0xffffffff883f8954 <ipoib_neigh_cleanup+2>:     push   %r12
0xffffffff883f8956 <ipoib_neigh_cleanup+4>:     push   %rbp
0xffffffff883f8957 <ipoib_neigh_cleanup+5>:     mov    %rdi,%rbp
0xffffffff883f895a <ipoib_neigh_cleanup+8>:     push   %rbx
0xffffffff883f895b <ipoib_neigh_cleanup+9>:     sub    $0x8,%rsp
0xffffffff883f895f <ipoib_neigh_cleanup+13>:    mov    0x18(%rdi),%rax
0xffffffff883f8963 <ipoib_neigh_cleanup+17>:    lea    0x500(%rax),%r12
0xffffffff883f896a <ipoib_neigh_cleanup+24>:    mov    %r12,%rdi
0xffffffff883f896d <ipoib_neigh_cleanup+27>:    callq  0xffffffff802ea1a5 <_spin_lock_irqsave>
0xffffffff883f8972 <ipoib_neigh_cleanup+32>:    mov    %rax,%rsi 
.....

and here is the patch to ipoib_neigh_cleanup() that was in use:

diff -rup c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c  2008-04-22 13:25:23.131563415 -0700
+++ d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c  2008-04-22 15:24:31.475721847 -0700
@@ -821,11 +821,15 @@ static void ipoib_neigh_cleanup(struct n
        unsigned long flags;
        struct ipoib_ah *ah = NULL;

+       spin_lock_irqsave(&priv->lock, flags);
        neigh = *to_ipoib_neigh(n);
+       spin_unlock_irqrestore(&priv->lock, flags);
        if (neigh) {
-               priv = netdev_priv(neigh->dev);
-               ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n",
-                         n->dev->name);
+               if (priv != netdev_priv(neigh->dev)) {
+                       ipoib_dbg(priv, "neigh_destructor for bonding device: "
+                               "%s\n", n->dev->name);
+                       priv = netdev_priv(neigh->dev);
+               }
        } else
                return;
        ipoib_dbg(priv,
@@ -835,10 +839,13 @@ static void ipoib_neigh_cleanup(struct n

        spin_lock_irqsave(&priv->lock, flags);

-       if (neigh->ah)
-               ah = neigh->ah;
-       list_del(&neigh->list);
-       ipoib_neigh_free(n->dev, neigh);
+       neigh = *to_ipoib_neigh(n);
+       if (neigh) {
+               if (neigh->ah)
+                       ah = neigh->ah;
+               list_del(&neigh->list);
+               ipoib_neigh_free(n->dev, neigh);
+       }

        spin_unlock_irqrestore(&priv->lock, flags);

I'll see if I can reproduce the deadlock (with a new kernel).

-- 
Arthur


From akepner at sgi.com  Fri May 22 13:44:45 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 22 May 2009 13:44:45 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A16E241.1050604@gmail.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com>
Message-ID: <20090522204445.GG6837@sgi.com>

On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote:
> ...
> Interesting... what does it deadlock with?

(My previous mail was addressing only the question above. I 
overlooked what follows.)

> And what is the hole your fix leaves? 

Well, in this small window:

static void ipoib_neigh_cleanup(struct neighbour *n)
{
        struct ipoib_neigh *neigh;
        struct ipoib_dev_priv *priv = netdev_priv(n->dev);
        unsigned long flags;
        struct ipoib_ah *ah = NULL;

        neigh = *to_ipoib_neigh(n); <------- from here
        if (neigh)
                priv = netdev_priv(neigh->dev);
        else
                return;
        ipoib_dbg(priv,
                  "neigh_cleanup for %06x %pI6\n",
                  IPOIB_QPN(n->ha),
                  n->ha + 4);   <------------ to here
        spin_lock_irqsave(&priv->lock, flags);


we could be using a no-longer-valid neigh.

> If the (neigh!=NULL) check passes
> with the spinlock held, shouldn't it be OK to list_del() it?

Yeah, that should be OK.

-- 
Arthur


From yossi.openib at gmail.com  Fri May 22 14:13:11 2009
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Sat, 23 May 2009 00:13:11 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090522184403.GF6837@sgi.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com>
	<20090522184403.GF6837@sgi.com>
Message-ID: <4A171567.7060001@gmail.com>

akepner at sgi.com wrote:
> diff -rup c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c
> --- c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c  2008-04-22 13:25:23.131563415 -0700
> +++ d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c  2008-04-22 15:24:31.475721847 -0700
> @@ -821,11 +821,15 @@ static void ipoib_neigh_cleanup(struct n
>         unsigned long flags;
>         struct ipoib_ah *ah = NULL;
> 
> +       spin_lock_irqsave(&priv->lock, flags);
>         neigh = *to_ipoib_neigh(n);
> +       spin_unlock_irqrestore(&priv->lock, flags);
>         if (neigh) {
> -               priv = netdev_priv(neigh->dev);
> -               ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n",
> -                         n->dev->name);
> +               if (priv != netdev_priv(neigh->dev)) {
> +                       ipoib_dbg(priv, "neigh_destructor for bonding device: "
> +                               "%s\n", n->dev->name);
> +                       priv = netdev_priv(neigh->dev);
> +               }
>         } else
>                 return;
>         ipoib_dbg(priv,

Now I see that the patch that caused the deadlock does a little more than
move spin_lock_irqsave() a few lines up in the code.
The code above looks a little suspicious. The spin_lock_irqsave() above
looks redundant: someone could kfree the neigh after you release the lock,
and you would get a corrupted `priv'.

Besides, I see that in the 1.3.1 code there is a test 
'if (n->dev->type != ARPHRD_INFINIBAND)', check this out:
http://www.mail-archive.com/general at lists.openfabrics.org/msg00839.html

--Yossi


From yossi.openib at gmail.com  Fri May 22 14:16:24 2009
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Sat, 23 May 2009 00:16:24 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090522204445.GG6837@sgi.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com>
	<20090522204445.GG6837@sgi.com>
Message-ID: <4A171628.8090307@gmail.com>

akepner at sgi.com wrote:
> On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote:
>> ...
>> Interesting... what does it deadlock with?
> 
> (My previous mail was addressing only the question above. I 
> overlooked what follows.)
> 
>> And what is the hole your fix leaves? 
> 
> Well, in this small window:
> 
> static void ipoib_neigh_cleanup(struct neighbour *n)
> {
>         struct ipoib_neigh *neigh;
>         struct ipoib_dev_priv *priv = netdev_priv(n->dev);
>         unsigned long flags;
>         struct ipoib_ah *ah = NULL;
> 
>         neigh = *to_ipoib_neigh(n); <------- from here
>         if (neigh)
>                 priv = netdev_priv(neigh->dev);
>         else
>                 return;
>         ipoib_dbg(priv,
>                   "neigh_cleanup for %06x %pI6\n",
>                   IPOIB_QPN(n->ha),
>                   n->ha + 4);   <------------ to here
> 	spin_lock_irqsave(&priv->lock, flags);
> 
> 
> we could be using a no-longer-valid neigh.
> 
>> If the (neigh!=NULL) check passes
>> with the spinlock held, shouldn't it be OK to list_del() it?
> 
> Yeah, that should be OK.
> 

So it's a catch.. You can't take priv out of neigh without being 
protected by the spinlock (or someone will kfree the neigh), but you
need priv to get the spinlock in the first place..
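
A minimal user-space sketch of the pattern agreed on above (read the
pointer only while the lock is held, then free it), assuming the lock
lives in an object that outlives the entry. All names here (struct owner,
struct entry, cleanup_slot, demo) are hypothetical stand-ins, a pthread
mutex stands in for priv->lock, and the sketch does not solve the bonding
case where the right priv can only be reached through neigh itself:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ipoib_neigh. */
struct entry {
	int payload;
};

/* Hypothetical long-lived object that owns the lock (the role of priv). */
struct owner {
	pthread_mutex_t lock;
	struct entry *slot;	/* the role of *to_ipoib_neigh(n) */
};

/*
 * Detach and free whatever is in o->slot. The pointer is read only while
 * the lock is held, so a concurrent cleanup cannot leave us with a stale
 * entry. Returns 1 if an entry was freed, 0 if the slot was already empty.
 */
static int cleanup_slot(struct owner *o)
{
	struct entry *e;
	int freed = 0;

	pthread_mutex_lock(&o->lock);
	e = o->slot;		/* read under the lock */
	if (e) {
		o->slot = NULL;
		free(e);
		freed = 1;
	}
	pthread_mutex_unlock(&o->lock);
	return freed;
}

/* Runs cleanup twice on one entry; only the first call may free it. */
static int demo(void)
{
	struct owner o;
	int first, second;

	pthread_mutex_init(&o.lock, NULL);
	o.slot = malloc(sizeof(*o.slot));
	if (!o.slot)
		return -1;
	o.slot->payload = 42;

	first = cleanup_slot(&o);
	second = cleanup_slot(&o);
	pthread_mutex_destroy(&o.lock);
	return first * 10 + second;
}
```

demo() returns 10 when the entry was freed exactly once: the second
cleanup_slot() call sees NULL under the lock and does nothing.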


From akepner at sgi.com  Fri May 22 14:24:38 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 22 May 2009 14:24:38 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A171628.8090307@gmail.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com>
	<20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com>
	<20090522204445.GG6837@sgi.com> <4A171628.8090307@gmail.com>
Message-ID: <20090522212438.GH6837@sgi.com>

On Sat, May 23, 2009 at 12:16:24AM +0300, Yossi Etigin wrote:
> ....
> So it's a catch.. You can't take priv out of neigh without being 
> protected by the spinlock (or someone will kfree the neigh), but you
> need priv to get the spinlock in the first place..

Exactly.

-- 
Arthur


From vlad at lists.openfabrics.org  Sat May 23 03:22:59 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 23 May 2009 03:22:59 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090523-0200 daily build status
Message-ID: <20090523102259.B2CA7E61445@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From ogerlitz at voltaire.com  Sat May 23 22:11:32 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 24 May 2009 08:11:32 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090521193910.GX6837@sgi.com>
References: <20090519215505.GN6837@sgi.com>
	<adaws8bbs55.fsf@cisco.com>	<20090520213703.GT6837@sgi.com>
	<4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com>
Message-ID: <4A18D704.1020000@voltaire.com>

akepner at sgi.com wrote:
> Hmmm, it's not obvious to me that that commit ecbb416939da77c0d107409976499724baddce7b would be relevant to the bug that I mentioned earlier
If it's not related to the phenomenon addressed by that commit, then 
repeating the question posed by Roland: how come a neigh cleanup 
callback is invoked when someone out there has a ref on the neighbour? 
Also, I'd like to clarify with you whether the rest of this thread applies 
only to 2.6.16 and possibly older kernels, or to the current mainline 
bits?

Or.


From vlad at lists.openfabrics.org  Sun May 24 03:21:59 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 24 May 2009 03:21:59 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090524-0200 daily build status
Message-ID: <20090524102159.6FECBE612F2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From vlad at lists.openfabrics.org  Mon May 25 03:21:45 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 25 May 2009 03:21:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090525-0200 daily build status
Message-ID: <20090525102146.158C4E61571@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From zafargilani at gmail.com  Mon May 25 08:36:02 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Mon, 25 May 2009 20:36:02 +0500
Subject: [ofa-general] Sending two integers via RDMA_WRITE
Message-ID: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>

Hi,

I am trying to send two integers (essentially a buf array of type uint32_t)
to the server via the RDMA_WRITE method. The following is the piece of code
that I rewrote:

++++++++++++++++++++++++++++++++++

  buf[0] = strtoul(argv[2], NULL, 0);
  buf[1] = strtoul(argv[3], NULL, 0);

  printf("%d + %d = ", buf[0], buf[1]);

  buf[0] = htonl(buf[0]);
  buf[1] = htonl(buf[1]);

  /* -----------------------------------
     ---- START - write operation 1 ----
     ----------------------------------- */

int c;
for(c = 0; c < 2; c++)
{
  sge.addr = (uintptr_t) buf + ((uint32_t)c*sizeof(uint32_t));
  sge.length = sizeof(uint32_t);
  sge.lkey = mr->lkey;

  send_wr.wr_id = (uint64_t)(c+1);//1;
  send_wr.opcode = IBV_WR_RDMA_WRITE;
  send_wr.sg_list = &sge;
  send_wr.num_sge = 1;
  send_wr.wr.rdma.rkey = ntohl(server_pdata.buf_rkey);
  send_wr.wr.rdma.remote_addr = ntohll(server_pdata.buf_va);

  if(ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr))
        return 1;
}

++++++++++++++++++++++++++++++++++

I receive no compilation errors but it does not write to remote memory. Any
suggestions of what might be wrong?

Thanks,
-- 
Syed Zafar ul Hussan Gilani | BIT-7
Research Student | CHPSC
MSP 2008-09
NUST SEECS | http://hpc.niit.edu.pk/~zafar

From dotanba at gmail.com  Mon May 25 10:46:47 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 25 May 2009 19:46:47 +0200
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
Message-ID: <4A1AD987.2080200@gmail.com>

Hi.

Why do you use ntohl() on the rkey/remote_addr?
Which QP type is it? (RC or UC).
Did you poll for a completion and check that the status is good?
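
One more thing that may be worth checking in the loop quoted below:
sge.addr advances with c, but wr.rdma.remote_addr is the same in both
iterations, so the second write lands on top of the first (which may or
may not be what is intended). A minimal sketch of per-element addressing,
assuming the server's registered buffer is at least two uint32_t wide;
struct write_target and target_for() are hypothetical helpers, not part
of libibverbs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the addresses one RDMA write would use. */
struct write_target {
	uint64_t local_addr;	/* would go into sge.addr */
	uint64_t remote_addr;	/* would go into send_wr.wr.rdma.remote_addr */
	uint32_t length;	/* would go into sge.length */
};

/*
 * Compute where element idx of a uint32_t array lives locally and where
 * it should land remotely, given the (host byte order) remote base.
 */
static struct write_target target_for(const uint32_t *local_buf,
				      uint64_t remote_base, size_t idx)
{
	struct write_target t;

	t.local_addr = (uint64_t)(uintptr_t)local_buf + idx * sizeof(uint32_t);
	t.remote_addr = remote_base + idx * sizeof(uint32_t);
	t.length = (uint32_t)sizeof(uint32_t);
	return t;
}
```

With this, iteration c would use target_for(buf, remote_base, c), so the
two integers end up next to each other in the remote buffer.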

Dotan


Zafar Gilani wrote:
> Hi,
>
> I am trying to send two integers (essentially a buf array of type 
> uint32_t) to the server via RDMA_WRITE method. The following is piece 
> of code that I rewrote:
>
> ++++++++++++++++++++++++++++++++++
>
>   buf[0] = strtoul(argv[2], NULL, 0);
>   buf[1] = strtoul(argv[3], NULL, 0);
>
>   printf("%d + %d = ", buf[0], buf[1]);
>
>   buf[0] = htonl(buf[0]);
>   buf[1] = htonl(buf[1]);
>
>   /* -----------------------------------
>      ---- START - write operation 1 ----
>      ----------------------------------- */
>
> int c;
> for(c = 0; c < 2; c++)
> {
>   sge.addr = (uintptr_t) buf + ((uint32_t)c*sizeof(uint32_t));
>   sge.length = sizeof(uint32_t);
>   sge.lkey = mr->lkey;
>
>   send_wr.wr_id = (uint64_t)(c+1);//1;
>   send_wr.opcode = IBV_WR_RDMA_WRITE;
>   send_wr.sg_list = &sge;
>   send_wr.num_sge = 1;
>   send_wr.wr.rdma.rkey = ntohl(server_pdata.buf_rkey);
>   send_wr.wr.rdma.remote_addr = ntohll(server_pdata.buf_va);
>
>   if(ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr))
>         return 1;
> }
>
> ++++++++++++++++++++++++++++++++++
>
> I receive no compilation errors but it does not write to remote 
> memory. Any suggestions of what might be wrong?
>
> Thanks,
> -- 
> Syed Zafar ul Hussan Gilani | BIT-7
> Research Student | CHPSC
> MSP 2008-09
> NUST SEECS | http://hpc.niit.edu.pk/~zafar 


From zafargilani at gmail.com  Mon May 25 11:08:26 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Mon, 25 May 2009 23:08:26 +0500
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <4A1AD987.2080200@gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
	<4A1AD987.2080200@gmail.com>
Message-ID: <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>

Thanks for the reply. I have seen your name listed as the author of maybe all
the IBV structs/operations pages at linux.die.net, so I am highly impressed by
the work (can only dream of it myself). :)

1. I don't know why the original author (Roland Dreier) used ntohl() for
rkey and remote_addr. Using it on the buffer (buf), though, is essential in
order to convert the byte order from network to host.

2. I am using QP type RC for reliable connection.

3. Yes I am checking for that but the code gets stuck before that, around
when I call ibv_get_cq_event to wait for next completion event in the event
channel. I think the second write (second iteration of for loop) is not
working properly since when I try to send buf[0] via RDMA_WRITE and buf[1]
via SEND then it works fine.
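
On point 1, a small sketch of why such conversions show up. It assumes,
as the field names buf_va and buf_rkey suggest, that the server packs its
buffer address and rkey into the connection private data in network byte
order, so the client has to convert them back before use. struct pdata,
pack_pdata(), my_htonll() and my_ntohll() are hypothetical names for
illustration (there is no standard 64-bit ntohll, so one is written out
here):

```c
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the private data exchanged at connect time. */
struct pdata {
	uint64_t buf_va;	/* remote buffer address, network byte order */
	uint32_t buf_rkey;	/* remote key, network byte order */
};

/* Host to network (big-endian) conversion for 64-bit values. */
static uint64_t my_htonll(uint64_t host)
{
	uint8_t b[8];
	uint64_t net;
	int i;

	for (i = 0; i < 8; i++)
		b[i] = (uint8_t)(host >> (56 - 8 * i));
	memcpy(&net, b, sizeof(net));
	return net;
}

/* Network (big-endian) to host conversion for 64-bit values. */
static uint64_t my_ntohll(uint64_t net)
{
	uint8_t b[8];
	uint64_t host = 0;
	int i;

	memcpy(b, &net, sizeof(b));
	for (i = 0; i < 8; i++)
		host = (host << 8) | b[i];
	return host;
}

/* What the sending side would do before putting values on the wire. */
static struct pdata pack_pdata(uint64_t va, uint32_t rkey)
{
	struct pdata p;

	p.buf_va = my_htonll(va);
	p.buf_rkey = htonl(rkey);
	return p;
}
```

The receiving side would then call my_ntohll() on buf_va and ntohl() on
buf_rkey, which matches the calls in the code I posted.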

The code I am talking about:

 while(1) {
    if(ibv_get_cq_event(comp_chan, &evt_cq, &cq_context)) // here it gets stuck
        return 1;

// does not print this
printf("after get_cq_event\n"); fflush(stdout);

    if(ibv_req_notify_cq(cq, 0))
        return 1;

printf("after req_notify_cq\n"); fflush(stdout);

    if(ibv_poll_cq(cq, 1, &wc) != 1)
        return 1;

printf("after poll_cq\n"); fflush(stdout);

    if(wc.status != IBV_WC_SUCCESS)
        return 1;

printf("after wc.status\n"); fflush(stdout);

    if(wc.wr_id == 0) {
        printf("%d\n", ntohl(buf[0])); fflush(stdout);
        return 0;
    }
  }

Your thoughts on this?

Thanks,
zafar
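
A note on point 1 above: values that travel between two hosts are
conventionally kept in network (big-endian) byte order, which is presumably
why the example converts rkey and remote_addr as well as the payload. The
sketch below is only an illustration of that convention for two integers
and for a 64-bit value. The helper names (pack_two_ints, unpack_two_ints,
hton64, ntoh64) are made up for this example and are not part of
libibverbs; only htonl() and ntohl() are standard calls.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>      /* memcpy */

/* Hypothetical helper: store two host-order integers in network order. */
void pack_two_ints(uint32_t wire[2], uint32_t a, uint32_t b)
{
	wire[0] = htonl(a);
	wire[1] = htonl(b);
}

/* Hypothetical helper: convert them back on the receiving host. */
void unpack_two_ints(const uint32_t wire[2], uint32_t *a, uint32_t *b)
{
	*a = ntohl(wire[0]);
	*b = ntohl(wire[1]);
}

/*
 * ntohl()/htonl() only cover 32 bits, so a 64-bit value such as a
 * remote address needs its two halves converted separately.
 */
uint64_t hton64(uint64_t host)
{
	uint32_t hi = htonl((uint32_t)(host >> 32));
	uint32_t lo = htonl((uint32_t)(host & 0xffffffffu));
	uint64_t wire;

	memcpy(&wire, &hi, sizeof hi);                     /* high half first */
	memcpy((char *)&wire + sizeof hi, &lo, sizeof lo); /* then low half */
	return wire;
}

uint64_t ntoh64(uint64_t wire)
{
	uint32_t hi, lo;

	memcpy(&hi, &wire, sizeof hi);
	memcpy(&lo, (char *)&wire + sizeof hi, sizeof lo);
	return ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
}
```

The sender would call pack_two_ints() on the registered buffer before
posting the write, and the receiver would call unpack_two_ints() after
learning that the data has arrived.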

From dotanba at gmail.com  Mon May 25 12:27:57 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 25 May 2009 21:27:57 +0200
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>	
	<4A1AD987.2080200@gmail.com>
	<7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
Message-ID: <4A1AF13D.8040506@gmail.com>

Zafar Gilani wrote:
> Thanks for the reply. I read your name under the Author for maybe all 
> the IBV structs/operations at linux.die.net <http://linux.die.net>. So 
> I am highly impressed by the work (can only dream of it myself). :)
thanks ...
>
> 1. I don't know why the original author (Roland Dreier) has used 
> ntohl() for rkey and remote_addr. Though to use it on buffer (buf) is 
> essential in order to transfer the byte order from network to host.
I would need to look at the test code to find the reason...
>
> 2. I am using QP type RC for reliable connection.
This means that if there is an error you will get a bad completion
(if you had used a UC QP, then on an error on the receiver side the
packet would simply have been dropped).
>
> 3. Yes I am checking for that but the code gets stuck before that, 
> around when I call ibv_get_cq_event to wait for next completion event 
> in the event channel. I think the second write (second iteration of 
> for loop) is not working properly since when I try to send buf[0] via 
> RDMA_WRITE and buf[1] via SEND then it works fine.
Using ibv_get_cq_event can be tricky: you must arm the CQ (call 
ibv_req_notify_cq) BEFORE a completion can enter that CQ.
To make things clearer, why don't you just poll the CQ for completions 
directly (without using the CQ events)?

I believe that you will get a completion with error...

Dotan
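
The arming rule above can be illustrated with a toy model. This is only a
sketch of the one-shot notification behaviour as described in this mail,
not libibverbs code: struct toy_cq and the toy_* functions are invented
stand-ins for the CQ, ibv_req_notify_cq(), the HCA adding a completion,
and ibv_poll_cq().

```c
#include <stdbool.h>

/* Toy stand-in for a completion queue; NOT a libibverbs type. */
struct toy_cq {
	bool armed;        /* set by toy_req_notify(), cleared when an event fires */
	int  completions;  /* entries waiting to be polled */
	int  events;       /* events waiting on the completion channel */
};

/* Models ibv_req_notify_cq(): ask for an event on the NEXT completion. */
void toy_req_notify(struct toy_cq *cq)
{
	cq->armed = true;
}

/* Models the HCA adding one completion to the CQ. */
void toy_add_completion(struct toy_cq *cq)
{
	cq->completions++;
	if (cq->armed) {
		/* Notification is one-shot: fire once, then disarm. */
		cq->events++;
		cq->armed = false;
	}
}

/* Models polling for one entry: returns 1 if an entry was consumed. */
int toy_poll(struct toy_cq *cq)
{
	if (cq->completions == 0)
		return 0;
	cq->completions--;
	return 1;
}

/* True if a blocking wait for a CQ event would return immediately. */
bool toy_event_ready(const struct toy_cq *cq)
{
	return cq->events > 0;
}
```

In this model a completion that arrives before toy_req_notify() never
produces an event, so a blocking wait would hang even though polling would
find the entry. That is why polling directly is the simpler way to see
what actually happened.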
>
> The code I am talking about:
>
>  while(1) {
>     // here it gets stuck
>     if(ibv_get_cq_event(comp_chan, &evt_cq, &cq_context))
>         return 1;
>
> // does not print this
> printf("after get_cq_event\n"); fflush(stdout);
>
>     if(ibv_req_notify_cq(cq, 0))
>         return 1;
>
> printf("after req_notify_cq\n"); fflush(stdout);
>
>     if(ibv_poll_cq(cq, 1, &wc) != 1)
>         return 1;
>
> printf("after poll_cq\n"); fflush(stdout);
>
>     if(wc.status != IBV_WC_SUCCESS)
>         return 1;
>
> printf("after wc.status\n"); fflush(stdout);
>
>     if(wc.wr_id == 0) {
>         printf("%d\n", ntohl(buf[0])); fflush(stdout);
>         return 0;
>     }
>   }
>
> Your thoughts on this?
>
> Thanks,
> zafar


From zafargilani at gmail.com  Mon May 25 12:55:48 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Tue, 26 May 2009 00:55:48 +0500
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <4A1AF13D.8040506@gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
	<4A1AD987.2080200@gmail.com>
	<7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
	<4A1AF13D.8040506@gmail.com>
Message-ID: <7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com>

I am attaching the code files (client.c and server.c). I hope I am not
bugging you too much! Is there any guide for IBV/RDMA CM? I am using the IBA
Specification, but that only provides the theory; I am looking for something
close to javadocs (if any).

I will try the rest of what you said in your mail when I can log onto the
cluster (for the time being it does not seem to be responding). I will let
you know when I do so. Meanwhile, kindly have a look at the code; it has a
lot of comments, for my own good. :)

Thanks,
zafar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: client.c
Type: application/octet-stream
Size: 17153 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090526/651ee299/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: server.c
Type: application/octet-stream
Size: 5024 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090526/651ee299/attachment-0001.obj>

From dotanba at gmail.com  Mon May 25 22:04:19 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 26 May 2009 08:04:19 +0300
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
	<4A1AD987.2080200@gmail.com>
	<7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
	<4A1AF13D.8040506@gmail.com>
	<7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com>
Message-ID: <2f3bf9a60905252204l5e5e9bbbj7e25e9b6badb30c@mail.gmail.com>

An RDMA Write doesn't produce any completion on the receiver side.

Dotan

On Mon, May 25, 2009 at 10:55 PM, Zafar Gilani <zafargilani at gmail.com> wrote:
> I am attaching the code files (client.c and server.c). I hope I am not
> bugging you that much! Is there any guide for IBV/RDMA CM? I am using IBA
> Specification but that only provides the theory, I am looking for something
> close to javadocs (if any).
>
> I will try the rest of what you said in your mail when I can log onto the
> cluster (for the time being it does not seem to be responding). I will let
> you know when I do so. Meanwhile kindly have a look at the code, lot of
> comments for my own good. :)
>
> Thanks,
> zafar
>


From zafargilani at gmail.com  Mon May 25 22:45:24 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Tue, 26 May 2009 10:45:24 +0500
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <4A1AF13D.8040506@gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com>
	<4A1AD987.2080200@gmail.com>
	<7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
	<4A1AF13D.8040506@gmail.com>
Message-ID: <7d4423d30905252245o869fd93p96facb1d1b2b94e6@mail.gmail.com>

> To make things more clear, why won't you just poll the CQ for completion
> directly? (without using the CQ events)
>
> I believe that you will get a completion with error...

Yes, I tried polling directly and it returns a negative number. What is the
remedy for this? Is my for loop logically correct (client.c)? I also tried
polling the server CQ directly (server.c), and polling there also returns a
negative number, which means that the data write is not working properly,
hence no completion events. What do you suggest I do? I am obviously lost! :(

From vlad at lists.openfabrics.org  Tue May 26 03:24:27 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 26 May 2009 03:24:27 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090526-0200 daily build status
Message-ID: <20090526102428.39CF2E61597@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From He.Huang at Sun.COM  Tue May 26 13:03:46 2009
From: He.Huang at Sun.COM (Isaac Huang)
Date: Tue, 26 May 2009 16:03:46 -0400
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and
	the TimeWait	state
Message-ID: <20090526200346.GQ4239@sun.com>

Hi all,

1.
In a previous discussion:
http://www.mail-archive.com/general at lists.openfabrics.org/msg19820.html

It was mentioned that:
You're allowed to destroy a QP earlier, but you have a remote chance of
getting into trouble if you reuse the same QP number before any stale
packets have drained from the fabric.

If rdma_destroy_qp is called on a QP before it exits the TimeWait
state (i.e. after RDMA_CM_EVENT_DISCONNECTED but before
RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent rdma_create_qp
would reuse the same QP while it's still in TimeWait? I'd think that
rdma_destroy_qp should not make a TimeWait QP immediately reusable,
but wouldn't be surprised if otherwise.

2.
In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP
could enter the TimeWait state without having entered the Established
state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT
happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a
RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to cleanup some
resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side.

Thanks,
Isaac


From or.gerlitz at gmail.com  Tue May 26 13:22:24 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 26 May 2009 23:22:24 +0300
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and
	the TimeWait state
In-Reply-To: <20090526200346.GQ4239@sun.com>
References: <20090526200346.GQ4239@sun.com>
Message-ID: <15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com>

On Tue, May 26, 2009 at 11:03 PM, Isaac Huang <He.Huang at sun.com> wrote:

> If rdma_destroy_qp is called on a QP before it exits the TimeWait state
> (i.e. after RDMA_CM_EVENT_DISCONNECTED but before
> RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent
> rdma_create_qp would reuse the same QP while it's still in TimeWait?


YES - rdma_destroy/create_qp are basically wrappers around
ib_destroy/create_qp, and the latter two are not aware in any way of the QP
state from the CM point of view.


Or.

From sean.hefty at intel.com  Tue May 26 13:43:25 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 26 May 2009 13:43:25 -0700
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT
	and	the TimeWait	state
In-Reply-To: <20090526200346.GQ4239@sun.com>
References: <20090526200346.GQ4239@sun.com>
Message-ID: <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com>

>In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP
>could enter the TimeWait state without having entered the Established
>state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT
>happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a
>RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to cleanup some
>resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side.

Yes, it's possible to enter timewait without going through established.  I'd
have to walk through the code at this point to identify all of the cases.

Note that a lot of (most?) connections between QPs are established out of
band using TCP, and these are not tracked by the CM, nor do they go through
any sort of timewait before potentially being reused.

- Sean
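
For the passive-side cleanup question, one defensive approach is to make
the timewait-exit handler release resources whether or not an established
event was ever seen. The sketch below is only a toy state tracker under
that assumption: the enum values are local stand-ins named after the RDMA
CM events discussed here (a real ULP would use the event types from
<rdma/rdma_cma.h>), and struct toy_conn is hypothetical.

```c
#include <stdbool.h>

/* Local stand-ins for the RDMA CM events discussed in this thread. */
enum toy_event {
	TOY_CONNECT_REQUEST,
	TOY_ESTABLISHED,
	TOY_DISCONNECTED,
	TOY_TIMEWAIT_EXIT,
};

/* Hypothetical per-connection state kept by the ULP. */
struct toy_conn {
	bool resources_held;  /* allocated when the connect request arrives */
	bool established;
};

void toy_handle_event(struct toy_conn *c, enum toy_event ev)
{
	switch (ev) {
	case TOY_CONNECT_REQUEST:
		c->resources_held = true;
		break;
	case TOY_ESTABLISHED:
		c->established = true;
		break;
	case TOY_DISCONNECTED:
		c->established = false;
		break;
	case TOY_TIMEWAIT_EXIT:
		/* Do not assume ESTABLISHED was ever delivered. */
		c->established = false;
		c->resources_held = false;
		break;
	}
}
```

With this shape, a connect request followed directly by a timewait exit
still ends with nothing held.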


From rdreier at cisco.com  Tue May 26 16:13:08 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 26 May 2009 16:13:08 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090507224806.GF16280@obsidianresearch.com> (Jason Gunthorpe's
	message of "Thu, 7 May 2009 16:48:06 -0600")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
Message-ID: <adabppf8nln.fsf@cisco.com>


 > >  > Or, ignore the overlapping problem, and use your original technique,
 > >  > slightly modified:
 > >  >  - Userspace registers a counter with the kernel. Kernel pins the
 > >  >    page, sets up mmu notifiers and increments the counter when
 > >  >    invalidates intersect with registrations
 > >  >  - Kernel maintains a linked list of registrations that have been
 > >  >    invalidated via mmu notifiers using the registration structure
 > >  >    and a dirty bit
 > >  >  - Userspace checks the counter at every cache hit, if different it
 > >  >    calls into the kernel:
 > >  >        MR_Cookie *mrs[100];
 > >  >        int rc = ibv_get_invalid_mrs(mrs,100);
 > >  >        invalidate_cache(mrs,rc);
 > >  >        // Repeat until drained
 > >  > 
 > >  >    get_invalid_mrs traverses the linked list and returns an
 > >  >    identifying value to userspace, which looks it up in the cache,
 > >  >    calls unregister and removes it from the cache.
 > > 
 > > What's the advantage of this?  I have to do the get_invalid_mrs() call a
 > > bunch of times, rather than just reading which ones are invalid from the
 > > cache directly?
 > 
 > This is a trade off, the above is a more normal kernel API and lets
 > the app get an list of changes it can scan. Having the kernel update
 > flags means if the app wants a list of changes it has to scan all
 > registrations.

The more I thought about this, the more I liked the idea, until I liked
it so much that I actually went ahead and prototyped this.  A
preliminary version is below -- *very* lightly tested, and no doubt
there are obvious bugs that any real use or review will uncover.  But I
thought I'd throw it out and hope for comments and/or testing.  I'm
actually pretty happy with how small and simple this ended up being.

I'll reply to this message with a simple test program I've used to
sanity check this.

===

[PATCH] ummunot: Userspace support for MMU notifications

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

NOT-Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  457 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunot.h |   85 +++++++++
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunot (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changes.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 0000000..1341edc
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,457 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunot.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+	UMMUNOT_FLAG_DIRTY	= 1,
+	UMMUNOT_FLAG_HINT	= 2,
+};
+
+struct ummunot_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+struct ummunot_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	dirty_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+};
+
+static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn)
+{
+	return container_of(mn, struct ummunot_file, mmu_notifier);
+}
+
+static void ummunot_handle_not(struct mmu_notifier *mn,
+			       unsigned long start, unsigned long end)
+{
+	struct ummunot_file *priv = to_ummunot_file(mn);
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->start >= end)
+			break;
+
+		/* Overlap test also covers [start, end) containing the region */
+		if (reg->start < end && reg->end > start) {
+			hit = 1;
+
+			if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) {
+				/* First hit since the last read: record a hint */
+				list_add_tail(&reg->list, &priv->dirty_list);
+				set_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+				reg->hint_start = start;
+				reg->hint_end   = end;
+			} else {
+				/* Several hits: the hint is no longer exact */
+				clear_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+			}
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunot_inval_page(struct mmu_notifier *mn,
+			       struct mm_struct *mm,
+			       unsigned long addr)
+{
+	ummunot_handle_not(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunot_inval_range_start(struct mmu_notifier *mn,
+				      struct mm_struct *mm,
+				      unsigned long start, unsigned long end)
+{
+	ummunot_handle_not(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = {
+	.invalidate_page	= ummunot_inval_page,
+	.invalidate_range_start	= ummunot_inval_range_start,
+};
+
+static int ummunot_open(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->dirty_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+
+	priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunot_close(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	while ((n = rb_first(&priv->reg_tree)) != NULL) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+		rb_erase(n, &priv->reg_tree);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static ssize_t ummunot_read(struct file *filp, char __user *buf,
+			    size_t count, loff_t *pos)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct ummunot_reg *reg;
+	ssize_t ret;
+	struct ummunot_event *events;
+	int max;
+	int n;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (list_empty(&priv->dirty_list)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     !list_empty(&priv->dirty_list))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min(PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->dirty_list)) {
+			events[n].type = UMMUNOT_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			break;
+		}
+
+		reg = list_first_entry(&priv->dirty_list, struct ummunot_reg,
+				       list);
+
+		events[n].type = UMMUNOT_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+			events[n].flags		= UMMUNOT_EVENT_FLAG_HINT;
+			events[n].hint_start	= reg->hint_start;
+			events[n].hint_end	= reg->hint_end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM);
+}
+
+static long ummunot_register_region(struct ummunot_file *priv,
+				    struct ummunot_register_ioctl __user *arg)
+{
+	struct ummunot_register_ioctl parm;
+	struct ummunot_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn = NULL;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+		return -EFAULT;
+
+	if (parm.intf_version != UMMUNOT_INTF_VERSION)
+		return -EINVAL;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunot_reg, node);
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+	spin_unlock_irq(&priv->lock);
+
+	return 0;
+}
+
+static long ummunot_unregister_region(struct ummunot_file *priv,
+				      __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+				list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_ioctl(struct file *filp, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct ummunot_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOT_REGISTER_REGION:
+		return ummunot_register_region(priv, argp);
+	case UMMUNOT_UNREGISTER_REGION:
+		return ummunot_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunot_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunot_vm_ops = {
+	.fault		= ummunot_fault,
+};
+
+static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+	    vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunot_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunot_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunot_open,
+	.release	= ummunot_close,
+	.read		= ummunot_read,
+	.poll		= ummunot_poll,
+	.unlocked_ioctl	= ummunot_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunot_ioctl,
+#endif
+	.mmap		= ummunot_mmap,
+	.fasync		= ummunot_fasync,
+};
+
+static struct miscdevice ummunot_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunot",
+	.fops	= &ummunot_fops,
+};
+
+static int __init ummunot_init(void)
+{
+	return misc_register(&ummunot_misc);
+}
+
+static void __exit ummunot_cleanup(void)
+{
+	misc_deregister(&ummunot_misc);
+}
+
+module_init(ummunot_init);
+module_exit(ummunot_cleanup);
diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h
new file mode 100644
index 0000000..e1abd89
--- /dev/null
+++ b/include/linux/ummunot.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOT_H
+#define _LINUX_UMMUNOT_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define UMMUNOT_INTF_VERSION		1
+
+enum {
+	UMMUNOT_EVENT_TYPE_INVAL	= 0,
+	UMMUNOT_EVENT_TYPE_LAST		= 1,
+};
+
+enum {
+	UMMUNOT_EVENT_FLAG_HINT		= 1 << 0,
+};
+
+/*
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * should be ignored)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+struct ummunot_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+struct ummunot_register_ioctl {
+	__u32	intf_version;	/* in */
+	__u32	reserved1;
+	__u64	start;		/* in */
+	__u64	end;		/* in */
+	__u64	user_cookie;	/* in */
+};
+
+#define UMMUNOT_MAGIC			'U'
+
+#define UMMUNOT_REGISTER_REGION		_IOWR(UMMUNOT_MAGIC, 1, \
+					      struct ummunot_register_ioctl)
+#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 2, __u64)
+
+#endif /* _LINUX_UMMUNOT_H */
-- 
1.6.0.4


From rdreier at cisco.com  Tue May 26 16:13:58 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 26 May 2009 16:13:58 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adabppf8nln.fsf@cisco.com> (Roland Dreier's message of "Tue, 26
	May 2009 16:13:08 -0700")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<adabppf8nln.fsf@cisco.com>
Message-ID: <ada7i038nk9.fsf@cisco.com>

Here's the test program:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

#define UMMUNOT_INTF_VERSION		1

enum {
	UMMUNOT_EVENT_TYPE_INVAL	= 0,
	UMMUNOT_EVENT_TYPE_LAST		= 1,
};

enum {
	UMMUNOT_EVENT_FLAG_HINT		= 1 << 0,
};

/*
 * If type field is INVAL, then user_cookie_counter holds the
 * user_cookie for the region being reported; if the HINT flag is set
 * then hint_start/hint_end hold the start and end of the mapping that
 * was invalidated.  (If HINT is not set, then multiple events
 * invalidated parts of the registered range and hint_start/hint_end
 * should be ignored)
 *
 * If type is LAST, then the read operation has emptied the list of
 * invalidated regions, and user_cookie_counter holds the value of the
 * kernel's generation counter when the empty list occurred.  The
 * other fields are not filled in for this event.
 */
struct ummunot_event {
	__u32	type;
	__u32	flags;
	__u64	hint_start;
	__u64	hint_end;
	__u64	user_cookie_counter;
};

struct ummunot_register_ioctl {
	__u32	intf_version;	/* in */
	__u32	reserved1;
	__u64	start;		/* in */
	__u64	end;		/* in */
	__u64	user_cookie;	/* in */
};

#define UMMUNOT_MAGIC			'U'

#define UMMUNOT_REGISTER_REGION		_IOWR(UMMUNOT_MAGIC, 1, \
					      struct ummunot_register_ioctl)
#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 2, __u64)

static int umn_fd;
static volatile unsigned long long *umn_counter;

static int umn_init(void)
{
	umn_fd = open("/dev/ummunot", O_RDONLY);
	if (umn_fd < 0) {
		perror("open");
		return 1;
	}

	umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
			   MAP_SHARED, umn_fd, 0);
	if (umn_counter == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	return 0;
}

static int umn_register(void *buf, size_t size, __u64 cookie)
{
	struct ummunot_register_ioctl r = {
		.intf_version	= UMMUNOT_INTF_VERSION,
		.start		= (unsigned long) buf,
		.end		= (unsigned long) buf + size,
		.user_cookie	= cookie,
	};

	if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) {
		perror("ioctl");
		return 1;
	}

	return 0;
}

static int umn_unregister(__u64 cookie)
{
	if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, &cookie)) {
		perror("ioctl");
		return 1;
	}

	return 0;
}

int main(int argc, char *argv[])
{
	int page_size = sysconf(_SC_PAGESIZE);
	void *t;

	if (umn_init())
		return 1;

	if (*umn_counter != 0) {
		fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter);
		return 1;
	}

	t = mmap(NULL, 3 * page_size, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (t == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (umn_register(t, 3 * page_size, 123))
		return 1;

	munmap((char *) t + page_size, page_size);

	printf("ummunot events: %lld\n", *umn_counter);

	if (*umn_counter > 0) {
		struct ummunot_event ev[2];
		int len;
		int i;

		len = read(umn_fd, &ev, sizeof ev);
		if (len < 0) {
			perror("read");
			return 1;
		}
		printf("read %d events (%d tot)\n",
		       (int) (len / sizeof ev[0]), len);

		for (i = 0; i < (int) (len / sizeof ev[0]); ++i) {
			switch (ev[i].type) {
			case UMMUNOT_EVENT_TYPE_INVAL:
				printf("[%3d]: inval cookie %lld\n",
				       i, ev[i].user_cookie_counter);
				if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
					printf("  hint %llx...%llx\n",
					       ev[i].hint_start, ev[i].hint_end);
				break;
			case UMMUNOT_EVENT_TYPE_LAST:
				printf("[%3d]: empty up to %lld\n",
				       i, ev[i].user_cookie_counter);
				break;
			default:
				printf("[%3d]: unknown event type %d\n",
				       i, ev[i].type);
				break;
			}
		}
	}

	umn_unregister(123);
	munmap(t, page_size);

	printf("ummunot events: %lld\n", *umn_counter);

	return 0;
}
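
The test above does a single read() and prints whatever comes back.  A
consumer would more likely keep reading until it sees a LAST event, so
here is a minimal sketch of scanning one batch.  The names (struct ev,
umn_scan_batch) are made up for illustration and the struct only
mirrors struct ummunot_event with <stdint.h> types; none of this is
part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same values as UMMUNOT_EVENT_TYPE_INVAL / UMMUNOT_EVENT_TYPE_LAST. */
enum { EV_TYPE_INVAL = 0, EV_TYPE_LAST = 1 };

/* Mirrors struct ummunot_event so the sketch builds without kernel headers. */
struct ev {
	uint32_t type;
	uint32_t flags;
	uint64_t hint_start;
	uint64_t hint_end;
	uint64_t user_cookie_counter;
};

/*
 * Hypothetical helper: walk one batch of events as returned by read().
 * Stores the cookie of each INVAL event in cookies[] (at most max) and
 * returns how many were stored.  If a LAST event is found, *saw_last is
 * set to 1, *generation gets the reported counter value, and the scan
 * stops there.
 */
static size_t umn_scan_batch(const struct ev *evs, size_t n,
			     uint64_t *cookies, size_t max,
			     int *saw_last, uint64_t *generation)
{
	size_t found = 0;
	size_t i;

	*saw_last = 0;
	for (i = 0; i < n; ++i) {
		if (evs[i].type == EV_TYPE_LAST) {
			*saw_last = 1;
			*generation = evs[i].user_cookie_counter;
			break;
		}
		if (evs[i].type == EV_TYPE_INVAL && found < max)
			cookies[found++] = evs[i].user_cookie_counter;
	}
	return found;
}
```

The caller would loop: read() a batch, pass it to umn_scan_batch(),
and stop once saw_last comes back set.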


From jgunthorpe at obsidianresearch.com  Tue May 26 16:51:58 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 26 May 2009 17:51:58 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adabppf8nln.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<adabppf8nln.fsf@cisco.com>
Message-ID: <20090526235158.GG29521@obsidianresearch.com>

On Tue, May 26, 2009 at 04:13:08PM -0700, Roland Dreier wrote:

>  > >  > Or, ignore the overlapping problem, and use your original technique,
>  > >  > slightly modified:
>  > >  >  - Userspace registers a counter with the kernel. Kernel pins the
>  > >  >    page, sets up mmu notifiers and increments the counter when
>  > >  >    invalidates intersect with registrations
>  > >  >  - Kernel maintains a linked list of registrations that have been
>  > >  >    invalidated via mmu notifiers using the registration structure
>  > >  >    and a dirty bit
>  > >  >  - Userspace checks the counter at every cache hit, if different it
>  > >  >    calls into the kernel:
>  > >  >        MR_Cookie *mrs[100];
>  > >  >        int rc = ibv_get_invalid_mrs(mrs,100);
>  > >  >        invalidate_cache(mrs,rc);
>  > >  >        // Repeat until drained
>  > >  > 
>  > >  >    get_invalid_mrs traverses the linked list and returns an
>  > >  >    identifying value to userspace, which looks it up in the cache,
>  > >  >    calls unregister and removes it from the cache.
>  > > 
>  > > What's the advantage of this?  I have to do the get_invalid_mrs() call a
>  > > bunch of times, rather than just reading which ones are invalid from the
>  > > cache directly?
>  > 
>  > This is a trade off, the above is a more normal kernel API and lets
>  > the app get an list of changes it can scan. Having the kernel update
>  > flags means if the app wants a list of changes it has to scan all
>  > registrations.
> 
> The more I thought about this, the more I liked the idea, until I liked
> it so much that I actually went ahead and prototyped this.  A
> preliminary version is below -- *very* lightly tested, and no doubt
> there are obvious bugs that any real use or review will uncover.  But I
> thought I'd throw it out and hope for comments and/or testing.  I'm
> actually pretty happy with how small and simple this ended up being.

Seems reasonable to me. This doesn't catch all mmap cases, i.e. this
kind of stuff:

 t = mmap(NULL, 3 * page_size, PROT_READ,
 		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
 if (umn_register(t, 3 * page_size, 123))
	 	return 1;

 t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0);
 // Event? Probably

 munmap(t,page_size);
 // Event? No, no MAP_POPULATE

 t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0);
 // Event? No

And I guess the use of MAP_POPULATE is deliberate, as that's how the mmu
notifier works.

So the use model for an MPI would be to call ibv_register/umn_register
and watch for events. Any event at all means the entire region is
toast and must be re-registered the next time someone calls with that
address. ibv_register does the same as MAP_POPULATE internally.

The MPI library uses the result of this to build a list of invalidated
regions. From time to time the MPI library should unregister those
regions.
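
Roughly, the bookkeeping I have in mind looks like the sketch below.
It is only an illustration: the cache layout and every name in it
(reg_entry, cache_add, cache_mark_invalid, cache_lookup, cache_reap)
are invented here, and the actual MR unregister call is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8	/* arbitrary size for this sketch */

/* Hypothetical registration cache entry, keyed by the ummunot cookie. */
struct reg_entry {
	uintptr_t start;
	uintptr_t end;		/* exclusive */
	uint64_t cookie;
	int in_use;
	int invalid;		/* set once any event names this cookie */
};

static struct reg_entry cache[CACHE_SLOTS];

/* Record a new registration; returns 0 on success, -1 if the cache is full. */
static int cache_add(uintptr_t start, uintptr_t end, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < CACHE_SLOTS; ++i) {
		if (!cache[i].in_use) {
			cache[i].start = start;
			cache[i].end = end;
			cache[i].cookie = cookie;
			cache[i].in_use = 1;
			cache[i].invalid = 0;
			return 0;
		}
	}
	return -1;
}

/* Any event at all for a cookie means the whole region is toast. */
static void cache_mark_invalid(uint64_t cookie)
{
	size_t i;

	for (i = 0; i < CACHE_SLOTS; ++i)
		if (cache[i].in_use && cache[i].cookie == cookie)
			cache[i].invalid = 1;
}

/* A hit requires a still-valid entry covering [start, end). */
static struct reg_entry *cache_lookup(uintptr_t start, uintptr_t end)
{
	size_t i;

	for (i = 0; i < CACHE_SLOTS; ++i)
		if (cache[i].in_use && !cache[i].invalid &&
		    cache[i].start <= start && end <= cache[i].end)
			return &cache[i];
	return NULL;
}

/*
 * "From time to time": drop the invalidated entries and return how many
 * were dropped.  A real library would also unregister each MR here.
 */
static size_t cache_reap(void)
{
	size_t i, n = 0;

	for (i = 0; i < CACHE_SLOTS; ++i) {
		if (cache[i].in_use && cache[i].invalid) {
			cache[i].in_use = 0;
			++n;
		}
	}
	return n;
}
```

A lookup miss after an invalidation is what forces the re-register on
the next call with that address.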

If that is the use then the kernel side should probably also be a
one-shot type of interface.

I'm also trying to think of a use case outside of RDMA and failing - if
the kernel hasn't pinned the pages being watched through some other
means it seems useless as a general feature?

Jason


From klakshman03 at hotmail.com  Tue May 26 20:10:33 2009
From: klakshman03 at hotmail.com (lakshmana swamy)
Date: Wed, 27 May 2009 08:40:33 +0530
Subject: [ofa-general] ***SPAM*** nfs over rdma in Cent OS 5.2
Message-ID: <BAY101-W21D67FB9CFE3B0E352BFD5B8530@phx.gbl>


Hi All,


I am going to set up NFS over RDMA on a cluster of 10 nodes running *CentOS 5.2*. Since NFS over RDMA is not supported by the default CentOS 5.2 kernel, we need to upgrade the kernel version. *Which versions of the kernel, nfs-utils and OFED are the best combination* for a smooth installation?


Thanks,

laxman

From jon at opengridcomputing.com  Wed May 27 02:52:18 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Wed, 27 May 2009 04:52:18 -0500
Subject: [ofa-general] ***SPAM*** nfs over rdma in Cent OS 5.2
In-Reply-To: <BAY101-W21D67FB9CFE3B0E352BFD5B8530@phx.gbl>
References: <BAY101-W21D67FB9CFE3B0E352BFD5B8530@phx.gbl>
Message-ID: <20090527095216.GA21367@opengridcomputing.com>

On Wed, May 27, 2009 at 08:40:33AM +0530, lakshmana swamy wrote:
> 
> 
> Hi All,
> 
> 
> 
> I am going to set up NFS over RDMA on a cluster of 10 nodes running *CentOS 5.2*. Since NFS over RDMA is not supported by the default CentOS 5.2 kernel, we need to upgrade the kernel version. *Which versions of the kernel, nfs-utils and OFED are the best combination* for a smooth installation?
 
OFED 1.4.1 will have support for the stock CentOS/RHEL 5.2 kernel, and
should be coming out any day now.

If you do not want to wait for it to come out, then you should use the
2.6.28 kernel, nfs-utils 1.5, and OFED 1.4.  The basics of how to set it
up can be found in the Linux kernel documentation at
Documentation/filesystems/nfs-rdma.txt

Thanks,
Jon

> 
> 
> 
> 
> Thanks,
> 
> laxman


From Line.Holen at Sun.COM  Wed May 27 03:07:41 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Wed, 27 May 2009 12:07:41 +0200
Subject: [ofa-general] [PATCH] osm_ucast_ftree.c Allow horizontal links
 between switches of max rank
Message-ID: <4A1D10ED.6040001@Sun.COM>

This patch makes it legal to have cross links (horizontal links) between
switches at max rank. These switches have the same rank, so the hop count
can no longer be calculated from rank.
The horizontal links are treated as downlinks: switch A has a downlink to B
while B has a downlink to A. Tests on LIDs and on the number of hops make
sure that we don't loop back and forth across the link.

Signed-off-by: Frank Olaf Sem-Jacobsen <frankose at simula.no>
Signed-off-by: Line Holen <Line.Holen at sun.com>

---
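
Not part of the commit: a minimal sketch of the hop-count test that
replaces the rank arithmetic in fabric_route_upgoing_by_going_down().
The helper name already_routed is made up for this note, and NO_PATH
is a stand-in for OSM_NO_PATH (its value here is an assumption of the
sketch).

```c
#include <stdint.h>

#define NO_PATH 0xFF	/* stand-in for OSM_NO_PATH in this sketch */

/*
 * Returns 1 if descending into the remote switch should be skipped:
 * the remote already has a recorded hop count to the target, and it is
 * not larger than the hop count at the current switch.
 */
static int already_routed(uint8_t remote_least_hops, uint8_t current_hops)
{
	return remote_least_hops != NO_PATH &&
	       remote_least_hops <= current_hops;
}
```

With a horizontal link between A and B: A (1 hop from the target) sees
B with no path and routes it; when the recursion reaches B at 2 hops
and looks back across the link, A's recorded 1 hop is <= 2, so the
walk stops instead of bouncing between the two switches.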

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 8ed2f74..1f1d0ff 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
@@ -1495,7 +1496,8 @@ static void fabric_make_indexing(IN ftree_fabric_t * p_ftree)
 				p_remote_sw =
 				    p_sw->down_port_groups[i]->remote_hca_or_sw.
 				    p_sw;
-				if (tuple_assigned(p_remote_sw->tuple)) {
+				if (tuple_assigned(p_remote_sw->tuple) ||
+				    (p_sw->rank == p_remote_sw->rank)) {
 					/* this switch has been already indexed */
 					continue;
 				}
@@ -1903,12 +1905,11 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 				   IN ftree_sw_t * p_sw,
 				   IN ftree_sw_t * p_prev_sw,
 				   IN uint16_t target_lid,
-				   IN uint8_t target_rank,
 				   IN boolean_t is_real_lid,
 				   IN boolean_t is_main_path,
 				   IN boolean_t is_target_a_sw,
-				   IN uint8_t highest_rank_in_route,
-				   IN uint16_t reverse_hops)
+				   IN uint16_t reverse_hops,
+				   IN uint8_t current_hops)
 {
 	ftree_sw_t *p_remote_sw;
 	uint16_t ports_num;
@@ -1919,6 +1920,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 	uint16_t j;
 	uint16_t k;
 	boolean_t created_route = FALSE;
+	uint8_t least_hops;
 
 	/* we shouldn't enter here if both real_lid and main_path are false */
 	CL_ASSERT(is_real_lid || is_main_path);
@@ -1968,14 +1970,15 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 		   Set on the remote switch how to get to the target_lid -
 		   set LFT(target_lid) on the remote switch to the remote port */
 		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
+		least_hops = sw_get_least_hops(p_remote_sw, target_lid);
 
-		if (sw_get_least_hops(p_remote_sw, target_lid) != OSM_NO_PATH) {
+		if ((least_hops != OSM_NO_PATH) && (least_hops <= current_hops)) {
 			/* Loop in the fabric - we already routed the remote switch
 			   on our way UP, and now we see it again on our way DOWN */
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Loop of lenght %d in the fabric:\n                             "
 				"Switch %s (LID %u) closes loop through switch %s (LID %u)\n",
-				(p_remote_sw->rank - highest_rank_in_route) * 2,
+				current_hops,
 				tuple_to_str(p_remote_sw->tuple),
 				p_group->base_lid,
 				tuple_to_str(p_sw->tuple),
@@ -2022,20 +2025,17 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 			p_remote_sw->p_osm_sw->new_lft[target_lid] =
 			    p_min_port->remote_port_num;
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-				"Switch %s: set path to CA LID %u through port %u\n",
+				"Switch %s: set path to CA LID %u through port %u, hops %u\n",
 				tuple_to_str(p_remote_sw->tuple),
 				target_lid,
-				p_min_port->remote_port_num);
+				p_min_port->remote_port_num,
+				current_hops + 1);
 
 			/* On the remote switch that is pointed by the p_group,
 			   set hops for ALL the ports in the remote group. */
 
 			set_hops_on_remote_sw(p_group, target_lid,
-					      ((target_rank -
-						highest_rank_in_route) +
-					       (p_remote_sw->rank -
-						highest_rank_in_route) +
-					       reverse_hops * 2),
+					      current_hops + 1 + reverse_hops * 2,
 					      is_target_a_sw);
 		}
 
@@ -2050,13 +2050,13 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 		/* Recursion step:
 		   Assign upgoing ports by stepping down, starting on REMOTE switch */
 		created_route |= fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
-								    NULL,	/* prev. position - NULL to mark that we went down and not up */
+								    p_sw,	/* prev. position */
 								    target_lid,	/* LID that we're routing to */
-								    target_rank,	/* rank of the LID that we're routing to */
 								    is_real_lid,	/* whether the target LID is real or dummy */
 								    is_main_path,	/* whether this is path to HCA that should by tracked by counters */
 								    is_target_a_sw,	/* Wheter target lid is a switch or not */
-								    highest_rank_in_route, reverse_hops);	/* highest visited point in the tree before going down */
+								    reverse_hops,
+								    current_hops + 1);
 	}
 	/* done scanning all the down-going port groups */
 
@@ -2087,12 +2087,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 					       IN ftree_sw_t * p_sw,
 					       IN ftree_sw_t * p_prev_sw,
 					       IN uint16_t target_lid,
-					       IN uint8_t target_rank,
 					       IN boolean_t is_real_lid,
 					       IN boolean_t is_main_path,
 					       IN boolean_t is_target_a_sw,
 					       IN uint16_t reverse_hop_credit,
-					       IN uint16_t reverse_hops)
+					       IN uint16_t reverse_hops,
+					       IN uint8_t current_hops)
 {
 	ftree_sw_t *p_remote_sw;
 	uint16_t ports_num;
@@ -2110,12 +2110,11 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 	fabric_route_upgoing_by_going_down(p_ftree, p_sw,	/* local switch - used as a route-upgoing alg. start point */
 					   p_prev_sw,	/* switch that we went up from (NULL means that we went down) */
 					   target_lid,	/* LID that we're routing to */
-					   target_rank,	/* rank of the LID that we're routing to */
 					   is_real_lid,	/* whether this target LID is real or dummy */
 					   is_main_path,	/* whether this path to HCA should by tracked by counters */
 					   is_target_a_sw,	/* Wheter target lid is a switch or not */
-					   p_sw->rank,	/* the highest visited point in the tree before going down */
-					   reverse_hops);	/* Number of reverse_hops done up to this point */
+					   reverse_hops,	/* Number of reverse_hops done up to this point */
+					   current_hops);
 
 	/* recursion stop condition - if it's a root switch, */
 	if (p_sw->rank == 0) {
@@ -2140,12 +2139,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 				fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
 								   p_sw,	/* this switch - prev. position switch for the function */
 								   target_lid,	/* LID that we're routing to */
-								   target_rank,	/* rank of the LID that we're routing to */
 								   is_real_lid,	/* whether this target LID is real or dummy */
 								   is_main_path,	/* whether this is path to HCA that should by tracked by counters */
 								   is_target_a_sw,	/* Wheter target lid is a switch or not */
 								   reverse_hop_credit - 1,	/* Remaining reverse_hops allowed */
-								   reverse_hops + 1);	/* Number of reverse_hops done up to this point */
+								   reverse_hops + 1,	/* Number of reverse_hops done up to this point */
+								   current_hops + 1);
 			}
 
 		}
@@ -2244,17 +2243,17 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 				    new_lft[target_lid] =
 				    p_min_port->remote_port_num;
 				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-					"Switch %s: set path to CA LID %u through port %u\n",
+					"Switch %s: set path to CA LID %u through port %u, hops %u\n",
 					tuple_to_str(p_remote_sw->tuple),
 					target_lid,
-					p_min_port->remote_port_num);
+					p_min_port->remote_port_num, current_hops + 1);
 			}
 			/* On the remote switch that is pointed by the min_group,
 			   set hops for ALL the ports in the remote group. */
 
 			set_hops_on_remote_sw(p_min_group, target_lid,
-					      target_rank - p_remote_sw->rank +
-					      2 * reverse_hops, is_target_a_sw);
+					      current_hops + 1 + 2 * reverse_hops,
+					      is_target_a_sw);
 		}
 
 		/* Recursion step:
@@ -2262,12 +2261,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
 						   p_sw,	/* this switch - prev. position switch for the function */
 						   target_lid,	/* LID that we're routing to */
-						   target_rank,	/* rank of the LID that we're routing to */
 						   is_real_lid,	/* whether this target LID is real or dummy */
 						   is_main_path,	/* whether this is path to HCA that should by tracked by counters */
 						   is_target_a_sw,	/* Wheter target lid is a switch or not */
 						   reverse_hop_credit,	/* Remaining reverse_hops allowed */
-						   reverse_hops);	/* Number of reverse_hops done up to this point */
+						   reverse_hops,	/* Number of reverse_hops done up to this point */
+						   current_hops + 1);
 	}
 
 	/* we're done for the third case */
@@ -2335,22 +2334,21 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 
 		/* On the remote switch that is pointed by the p_group,
 		   set hops for ALL the ports in the remote group. */
-
 		set_hops_on_remote_sw(p_group, target_lid,
-				      target_rank - p_remote_sw->rank +
-				      2 * reverse_hops, is_target_a_sw);
+				      current_hops + 1 + 2 * reverse_hops,
+				      is_target_a_sw);
 
 		/* Recursion step:
 		   Assign downgoing ports by stepping up, starting on REMOTE switch. */
 		fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
 						   p_sw,	/* this switch - prev. position switch for the function */
 						   target_lid,	/* LID that we're routing to */
-						   target_rank,	/* rank of the LID that we're routing to */
 						   TRUE,	/* whether the target LID is real or dummy */
 						   FALSE,	/* whether this is path to HCA that should by tracked by counters */
 						   is_target_a_sw,	/* Wheter target lid is a switch or not */
 						   reverse_hop_credit,	/* Remaining reverse_hops allowed */
-						   reverse_hops);	/* Number of reverse_hops done up to this point */
+						   reverse_hops,	/* Number of reverse_hops done up to this point */
+						   current_hops + 1);
 	}
 
 	/* If we don't have any reverse hop credits, we are done */
@@ -2374,12 +2372,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw,	/* remote switch - used as a route-downgoing alg. next step point */
 						   p_sw,	/* this switch - prev. position switch for the function */
 						   target_lid,	/* LID that we're routing to */
-						   target_rank,	/* rank of the LID that we're routing to */
 						   TRUE,	/* whether the target LID is real or dummy */
 						   TRUE,	/* whether this is path to HCA that should by tracked by counters */
 						   is_target_a_sw,	/* Wheter target lid is a switch or not */
 						   reverse_hop_credit - 1,	/* Remaining reverse_hops allowed */
-						   reverse_hops + 1);	/* Number of reverse_hops done up to this point */
+						   reverse_hops + 1,	/* Number of reverse_hops done up to this point */
+						   current_hops + 1);
 	}
 
 }				/* ftree_fabric_route_downgoing_by_going_up() */
@@ -2451,7 +2449,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
 			    p_port->port_num;
 
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
-				"Switch %s: set path to CN LID %u through port %u\n",
+				"Switch %s: set path to CN LID %u through port %u, hop 1\n",
 				tuple_to_str(p_sw->tuple),
 				hca_lid, p_port->port_num);
 
@@ -2464,12 +2462,12 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
 			fabric_route_downgoing_by_going_up(p_ftree, p_sw,	/* local switch - used as a route-downgoing alg. start point */
 							   NULL,	/* prev. position switch */
 							   hca_lid,	/* LID that we're routing to */
-							   p_sw->rank + 1,	/* rank of the LID that we're routing to */
 							   TRUE,	/* whether this HCA LID is real or dummy */
 							   TRUE,	/* whether this path to HCA should by tracked by counters */
 							   FALSE,	/* wheter target lid is a switch or not */
 							   0,	/* Number of reverse hops allowed */
-							   0);	/* Number of reverse hops done yet */
+							   0,	/* Number of reverse hops done yet */
+							   1);
 
 			/* count how many real targets have been routed from this leaf switch */
 			routed_targets_on_leaf++;
@@ -2492,12 +2490,12 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
 				fabric_route_downgoing_by_going_up(p_ftree, p_sw,	/* local switch - used as a route-downgoing alg. start point */
 								   NULL,	/* prev. position switch */
 								   0,	/* LID that we're routing to - ignored for dummy HCA */
-								   0,	/* rank of the LID that we're routing to - ignored for dummy HCA */
 								   FALSE,	/* whether this HCA LID is real or dummy */
 								   TRUE,	/* whether this path to HCA should by tracked by counters */
 								   FALSE,	/* Wheter the target LID is a switch or not */
 								   0,	/* Number of reverse hops allowed */
-								   0);	/* Number of reverse hops done yet */
+								   0,	/* Number of reverse hops done yet */
+								   1);
 			}
 		}
 	}
@@ -2579,12 +2577,12 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
 			fabric_route_downgoing_by_going_up(p_ftree, p_sw,	/* local switch - used as a route-downgoing alg. start point */
 							   NULL,	/* prev. position switch */
 							   hca_lid,	/* LID that we're routing to */
-							   p_sw->rank + 1,	/* rank of the LID that we're routing to */
 							   TRUE,	/* whether this HCA LID is real or dummy */
 							   TRUE,	/* whether this path to HCA should by tracked by counters */
 							   FALSE,	/* Wheter the target LID is a switch or not */
 							   p_hca_port_group->is_io ? p_ftree->p_osm->subn.opt.max_reverse_hops : 0,	/* Number or reverse hops allowed */
-							   0);	/* Number or reverse hops done yet */
+							   0,	/* Number or reverse hops done yet */
+							   1);
 		}
 		/* done with all the port groups of this HCA - go to next HCA */
 	}
@@ -2632,12 +2630,12 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
 		fabric_route_downgoing_by_going_up(p_ftree, p_sw,	/* local switch - used as a route-downgoing alg. start point */
 						   NULL,	/* prev. position switch */
 						   p_sw->base_lid,	/* LID that we're routing to */
-						   p_sw->rank,	/* rank of the LID that we're routing to */
 						   TRUE,	/* whether the target LID is a real or dummy */
 						   FALSE,	/* whether this path to HCA should by tracked by counters */
 						   TRUE,	/* Wheter the target LID is a switch or not */
 						   0,	/* Number of reverse hops allowed */
-						   0);	/* Number of reverse hops done yet */
+						   0,	/* Number of reverse hops done yet */
+						   0);
 	}
 
 	OSM_LOG_EXIT(&p_ftree->p_osm->log);
@@ -3058,7 +3056,8 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree,
 
 			p_remote_hca_or_sw = (void *)p_remote_sw;
 
-			if (abs(p_sw->rank - p_remote_sw->rank) != 1) {
+			if ((abs(p_sw->rank - p_remote_sw->rank) != 1) &&
+			    (p_sw->rank != p_ftree->max_switch_rank)) {
 				OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR,
 					"ERR AB16: "
 					"Illegal link between switches with ranks %u and %u:\n"


From vlad at lists.openfabrics.org  Wed May 27 03:24:52 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 27 May 2009 03:24:52 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090527-0200 daily build status
Message-ID: <20090527102452.59AAFE615C2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From Line.Holen at Sun.COM  Wed May 27 03:32:23 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Wed, 27 May 2009 12:32:23 +0200
Subject: [ofa-general] [PATCH] osm_dump.c dump port if lft is set up
Message-ID: <4A1D16B7.7070300@Sun.COM>

dump_ucast_routes() claims that a node is unreachable if the number of
hops to it is unknown. Change this to print the actual port and give a
proper warning about the unknown hop count.

Signed-off-by: Line Holen <Line.Holen at sun.com>

---

diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c
index 946ee6a..08b3156 100644
--- a/opensm/opensm/osm_dump.c
+++ b/opensm/opensm/osm_dump.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
@@ -201,7 +202,7 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt)
 		}
 
 		if (num_hops == OSM_NO_PATH) {
-			fprintf(file, "UNREACHABLE\n");
+			fprintf(file, "%03u  : HOPS UNKNOWN\n", port_num);
 			continue;
 		}
 

From jsquyres at cisco.com  Wed May 27 10:34:22 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 May 2009 13:34:22 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <ada7i038nk9.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><20090506214628.GM2590@obsidianresearch.com><adatz3xsxo6.fsf@cisco.com><20090506222638.GA16280@obsidianresearch.com><adaprelsvnp.fsf@cisco.com><20090507000231.GB16280@obsidianresearch.com><adak54ssi0g.fsf@cisco.com><20090507224806.GF16280@obsidianresearch.com><adabppf8nln.fsf@cisco.com>
	<ada7i038nk9.fsf@cisco.com>
Message-ID: <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>

On May 26, 2009, at 7:13 PM, Roland Dreier (rdreier) wrote:

> /*
>  * If type field is INVAL, then user_cookie_counter holds the
>  * user_cookie for the region being reported; if the HINT flag is set
>  * then hint_start/hint_end hold the start and end of the mapping that
>  * was invalidated.  (If HINT is not set, then multiple events
>  * invalidated parts of the registered range and hint_start/hint_end
>  * should be ignored)
>

I don't quite grok this.  Is the intent that HINT will only be set if  
an *entire* hint_start/hint_end range is invalidated by a single  
event?  I.e., if only part of the hint_start/hint_end range is  
invalidated, you'll get the cookie back, but not what part of the  
range is invalid (because assumedly the entire IBV registration is now  
invalid anyway)?

>  * If type is LAST, then the read operation has emptied the list of
>  * invalidated regions, and user_cookie_counter holds the value of the
>  * kernel's generation counter when the empty list occurred.  The
>  * other fields are not filled in for this event.
>

Just to be clear -- we're supposed to keep reading events until we get  
a LAST event?

>         if (*umn_counter != 0) {
>                 fprintf(stderr, "counter = %lld (expected 0)\n",
>                         *umn_counter);
>                 return 1;
>         }
>

Some clarification questions about umn_counter:

1. Will it increase by 1 each time a page (or set of pages?) is  
removed from a user process?

2. Does it change if pages are *added* to a user process?  I.e., does  
the counter indicate *removals* or *changes* to the user process page  
table?

>         t = mmap(NULL, 3 * page_size, PROT_READ,
>                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>
>         if (umn_register(t, 3 * page_size, 123))
>                 return 1;
>
>         munmap(t + page_size, page_size);
>
>         printf("ummunot events: %lld\n", *umn_counter);
>
>         if (*umn_counter > 0) {
>

Is the *umn_counter value guaranteed to have been changed by the time
munmap() returns?

>                 struct ummunot_event ev[2];
>

Did you pick [2] here simply because you're only expecting an INVAL  
and a LAST event in this specific example?  I'm assuming that we  
should normally loop over reading until we get LAST, correct?

>
>                 int len;
>                 int i;
>
>                 len = read(umn_fd, &ev, sizeof ev);
>                 printf("read %d events (%d tot)\n",
>                        len / sizeof ev[0], len);
>
>                 for (i = 0; i < len / sizeof ev[0]; ++i) {
>                         switch (ev[i].type) {
>                         case UMMUNOT_EVENT_TYPE_INVAL:
>                                 printf("[%3d]: inval cookie %lld\n",
>                                        i, ev[i].user_cookie_counter);
>                                 if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT)
>                                         printf("  hint %llx...%llx\n",
>                                                ev[i].hint_start,
>                                                ev[i].hint_end);
>                                 break;
>                         case UMMUNOT_EVENT_TYPE_LAST:
>                                 printf("[%3d]: empty up to %lld\n",
>                                        i, ev[i].user_cookie_counter);
>                                 break;
>                         default:
>                                 printf("[%3d]: unknown event type %d\n",
>                                        i, ev[i].type);
>                                 break;
>                         }
>                 }
>         }
>
>         umn_unregister(123);
>

What happens if I register multiple regions with the same cookie value?

Is a process responsible for guaranteeing that it umn_unregister()s  
everything before exiting, or will all pending registrations be  
cleaned up/unregistered/whatever when a process exits?

-- 
Jeff Squyres
Cisco Systems


From rdreier at cisco.com  Wed May 27 10:49:57 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 10:49:57 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> (Jeff Squyres's
	message of "Wed, 27 May 2009 13:34:22 -0400")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<adabppf8nln.fsf@cisco.com> <ada7i038nk9.fsf@cisco.com>
	<5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>
Message-ID: <adaws8277wa.fsf@cisco.com>

 > > /*
 > >  * If type field is INVAL, then user_cookie_counter holds the
 > >  * user_cookie for the region being reported; if the HINT flag is set
 > >  * then hint_start/hint_end hold the start and end of the mapping that
 > >  * was invalidated.  (If HINT is not set, then multiple events
 > >  * invalidated parts of the registered range and hint_start/hint_end
 > >  * should be ignored)

 > I don't quite grok this.  Is the intent that HINT will only be set if
 > an *entire* hint_start/hint_end range is invalidated by a single
 > event?  I.e., if only part of the hint_start/hint_end range is
 > invalidated, you'll get the cookie back, but not what part of the
 > range is invalid (because assumedly the entire IBV registration is now
 > invalid anyway)?

Basically, I just keep one hint_start/hint_end.  If multiple events hit
the same registration then I just give up and don't give you a hint.
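
As a rough userspace model of that rule (a sketch, not the driver code;
struct region_model and both function names are invented here), the
first invalidation since the last read is kept as the hint and any
further one discards it:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of one watched region; names are illustrative only. */
struct region_model {
	bool     dirty;      /* at least one invalidation since last read */
	bool     have_hint;  /* hint_start/hint_end are meaningful */
	uint64_t hint_start;
	uint64_t hint_end;
};

/* Record an invalidation of [start, end) that hit this region. */
static void region_note_inval(struct region_model *r,
			      uint64_t start, uint64_t end)
{
	if (!r->dirty) {
		/* First hit since the last read: remember it as the hint. */
		r->dirty      = true;
		r->have_hint  = true;
		r->hint_start = start;
		r->hint_end   = end;
	} else {
		/* A second or later hit: give up on the hint. */
		r->have_hint = false;
	}
}

/* Model of read() consuming the event: the region is clean again. */
static void region_consume(struct region_model *r)
{
	r->dirty     = false;
	r->have_hint = false;
}
```

A reader that sees an event without a hint would have to treat the
whole registered range as suspect.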

 > >  * If type is LAST, then the read operation has emptied the list of
 > >  * invalidated regions, and user_cookie_counter holds the value of the
 > >  * kernel's generation counter when the empty list occurred.  The
 > >  * other fields are not filled in for this event.

 > Just to be clear -- we're supposed to keep reading events until we get
 > a LAST event?

Yes, that's probably the sanest use case.
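
A minimal sketch of such a loop, assuming the event layout from the
proposed <linux/ummunot.h>.  The struct is mirrored locally as
ev_model so the example builds without the patch applied; drain_events
and its parameters are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Local mirror of the proposed event layout. */
enum { EV_TYPE_INVAL = 0, EV_TYPE_LAST = 1 };

struct ev_model {
	uint32_t type;
	uint32_t flags;
	uint64_t hint_start;
	uint64_t hint_end;
	uint64_t user_cookie_counter;
};

/*
 * Read events from fd until a LAST event arrives.  Stores up to max
 * invalidated cookies in cookies[] and the generation counter carried
 * by LAST in *generation.  Returns the number of INVAL events seen, or
 * -1 on a read error, end of file or partial event.
 */
static int drain_events(int fd, uint64_t *cookies, size_t max,
			uint64_t *generation)
{
	struct ev_model ev[8];
	size_t seen = 0;

	for (;;) {
		ssize_t len = read(fd, ev, sizeof ev);
		size_t i, n;

		/* Error, end of file, or a partial event: give up. */
		if (len <= 0 || (size_t) len % sizeof ev[0])
			return -1;

		n = (size_t) len / sizeof ev[0];
		for (i = 0; i < n; ++i) {
			if (ev[i].type == EV_TYPE_LAST) {
				*generation = ev[i].user_cookie_counter;
				return (int) seen;
			}
			if (ev[i].type == EV_TYPE_INVAL) {
				if (seen < max)
					cookies[seen] = ev[i].user_cookie_counter;
				++seen;
			}
		}
	}
}
```

One caveat, judging from the read() implementation in the patch: if a
read returns exactly a buffer full of INVAL events and the dirty list is
then empty, the next read on a blocking descriptor waits for a new event
instead of returning LAST.  A real caller would probably want O_NONBLOCK
and EAGAIN handling, which this sketch leaves out.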

 > 1. Will it increase by 1 each time a page (or set of pages?) is
 > removed from a user process?

As it stands it increases by 1 every time there is an MMU notification,
even if that notification hits multiple registrations.  It wouldn't be
hard to change that to count the number of events generated if that
works better.
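
To make the current behaviour concrete, here is a small userspace model
(illustrative names only, not the driver code): one notification marks
every overlapping region dirty but bumps the counter a single time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one watched range [start, end). */
struct watch_model {
	uint64_t start;
	uint64_t end;
	bool     dirty;
};

/*
 * Apply one notification for [start, end) to n watched regions.
 * Marks every overlapping region dirty and bumps *counter once if at
 * least one region was hit.  Returns the number of regions hit.
 */
static size_t notify_model(struct watch_model *w, size_t n,
			   uint64_t start, uint64_t end, uint64_t *counter)
{
	size_t i, hits = 0;

	for (i = 0; i < n; ++i) {
		/* Half-open ranges overlap iff each starts before the other ends. */
		if (w[i].start < end && start < w[i].end) {
			w[i].dirty = true;
			++hits;
		}
	}

	if (hits)
		++*counter;

	return hits;
}
```

Counting generated events instead would amount to replacing the final
increment with "*counter += hits" in this model.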

 > 2. Does it change if pages are *added* to a user process?  I.e., does
 > the counter indicate *removals* or *changes* to the user process page
 > table?

No, additions don't trigger any MMU notification -- that's inherent in
the design of the MMU notifiers stuff.  The idea is that you have a
"secondary MMU" and MMU notifications are the equivalent of TLB
shootdowns; the secondary MMU is responsible for populating itself on
faults etc.

 > Is the *umn_counter value guaranteed to have been changed by the time
 > munmap() returns?

Yes.

 > Did you pick [2] here simply because you're only expecting an INVAL
 > and a LAST event in this specific example?  I'm assuming that we
 > should normally loop over reading until we get LAST, correct?

Right.

 > What happens if I register multiple regions with the same cookie value?

You get in trouble -- I need to fix things to reject duplicated cookies
actually, because otherwise there's no way to unregister.

 > Is a process responsible for guaranteeing that it umn_unregister()s
 > everything before exiting, or will all pending registrations be
 > cleaned up/unregistered/whatever when a process exits?

The kernel cleans up of course to handle crashes etc.

 - R.


From rdreier at cisco.com  Wed May 27 11:13:34 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 11:13:34 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adaws8277wa.fsf@cisco.com> (Roland Dreier's message of "Wed, 27
	May 2009 10:49:57 -0700")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<adabppf8nln.fsf@cisco.com> <ada7i038nk9.fsf@cisco.com>
	<5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>
	<adaws8277wa.fsf@cisco.com>
Message-ID: <adaocte76sx.fsf@cisco.com>

Fixed version below -- returns EINVAL for an attempt to reuse a user
cookie (since otherwise unregister would get confused).
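
The rule can be pictured with a small userspace model.  This is a
sketch only: the table type and function names are made up for
illustration, and the driver keeps its regions in an rbtree rather than
an array.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGS_MODEL 16

/* Flat table of cookies currently registered (model only). */
struct reg_table_model {
	uint64_t cookie[MAX_REGS_MODEL];
	size_t   count;
};

/* Returns 0 on success, -EINVAL if the cookie is in use, -ENOMEM if full. */
static int reg_add_model(struct reg_table_model *t, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < t->count; ++i)
		if (t->cookie[i] == cookie)
			return -EINVAL;

	if (t->count == MAX_REGS_MODEL)
		return -ENOMEM;

	t->cookie[t->count++] = cookie;
	return 0;
}

/* Returns 0 if the cookie was found and removed, -EINVAL otherwise. */
static int reg_del_model(struct reg_table_model *t, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < t->count; ++i) {
		if (t->cookie[i] == cookie) {
			/* Fill the hole with the last entry. */
			t->cookie[i] = t->cookie[--t->count];
			return 0;
		}
	}

	return -EINVAL;
}
```

With unique cookies, unregistering by cookie always names exactly one
region, which is what the unregister ioctl relies on.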

===

ummunot: Userspace support for MMU notifications

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.
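
The fast path in step 3 might look roughly like the following on the
userspace side.  This is a sketch under assumptions: struct cache_model
and both helpers are hypothetical, and kernel_counter stands for the
page obtained from mmap() on the ummunot fd at offset 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* State kept by a hypothetical registration cache. */
struct cache_model {
	const volatile uint64_t *kernel_counter; /* page mapped at offset 0 */
	uint64_t                 seen;           /* generation last synced */
};

/* Fast path: true if nothing was invalidated since the last sync. */
static bool cache_is_current(const struct cache_model *c)
{
	return *c->kernel_counter == c->seen;
}

/* Record the generation reported by a LAST event after draining read(). */
static void cache_synced(struct cache_model *c, uint64_t generation)
{
	c->seen = generation;
}
```

Only when cache_is_current() returns false does the library need to
enter the kernel and read() the pending events.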

NOT-YET-Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  457 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunot.h |   85 +++++++++
 4 files changed, 555 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunot (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changes.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 0000000..1341edc
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,457 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunot.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+	UMMUNOT_FLAG_DIRTY	= 1,
+	UMMUNOT_FLAG_HINT	= 2,
+};
+
+struct ummunot_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+struct ummunot_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	dirty_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+};
+
+static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn)
+{
+	return container_of(mn, struct ummunot_file, mmu_notifier);
+}
+
+static void ummunot_handle_not(struct mmu_notifier *mn,
+			       unsigned long start, unsigned long end)
+{
+	struct ummunot_file *priv = to_ummunot_file(mn);
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->start >= end)
+			break;
+
+		/* Hit if [start, end) overlaps the registered range. */
+		if (reg->start < end && reg->end > start) {
+			hit = 1;
+
+			if (test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) {
+				/* Hit again before being read: drop the hint. */
+				clear_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+			} else {
+				/* First hit: queue the region, keep one hint. */
+				list_add_tail(&reg->list, &priv->dirty_list);
+				set_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+				reg->hint_start = start;
+				reg->hint_end   = end;
+			}
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunot_inval_page(struct mmu_notifier *mn,
+			       struct mm_struct *mm,
+			       unsigned long addr)
+{
+	ummunot_handle_not(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunot_inval_range_start(struct mmu_notifier *mn,
+				      struct mm_struct *mm,
+				      unsigned long start, unsigned long end)
+{
+	ummunot_handle_not(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = {
+	.invalidate_page	= ummunot_inval_page,
+	.invalidate_range_start	= ummunot_inval_range_start,
+};
+
+static int ummunot_open(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->dirty_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+
+	priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunot_close(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	while ((n = rb_first(&priv->reg_tree))) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+		rb_erase(n, &priv->reg_tree);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static ssize_t ummunot_read(struct file *filp, char __user *buf,
+			    size_t count, loff_t *pos)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct ummunot_reg *reg;
+	ssize_t ret;
+	struct ummunot_event *events;
+	int max;
+	int n;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (list_empty(&priv->dirty_list)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     !list_empty(&priv->dirty_list))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min(PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->dirty_list)) {
+			events[n].type = UMMUNOT_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			break;
+		}
+
+		reg = list_first_entry(&priv->dirty_list, struct ummunot_reg,
+				       list);
+
+		events[n].type = UMMUNOT_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+			events[n].flags		= UMMUNOT_EVENT_FLAG_HINT;
+			events[n].hint_start	= reg->hint_start;
+			events[n].hint_end	= reg->hint_end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM);
+}
+
+static long ummunot_register_region(struct ummunot_file *priv,
+				    struct ummunot_register_ioctl __user *arg)
+{
+	struct ummunot_register_ioctl parm;
+	struct ummunot_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn = NULL;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+		return -EFAULT;
+
+	if (parm.intf_version != UMMUNOT_INTF_VERSION)
+		return -EINVAL;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunot_reg, node);
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+	spin_unlock_irq(&priv->lock);
+
+	return 0;
+}
+
+static long ummunot_unregister_region(struct ummunot_file *priv,
+				      __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+				list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_ioctl(struct file *filp, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct ummunot_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOT_REGISTER_REGION:
+		return ummunot_register_region(priv, argp);
+	case UMMUNOT_UNREGISTER_REGION:
+		return ummunot_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunot_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunot_vm_ops = {
+	.fault		= ummunot_fault,
+};
+
+static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+	    vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunot_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunot_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunot_open,
+	.release	= ummunot_close,
+	.read		= ummunot_read,
+	.poll		= ummunot_poll,
+	.unlocked_ioctl	= ummunot_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunot_ioctl,
+#endif
+	.mmap		= ummunot_mmap,
+	.fasync		= ummunot_fasync,
+};
+
+static struct miscdevice ummunot_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunot",
+	.fops	= &ummunot_fops,
+};
+
+static int __init ummunot_init(void)
+{
+	return misc_register(&ummunot_misc);
+}
+
+static void __exit ummunot_cleanup(void)
+{
+	misc_deregister(&ummunot_misc);
+}
+
+module_init(ummunot_init);
+module_exit(ummunot_cleanup);
diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h
new file mode 100644
index 0000000..e1abd89
--- /dev/null
+++ b/include/linux/ummunot.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOT_H
+#define _LINUX_UMMUNOT_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define UMMUNOT_INTF_VERSION		1
+
+enum {
+	UMMUNOT_EVENT_TYPE_INVAL	= 0,
+	UMMUNOT_EVENT_TYPE_LAST		= 1,
+};
+
+enum {
+	UMMUNOT_EVENT_FLAG_HINT		= 1 << 0,
+};
+
+/*
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * should be ignored)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+struct ummunot_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+struct ummunot_register_ioctl {
+	__u32	intf_version;	/* in */
+	__u32	reserved1;
+	__u64	start;		/* in */
+	__u64	end;		/* in */
+	__u64	user_cookie;	/* in */
+};
+
+#define UMMUNOT_MAGIC			'U'
+
+#define UMMUNOT_REGISTER_REGION		_IOWR(UMMUNOT_MAGIC, 1, \
+					      struct ummunot_register_ioctl)
+#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 2, __u64)
+
+#endif /* _LINUX_UMMUNOT_H */
-- 
1.6.0.4


From zafargilani at gmail.com  Wed May 27 11:21:50 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 27 May 2009 23:21:50 +0500
Subject: [ofa-general] Verbs or RDMA code via JNI
Message-ID: <7d4423d30905271121k415bf7b1ob3327a60f333c2fe@mail.gmail.com>

This information is very important for me and I will greatly appreciate
anyone who could help me. My question is:

Has anyone tried executing native C (verbs or RDMA) code via JNI (Java
Native Interface). If somebody has, then kindly let me know whether it was
successful or not.

Thanks,
-- 
Zafar

From jsquyres at cisco.com  Wed May 27 12:02:36 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 May 2009 15:02:36 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adaws8277wa.fsf@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><20090506214628.GM2590@obsidianresearch.com><adatz3xsxo6.fsf@cisco.com><20090506222638.GA16280@obsidianresearch.com><adaprelsvnp.fsf@cisco.com><20090507000231.GB16280@obsidianresearch.com><adak54ssi0g.fsf@cisco.com><20090507224806.GF16280@obsidianresearch.com><adabppf8nln.fsf@cisco.com>
	<ada7i038nk9.fsf@cisco.com><5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>
	<adaws8277wa.fsf@cisco.com>
Message-ID: <C7598186-0CFA-40F8-A94C-9D4BA0E6E85D@cisco.com>

Other MPI implementors -- what do you think of this scheme?


On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:

>
>  > > /*
>  > >  * If type field is INVAL, then user_cookie_counter holds the
>  > >  * user_cookie for the region being reported; if the HINT flag is set
>  > >  * then hint_start/hint_end hold the start and end of the mapping that
>  > >  * was invalidated.  (If HINT is not set, then multiple events
>  > >  * invalidated parts of the registered range and hint_start/hint_end
>  > >  * should be ignored)
>
>  > I don't quite grok this.  Is the intent that HINT will only be set if
>  > an *entire* hint_start/hint_end range is invalidated by a single
>  > event?  I.e., if only part of the hint_start/hint_end range is
>  > invalidated, you'll get the cookie back, but not what part of the
>  > range is invalid (because assumedly the entire IBV registration is now
>  > invalid anyway)?
>
> Basically, I just keep one hint_start/hint_end.  If multiple events hit
> the same registration then I just give up and don't give you a hint.
>
>  > >  * If type is LAST, then the read operation has emptied the list of
>  > >  * invalidated regions, and user_cookie_counter holds the value of the
>  > >  * kernel's generation counter when the empty list occurred.  The
>  > >  * other fields are not filled in for this event.
>
>  > Just to be clear -- we're supposed to keep reading events until we get
>  > a LAST event?
>
> Yes, that's probably the sanest use case.
>
>  > 1. Will it increase by 1 each time a page (or set of pages?) is
>  > removed from a user process?
>
> As it stands it increases by 1 every time there is an MMU notification,
> even if that notification hits multiple registrations.  It wouldn't be
> hard to change that to count the number of events generated if that
> works better.
>
>  > 2. Does it change if pages are *added* to a user process?  I.e., does
>  > the counter indicate *removals* or *changes* to the user process page
>  > table?
>
> No, additions don't trigger any MMU notification -- that's inherent in
> the design of the MMU notifiers stuff.  The idea is that you have a
> "secondary MMU" and MMU notifications are the equivalent of TLB
> shootdowns; the secondary MMU is responsible for populating itself on
> faults etc.
>
>  > Is the *umn_counter value guaranteed to have been changed by the time
>  > munmap() returns?
>
> Yes.
>
>  > Did you pick [2] here simply because you're only expecting an INVAL
>  > and a LAST event in this specific example?  I'm assuming that we
>  > should normally loop over reading until we get LAST, correct?
>
> Right.
>
>  > What happens if I register multiple regions with the same cookie value?
>
> You get in trouble -- I need to fix things to reject duplicated cookies
> actually, because otherwise there's no way to unregister.
>
>  > Is a process responsible for guaranteeing that it umn_unregister()s
>  > everything before exiting, or will all pending registrations be
>  > cleaned up/unregistered/whatever when a process exits?
>
> The kernel cleans up of course to handle crashes etc.
>
>  - R.
>


-- 
Jeff Squyres
Cisco Systems


From swise at opengridcomputing.com  Wed May 27 12:08:52 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 27 May 2009 14:08:52 -0500
Subject: [ofa-general] [PATCH] RDMA/cxgb3: Report correct port state and mtu.
Message-ID: <20090527190852.16426.82898.stgit@build.ogc.int>

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c |   32 +++++++++++++++++++++++++--
 1 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 160ef48..e2a6321 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -40,6 +40,7 @@
 #include <linux/spinlock.h>
 #include <linux/ethtool.h>
 #include <linux/rtnetlink.h>
+#include <linux/inetdevice.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -1152,12 +1153,39 @@ static int iwch_query_device(struct ib_device *ibdev,
 static int iwch_query_port(struct ib_device *ibdev,
 			   u8 port, struct ib_port_attr *props)
 {
+	struct iwch_dev *dev;
+	struct net_device *netdev;
+	struct in_device *inetdev;
+
 	PDBG("%s ibdev %p\n", __func__, ibdev);
 
+	dev = to_iwch_dev(ibdev);
+	netdev = dev->rdev.port_info.lldevs[port-1];
+
 	memset(props, 0, sizeof(struct ib_port_attr));
 	props->max_mtu = IB_MTU_4096;
-	props->active_mtu = IB_MTU_2048;
-	props->state = IB_PORT_ACTIVE;
+	if (netdev->mtu >= 4096)
+		props->active_mtu = IB_MTU_4096;
+	else if (netdev->mtu >= 2048)
+		props->active_mtu = IB_MTU_2048;
+	else if (netdev->mtu >= 1024)
+		props->active_mtu = IB_MTU_1024;
+	else if (netdev->mtu >= 512)
+		props->active_mtu = IB_MTU_512;
+	else
+		props->active_mtu = IB_MTU_256;
+
+	if (!netif_carrier_ok(netdev))
+		props->state = IB_PORT_DOWN;
+	else {
+		inetdev = in_dev_get(netdev);
+		if (inetdev->ifa_list)
+			props->state = IB_PORT_ACTIVE;
+		else
+			props->state = IB_PORT_INIT;
+		in_dev_put(inetdev);
+	}
+
 	props->port_cap_flags =
 	    IB_PORT_CM_SUP |
 	    IB_PORT_SNMP_TUNNEL_SUP |


From swise at opengridcomputing.com  Wed May 27 12:35:42 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 27 May 2009 14:35:42 -0500
Subject: [ofa-general] [PATCH] RDMA/cxgb3: limit fastreg size based on T3
	limitations.
Message-ID: <20090527193542.24913.25649.stgit@build.ogc.int>

T3 firmware only supports one WR's worth of page list.  The driver
currently allows 2 WR's worth, which doesn't work for T3.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/cxio_wr.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h
index ff9be1a..32e3b14 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_wr.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h
@@ -176,7 +176,7 @@ struct t3_send_wr {
 	struct t3_sge sgl[T3_MAX_SGE];	/* 4+ */
 };
 
-#define T3_MAX_FASTREG_DEPTH 24
+#define T3_MAX_FASTREG_DEPTH 10
 #define T3_MAX_FASTREG_FRAG 10
 
 struct t3_fastreg_wr {


From valdes at anl.gov  Wed May 27 13:07:11 2009
From: valdes at anl.gov (John Valdes)
Date: Wed, 27 May 2009 15:07:11 -0500
Subject: [ofa-general] ib_mthca catastrophic error detected
Message-ID: <20090527200711.GA7605@starfish.mcs.anl.gov>

All,

I had posted last week about SRP problems we've been having after
upgrading from some old servers running RHEL 5.1 to new servers
running RHEL 5.3.  We're still trying to isolate the cause of the
problems, but one of the symptoms we're seeing is that occasionally
when under stress (well, if you can call doing a "dd" from /dev/zero
to the SRP target stress...), the ib_mthca driver will report a
"Catastrophic error":

  ib_mthca 0000:04:00.0: Catastrophic error detected: internal error
   host8: ib_srp: failed receive status 4
  ib_srp:  host8: add qp_in_err timer
   host8: ib_srp: failed receive status 5
  ib_mthca 0000:04:00.0:   buf[00]: 00000000
  ib_mthca 0000:04:00.0:   buf[01]: 00000000
  ib_mthca 0000:04:00.0:   buf[02]: 00000000
  ib_mthca 0000:04:00.0:   buf[03]: 00000000
  ib_mthca 0000:04:00.0:   buf[04]: 00000000
  ib_mthca 0000:04:00.0:   buf[05]: 00000000
  ib_mthca 0000:04:00.0:   buf[06]: 00000000
  ib_mthca 0000:04:00.0:   buf[07]: 00000000
  ib_mthca 0000:04:00.0:   buf[08]: 00000000
  ib_mthca 0000:04:00.0:   buf[09]: 00000000
  ib_mthca 0000:04:00.0:   buf[0a]: 00000000
  ib_mthca 0000:04:00.0:   buf[0b]: 00000000
  ib_mthca 0000:04:00.0:   buf[0c]: 00000000
  ib_mthca 0000:04:00.0:   buf[0d]: 00000000
  ib_mthca 0000:04:00.0:   buf[0e]: 00000000
  ib_mthca 0000:04:00.0:   buf[0f]: 00000000
   host8: ib_srp: srp_qp_in_err_timer called

Checking back through the list archives, the consensus seems to be
that these are due to card problems, usually with the firmware.  We've
never had this problem w/ the old servers under RHEL 5.1 w/ the
bundled OFED 1.2, but maybe the new servers and/or the RHEL 5.3 w/
OFED 1.3 is pushing the card harder and/or tickling a bug in the
firmware?  The cards are Cisco branded Mellanox Cougar Cub cards;
"tvflash -i" identifies them as:

  HCA #0: MT23108, Cougar Cub, revision A1
    Primary image is v3.5.917 build 3.2.0.149, with label 'HCA.CougarCub.A1'
    Secondary image is v3.3.005 build 3.2.0.67, with label 'HCA.CougarCub.A1'

    Vital Product Data
      Product Name: Cougar cub
      P/N: SFS-HCA-X2T7-A1
      E/C: Rev: A0
      S/N: CS0636X00286
      Freq/Power: PW=12W;PCI 66MHZ;PCI-X 133MHZ
      Date Code: 0636
      Checksum: Ok

Unfortunately, v3.5.917 seems to be the latest version of the firmware
listed on Cisco's website, at least that I could find.

Is anyone aware of any issues with this version of the firmware?

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory


From jphgross at gmail.com  Wed May 27 13:07:20 2009
From: jphgross at gmail.com (Jason Gross)
Date: Wed, 27 May 2009 13:07:20 -0700
Subject: [ofa-general] OFED 1.3.1 and 1.4.1 compatibility.
Message-ID: <899683130905271307w70650706qf97b3374735a8e3b@mail.gmail.com>

Hi All,
  We are planning to upgrade some of our nodes to OFED 1.4.1 while leaving a
few at 1.3.1. Are there any known incompatibilities between the two
versions, specifically for IPoIB or OpenSM? We have a custom
application that uses IB verbs, which I expect shouldn't be affected.

Thanks!
Jason

From rdreier at cisco.com  Wed May 27 14:36:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 14:36:01 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <adaocte76sx.fsf@cisco.com> (Roland Dreier's message of "Wed, 27
	May 2009 11:13:34 -0700")
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com>
	<adabpq6t2k8.fsf@cisco.com>
	<20090506214628.GM2590@obsidianresearch.com>
	<adatz3xsxo6.fsf@cisco.com>
	<20090506222638.GA16280@obsidianresearch.com>
	<adaprelsvnp.fsf@cisco.com>
	<20090507000231.GB16280@obsidianresearch.com>
	<adak54ssi0g.fsf@cisco.com>
	<20090507224806.GF16280@obsidianresearch.com>
	<adabppf8nln.fsf@cisco.com> <ada7i038nk9.fsf@cisco.com>
	<5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>
	<adaws8277wa.fsf@cisco.com> <adaocte76sx.fsf@cisco.com>
Message-ID: <adahbz66xfi.fsf@cisco.com>

Sigh... real version that returns EINVAL for an attempt to reuse a user
cookie (since otherwise unregister would get confused).  Previous
posting was the old patch, sorry.

===

ummunot: Userspace support for MMU notifications

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.
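
The three steps above can be sketched from the userspace side as
follows.  This is only an illustration under stated assumptions: the
struct and constants are local copies of the definitions proposed in
<linux/ummunot.h> (repeated so the sketch builds without the patched
headers), and ummunot_process_events(), ummunot_events_pending() and
count_cb() are hypothetical helper names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the proposed <linux/ummunot.h> definitions. */
enum { UMMUNOT_EVENT_TYPE_INVAL = 0, UMMUNOT_EVENT_TYPE_LAST = 1 };
enum { UMMUNOT_EVENT_FLAG_HINT = 1 << 0 };

struct ummunot_event {
	uint32_t type;
	uint32_t flags;
	uint64_t hint_start;
	uint64_t hint_end;
	uint64_t user_cookie_counter;
};

/* Hypothetical callback: invoked once per invalidated registration. */
typedef void (*inval_cb)(uint64_t cookie, int has_hint,
			 uint64_t hint_start, uint64_t hint_end, void *ctx);

/*
 * Walk the n events returned by one read() on the ummunot fd.
 * Returns 1 if a LAST event was seen (the dirty list is drained and
 * *generation holds the kernel counter at that point), or 0 if the
 * caller should read() again.
 */
static int ummunot_process_events(const struct ummunot_event *ev, size_t n,
				  inval_cb cb, void *ctx, uint64_t *generation)
{
	size_t i;

	assert(ev != NULL || n == 0);

	for (i = 0; i < n; ++i) {
		if (ev[i].type == UMMUNOT_EVENT_TYPE_LAST) {
			*generation = ev[i].user_cookie_counter;
			return 1;
		}
		if (ev[i].type == UMMUNOT_EVENT_TYPE_INVAL && cb)
			cb(ev[i].user_cookie_counter,
			   !!(ev[i].flags & UMMUNOT_EVENT_FLAG_HINT),
			   ev[i].hint_start, ev[i].hint_end, ctx);
	}
	return 0;
}

/* Fast path: compare the mmap()ed counter with the last generation seen. */
static int ummunot_events_pending(const volatile uint64_t *counter,
				  uint64_t last_generation)
{
	return *counter != last_generation;
}

/* Example callback: just counts invalidations (ctx points to an int). */
static void count_cb(uint64_t cookie, int has_hint,
		     uint64_t hint_start, uint64_t hint_end, void *ctx)
{
	(void) cookie; (void) has_hint; (void) hint_start; (void) hint_end;
	++*(int *) ctx;
}
```

A registration cache would presumably call ummunot_events_pending() on
every lookup and only fall back to read() plus
ummunot_process_events() when the counter has moved, dropping the
cached registration for each cookie reported.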

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  469 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunot.h |   85 +++++++++
 4 files changed, 567 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..91fe068 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunot (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changes.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..dcbcd7c 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_VR41XX)	+= vr41xx_giu.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 0000000..ebfd038
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,469 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunot.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+	UMMUNOT_FLAG_DIRTY	= 1,
+	UMMUNOT_FLAG_HINT	= 2,
+};
+
+struct ummunot_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+struct ummunot_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	dirty_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+};
+
+static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn)
+{
+	return container_of(mn, struct ummunot_file, mmu_notifier);
+}
+
+static void ummunot_handle_not(struct mmu_notifier *mn,
+			       unsigned long start, unsigned long end)
+{
+	struct ummunot_file *priv = to_ummunot_file(mn);
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->start >= end)
+			break;
+
+		/* reg->start < end is known here, so this suffices for overlap */
+		if (reg->end > start) {
+			hit = 1;
+
+			if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+				list_add_tail(&reg->list, &priv->dirty_list);
+
+			if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+				clear_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+			} else {
+				set_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+				reg->hint_start = start;
+				reg->hint_end   = end;
+			}
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunot_inval_page(struct mmu_notifier *mn,
+			       struct mm_struct *mm,
+			       unsigned long addr)
+{
+	ummunot_handle_not(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunot_inval_range_start(struct mmu_notifier *mn,
+				      struct mm_struct *mm,
+				      unsigned long start, unsigned long end)
+{
+	ummunot_handle_not(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = {
+	.invalidate_page	= ummunot_inval_page,
+	.invalidate_range_start	= ummunot_inval_range_start,
+};
+
+static int ummunot_open(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->dirty_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+
+	priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunot_close(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	while ((n = rb_first(&priv->reg_tree))) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+		rb_erase(n, &priv->reg_tree);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static ssize_t ummunot_read(struct file *filp, char __user *buf,
+			    size_t count, loff_t *pos)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct ummunot_reg *reg;
+	ssize_t ret;
+	struct ummunot_event *events;
+	int max;
+	int n;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (list_empty(&priv->dirty_list)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     !list_empty(&priv->dirty_list))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min(PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->dirty_list)) {
+			events[n].type = UMMUNOT_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			break;
+		}
+
+		reg = list_first_entry(&priv->dirty_list, struct ummunot_reg,
+				       list);
+
+		events[n].type = UMMUNOT_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+			events[n].flags		= UMMUNOT_EVENT_FLAG_HINT;
+			events[n].hint_start	= reg->hint_start;
+			events[n].hint_end	= reg->hint_end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM);
+}
+
+static long ummunot_register_region(struct ummunot_file *priv,
+				    struct ummunot_register_ioctl __user *arg)
+{
+	struct ummunot_register_ioctl parm;
+	struct ummunot_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn;
+	int ret = 0;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+	    return -EFAULT;
+
+	if (parm.intf_version != UMMUNOT_INTF_VERSION)
+		return -EINVAL;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
+		treg = rb_entry(pn, struct ummunot_reg, node);
+		if (treg->user_cookie == parm.user_cookie) {
+			kfree(reg);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	pn = NULL;
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunot_reg, node);
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+out:
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_unregister_region(struct ummunot_file *priv,
+				      __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+			    list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_ioctl(struct file *filp, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct ummunot_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOT_REGISTER_REGION:
+		return ummunot_register_region(priv, argp);
+	case UMMUNOT_UNREGISTER_REGION:
+		return ummunot_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunot_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunot_vm_ops = {
+	.fault		= ummunot_fault,
+};
+
+static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+	    vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunot_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunot_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunot_open,
+	.release	= ummunot_close,
+	.read		= ummunot_read,
+	.poll		= ummunot_poll,
+	.unlocked_ioctl	= ummunot_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunot_ioctl,
+#endif
+	.mmap		= ummunot_mmap,
+	.fasync		= ummunot_fasync,
+};
+
+static struct miscdevice ummunot_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunot",
+	.fops	= &ummunot_fops,
+};
+
+static int __init ummunot_init(void)
+{
+	return misc_register(&ummunot_misc);
+}
+
+static void __exit ummunot_cleanup(void)
+{
+	misc_deregister(&ummunot_misc);
+}
+
+module_init(ummunot_init);
+module_exit(ummunot_cleanup);
diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h
new file mode 100644
index 0000000..e1abd89
--- /dev/null
+++ b/include/linux/ummunot.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOT_H
+#define _LINUX_UMMUNOT_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define UMMUNOT_INTF_VERSION		1
+
+enum {
+	UMMUNOT_EVENT_TYPE_INVAL	= 0,
+	UMMUNOT_EVENT_TYPE_LAST		= 1,
+};
+
+enum {
+	UMMUNOT_EVENT_FLAG_HINT		= 1 << 0,
+};
+
+/*
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * should be ignored)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+struct ummunot_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+struct ummunot_register_ioctl {
+	__u32	intf_version;	/* in */
+	__u32	reserved1;
+	__u64	start;		/* in */
+	__u64	end;		/* in */
+	__u64	user_cookie;	/* in */
+};
+
+#define UMMUNOT_MAGIC			'U'
+
+#define UMMUNOT_REGISTER_REGION		_IOWR(UMMUNOT_MAGIC, 1, \
+					      struct ummunot_register_ioctl)
+#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 2, __u64)
+
+#endif /* _LINUX_UMMUNOT_H */
-- 
1.6.0.4


From rdreier at cisco.com  Wed May 27 14:39:10 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 14:39:10 -0700
Subject: [ofa-general] Re: [PATCH 2/2] ib_mthca: Use module parameter for
	number of MTTs per segment
In-Reply-To: <20090518085551.GA16106@mtls03> (Eli Cohen's message of "Mon, 18
	May 2009 11:55:51 +0300")
References: <20090518085551.GA16106@mtls03>
Message-ID: <adad49u6xa9.fsf@cisco.com>

Sigh... unfortunate to add a tunable that people have to mess with
rather than just making things work automatically somehow.  Anyway,
applied these.


From rdreier at cisco.com  Wed May 27 14:42:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 14:42:53 -0700
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Report correct port state and
	mtu.
In-Reply-To: <20090527190852.16426.82898.stgit@build.ogc.int> (Steve Wise's
	message of "Wed, 27 May 2009 14:08:52 -0500")
References: <20090527190852.16426.82898.stgit@build.ogc.int>
Message-ID: <ada4ov66x42.fsf@cisco.com>

OK, applied.  Would be nice if we had a better way to report MTU, but whatever...
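
For reference, the MTU mapping in the patch can be pulled out into a
standalone function, which makes the rounding visible: a common
Ethernet MTU of 1500 can only be reported as 1024.  This is a sketch,
not driver code; the enum is a local copy of the kernel's IB MTU
encoding, and netdev_mtu_to_ib_mtu() and ib_mtu_to_bytes() are
hypothetical helper names.

```c
#include <assert.h>

/* Local copy of the IB MTU encoding used by the verbs layer. */
enum ib_mtu {
	IB_MTU_256  = 1,
	IB_MTU_512  = 2,
	IB_MTU_1024 = 3,
	IB_MTU_2048 = 4,
	IB_MTU_4096 = 5,
};

/* Round a netdev MTU in bytes down to the largest IB MTU that fits. */
static enum ib_mtu netdev_mtu_to_ib_mtu(unsigned int mtu)
{
	if (mtu >= 4096)
		return IB_MTU_4096;
	else if (mtu >= 2048)
		return IB_MTU_2048;
	else if (mtu >= 1024)
		return IB_MTU_1024;
	else if (mtu >= 512)
		return IB_MTU_512;
	else
		return IB_MTU_256;	/* smallest value the enum can express */
}

/* Convert the enum back to bytes: 256 << (value - 1). */
static unsigned int ib_mtu_to_bytes(enum ib_mtu mtu)
{
	assert(mtu >= IB_MTU_256 && mtu <= IB_MTU_4096);
	return 128u << mtu;
}
```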


From rdreier at cisco.com  Wed May 27 14:42:57 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 27 May 2009 14:42:57 -0700
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: limit fastreg size based on T3
	limitations.
In-Reply-To: <20090527193542.24913.25649.stgit@build.ogc.int> (Steve Wise's
	message of "Wed, 27 May 2009 14:35:42 -0500")
References: <20090527193542.24913.25649.stgit@build.ogc.int>
Message-ID: <adazlcy5iji.fsf@cisco.com>

thanks, applied.


From akepner at sgi.com  Wed May 27 16:27:21 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 27 May 2009 16:27:21 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A18D704.1020000@voltaire.com>
References: <20090519215505.GN6837@sgi.com> <adaws8bbs55.fsf@cisco.com>
	<20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com>
	<20090521193910.GX6837@sgi.com> <4A18D704.1020000@voltaire.com>
Message-ID: <20090527232721.GD5819@sgi.com>

On Sun, May 24, 2009 at 08:11:32AM +0300, Or Gerlitz wrote:

> ... how come a neigh cleanup 
> callback is invoked when someone out there has a ref on the neighbour? 

Don't know if you saw all of this thread, but in:

http://lists.openfabrics.org/pipermail/general/2009-May/059730.html

I mentioned a race between a tx completion (with an error) and 
ipoib_neigh_cleanup(), which could happen even if the callback 
is invoked at the correct time (as far as the neighbour code is 
concerned).

> ...
> also I'd like to clarify with you if the rest of this thread applies 
> only to 2.6.16 and possibly more old kernels, or to the current mainline 
> bits?
> 

Although I've only seen the bug with 2.6.16 vintage kernels (and
maybe only once), I think it's still possible in the latest code
via the mechanism I mentioned above (and maybe other ways, too).

The best idea I've got so far is to use a new set of locks to 
consistently read/write the struct ipoib_neigh pointer that's 
stashed away in the neighbour structures. 

-- 
Arthur


From abenjamin at sgi.com  Wed May 27 18:41:28 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 27 May 2009 18:41:28 -0700
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <adazld69f4e.fsf@cisco.com>
References: <4A0B560B.3090606@sgi.com>
	<adad4aaf3dk.fsf@cisco.com>	<4A136DF0.7000402@sgi.com>
	<adaab58ctd4.fsf@cisco.com>	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>	<adaskizbimo.fsf@cisco.com>	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>	<ada8wkqayfb.fsf@cisco.com>	<1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
	<adazld69f4e.fsf@cisco.com>
Message-ID: <4A1DEBC8.9050207@sgi.com>

Roland Dreier wrote:
>  > Don't we need both /sys/devices/... and /proc/interrupts?
>
> Not sure what you mean.  If we put msi-x info under /sys, then you can
> figure out which interrupts belong to a given HCA by following the
> device link from /sys/class/infiniband.  Similarly if /proc/interrupts
> gives the PCI device, then you have the same ability.  So either way
> works as far as I can tell.

Linux is supposed to move away from procfs to sysfs for this type of
device-related info. However, /proc/interrupts is still present in the
latest distro releases (for example, SLES11), and OFED needs to provide
support for this in procfs until /proc/interrupts support is removed
from the kernel.

Also, the OFED implementation needs to be consistent with other Ethernet
device drivers present in a system, as OFED includes both InfiniBand
and Ethernet functions (for example, ConnectX).

I wanted to summarize what we had discussed so far.

1) Enhance sysfs to include the info found in /proc/interrupts:

I have not seen full sysfs support for Ethernet devices.
I have seen IRQ number info but no interrupt counters on a per-CPU basis.
Do we know when full support for Ethernet devices will be available
in sysfs? We can enhance OFED at the same time Ethernet support is made
available in the kernel.

2) Use PCI ID in /proc/interrupts:

I have not seen Ethernet devices follow this convention.
Also, OFED tools currently use the device name convention mthcaX, mlx4_X,
etc. Therefore, this approach is not preferable, for consistency reasons.

As an alternative to #2,

3) We can add dev_alloc_name() functionality to mlx4_core, similar to
alloc_name() present in ib_core. This is consistent with other Ethernet
device driver implementations that use the dev_alloc_name() function
present in the kernel. (Please see .../net/core/dev.c)

Any objection to going with #3?

Regards,
Benjamin


From He.Huang at Sun.COM  Wed May 27 21:33:19 2009
From: He.Huang at Sun.COM (Isaac Huang)
Date: Thu, 28 May 2009 00:33:19 -0400
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT	and
	the TimeWait state
In-Reply-To: <15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com>
References: <20090526200346.GQ4239@sun.com>
	<15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com>
Message-ID: <20090528043319.GW4494@sun.com>

On Tue, May 26, 2009 at 11:22:24PM +0300, Or Gerlitz wrote:
>    On Tue, May 26, 2009 at 11:03 PM, Isaac Huang <He.Huang at sun.com>
>    wrote:
> 
>      If rdma_destroy_qp is called on a QP before it exits the TimeWait
>      state (i.e. after RDMA_CM_EVENT_DISCONNECTED but before
>      RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent
>      rdma_create_qp would reuse the same QP while it's still in TimeWait?
> 
>    YES - as rdma_destroy/create_qp are basically  wrappers to
>    ib_destroy/create_qp and the latter two are not aware by any means to
>    the QP state from the CM point of view.

Thanks, they should probably be called CM ID states instead of QP
states.

Isaac


From He.Huang at Sun.COM  Wed May 27 21:43:35 2009
From: He.Huang at Sun.COM (Isaac Huang)
Date: Thu, 28 May 2009 00:43:35 -0400
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT	and
	the TimeWait state
In-Reply-To: <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com>
References: <20090526200346.GQ4239@sun.com>
	<93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com>
Message-ID: <20090528044335.GX4494@sun.com>

On Tue, May 26, 2009 at 01:43:25PM -0700, Sean Hefty wrote:
> >In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP
> >could enter the TimeWait state without having entered the Established
> >state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT
> >happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a
> >RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to cleanup some
> >resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side.
> 
> Yes, it's possible to enter timewait without going through established.  I'd
> have to walk through the code at this point to identify all of the cases.

Thanks, I followed cm_enter_timewait() call sites and found that it
could be entered via several paths without going through IB_CM_ESTABLISHED.
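
To illustrate the cleanup concern on the passive side, here is a minimal
sketch of an event handler that releases per-connection resources on
timewait exit whether or not the connection was ever established. The enum
values only mirror the RDMA_CM_EVENT_* names; the types and the function are
hypothetical stand-ins for this example and are not the librdmacm API.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the RDMA_CM_EVENT_* values discussed above. */
enum sketch_cm_event {
    SKETCH_CONNECT_REQUEST,
    SKETCH_ESTABLISHED,
    SKETCH_DISCONNECTED,
    SKETCH_TIMEWAIT_EXIT
};

/* Hypothetical per-connection state kept by the ULP. */
struct sketch_conn {
    bool resources_allocated;   /* set when the request is accepted */
    bool established;           /* set only if ESTABLISHED was seen */
};

static void sketch_handle_event(struct sketch_conn *c, enum sketch_cm_event ev)
{
    switch (ev) {
    case SKETCH_CONNECT_REQUEST:
        /* Passive side allocates its resources here. */
        c->resources_allocated = true;
        break;
    case SKETCH_ESTABLISHED:
        c->established = true;
        break;
    case SKETCH_DISCONNECTED:
        c->established = false;
        break;
    case SKETCH_TIMEWAIT_EXIT:
        /*
         * Do not assume ESTABLISHED was seen: free whatever was
         * allocated at CONNECT_REQUEST time.
         */
        c->resources_allocated = false;
        c->established = false;
        break;
    }
}
```

The point of the design is that the cleanup keys off what was allocated, not
off whether the established event was delivered.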

> Note that a lot (most?) connections between QPs are established out of band
> using TCP, and these are not tracked by the CM or go through any sort of
> timewait before potentially being reused.

I don't quite understand this. Could you please point me to places
(code, IB spec, so on) where I could poke around?

Thanks,
Isaac


From sean.hefty at intel.com  Wed May 27 22:50:22 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 27 May 2009 22:50:22 -0700
Subject: [ofa-general] two questions about
	RDMA_CM_EVENT_TIMEWAIT_EXIT	and	the TimeWait state
In-Reply-To: <20090528044335.GX4494@sun.com>
References: <20090526200346.GQ4239@sun.com>	<93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com>
	<20090528044335.GX4494@sun.com>
Message-ID: <51A9484D90DE44DFA7A3F6F31A2E6D94@amr.corp.intel.com>

>> Note that a lot (most?) connections between QPs are established out of band
>> using TCP, and these are not tracked by the CM or go through any sort of
>> timewait before potentially being reused.
>
>I don't quite understand this. Could you please point me to places
>(code, IB spec, so on) where I could poke around?

MPIs typically connect QPs by connecting over sockets and exchanging the QP
information that way.  The QPs are then modified directly using a combination of
locally read and hard-coded values.  The libibverb examples along with the
perftest programs can connect QPs in this fashion.
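
As a rough illustration of this kind of out-of-band exchange, the sketch
below packs the values a peer would need into a short text message that
could be written to a connected TCP socket and parsed on the other side. The
struct, the function names and the "lid:qpn:psn" hex layout are assumptions
made for this example, not an API of libibverbs. The QP state transitions
themselves (done with ibv_modify_qp() in real code) are left out so that the
block needs no InfiniBand hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the values exchanged over the socket. */
struct sketch_qp_info {
    uint16_t lid;   /* port LID */
    uint32_t qpn;   /* QP number, 24 bits */
    uint32_t psn;   /* initial packet sequence number, 24 bits */
};

/* Size of one message including the terminating NUL. */
#define SKETCH_QP_MSG_LEN sizeof("0000:000000:000000")

/* Format info as "llll:qqqqqq:pppppp" (hex). Returns 0 on success. */
static int sketch_qp_info_format(const struct sketch_qp_info *info,
                                 char *msg, size_t len)
{
    int n;

    if (info->qpn > 0xffffff || info->psn > 0xffffff)
        return -1;

    n = snprintf(msg, len, "%04x:%06x:%06x", (unsigned) info->lid,
                 (unsigned) info->qpn, (unsigned) info->psn);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Parse a message produced by sketch_qp_info_format(). Returns 0 on success. */
static int sketch_qp_info_parse(const char *msg, struct sketch_qp_info *info)
{
    unsigned lid, qpn, psn;

    if (sscanf(msg, "%x:%x:%x", &lid, &qpn, &psn) != 3)
        return -1;
    if (lid > 0xffff || qpn > 0xffffff || psn > 0xffffff)
        return -1;

    info->lid = (uint16_t) lid;
    info->qpn = qpn;
    info->psn = psn;
    return 0;
}
```

Each side would send its own message, parse the peer's, and then move its QP
through the usual states using the received values. Nothing in this path
involves the CM, which is why no timewait handling applies to it.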


From pashash at gmail.com  Thu May 28 00:09:42 2009
From: pashash at gmail.com (Pavel Shamis (Pasha))
Date: Thu, 28 May 2009 10:09:42 +0300
Subject: [ofa-general] Memory registration redux
In-Reply-To: <C7598186-0CFA-40F8-A94C-9D4BA0E6E85D@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><20090506214628.GM2590@obsidianresearch.com><adatz3xsxo6.fsf@cisco.com><20090506222638.GA16280@obsidianresearch.com><adaprelsvnp.fsf@cisco.com><20090507000231.GB16280@obsidianresearch.com><adak54ssi0g.fsf@cisco.com><20090507224806.GF16280@obsidianresearch.com><adabppf8nln.fsf@cisco.com>	<ada7i038nk9.fsf@cisco.com><5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>	<adaws8277wa.fsf@cisco.com>
	<C7598186-0CFA-40F8-A94C-9D4BA0E6E85D@cisco.com>
Message-ID: <4A1E38B6.70305@dev.mellanox.co.il>

Sounds good to me.

Jeff Squyres wrote:
> Other MPI implementors -- what do you think of this scheme?
>
>
> On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:
>
>>
>>  > > /*
>>  > >  * If type field is INVAL, then user_cookie_counter holds the
>>  > >  * user_cookie for the region being reported; if the HINT flag is set
>>  > >  * then hint_start/hint_end hold the start and end of the mapping that
>>  > >  * was invalidated.  (If HINT is not set, then multiple events
>>  > >  * invalidated parts of the registered range and hint_start/hint_end
>>  > >  * should be ignored)
>>
>>  > I don't quite grok this.  Is the intent that HINT will only be set if
>>  > an *entire* hint_start/hint_end range is invalidated by a single
>>  > event?  I.e., if only part of the hint_start/hint_end range is
>>  > invalidated, you'll get the cookie back, but not what part of the
>>  > range is invalid (because assumedly the entire IBV registration is now
>>  > invalid anyway)?
>>
>> Basically, I just keep one hint_start/hint_end.  If multiple events hit
>> the same registration then I just give up and don't give you a hint.
>>
>>  > >  * If type is LAST, then the read operation has emptied the list of
>>  > >  * invalidated regions, and user_cookie_counter holds the value of the
>>  > >  * kernel's generation counter when the empty list occurred.  The
>>  > >  * other fields are not filled in for this event.
>>
>>  > Just to be clear -- we're supposed to keep reading events until we get
>>  > a LAST event?
>>
>> Yes, that's probably the sanest use case.
>>
>>  > 1. Will it increase by 1 each time a page (or set of pages?) is
>>  > removed from a user process?
>>
>> As it stands it increases by 1 every time there is an MMU notification,
>> even if that notification hits multiple registrations.  It wouldn't be
>> hard to change that to count the number of events generated if that
>> works better.
>>
>>  > 2. Does it change if pages are *added* to a user process?  I.e., does
>>  > the counter indicate *removals* or *changes* to the user process page
>>  > table?
>>
>> No, additions don't trigger any MMU notification -- that's inherent in
>> the design of the MMU notifiers stuff.  The idea is that you have a
>> "secondary MMU" and MMU notifications are the equivalent of TLB
>> shootdowns; the secondary MMU is responsible for populating itself on
>> faults etc.
>>
>>  > Is the *unm_counter value guaranteed to have been changed by the time
>>  > munmap() returns?
>>
>> Yes.
>>
>>  > Did you pick [2] here simply because you're only expecting an INVAL
>>  > and a LAST event in this specific example?  I'm assuming that we
>>  > should normally loop over reading until we get LAST, correct?
>>
>> Right.
>>
>>  > What happens if I register multiple regions with the same cookie value?
>>
>> You get in trouble -- I need to fix things to reject duplicated cookies
>> actually, because otherwise there's no way to unregister.
>>
>>  > Is a process responsible for guaranteeing that it umn_unregister()s
>>  > everything before exiting, or will all pending registrations be
>>  > cleaned up/unregistered/whatever when a process exits?
>>
>> The kernel cleans up of course to handle crashes etc.
>>
>>  - R.
>>
>
>
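
For illustration, the "keep reading until LAST" usage discussed above might
look roughly like the sketch below. The event layout is only inferred from
the quoted comment (type, HINT flag, hint_start/hint_end,
user_cookie_counter). The struct, the constants and the function are
hypothetical, and an in-memory array stands in for the events a real read()
on the proposed interface would return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event layout inferred from the quoted comment. */
enum sketch_event_type { SKETCH_EVENT_INVAL, SKETCH_EVENT_LAST };
#define SKETCH_EVENT_FLAG_HINT 0x1u

struct sketch_event {
    enum sketch_event_type type;
    uint32_t flags;
    uint64_t hint_start;
    uint64_t hint_end;
    uint64_t user_cookie_counter;   /* cookie for INVAL, counter for LAST */
};

/* One invalidated registration as seen by the caller. */
struct sketch_inval {
    uint64_t cookie;
    int      have_hint;             /* 0: ignore start/end */
    uint64_t start;
    uint64_t end;
};

/*
 * Walk events until LAST. Returns 0 and stores the generation counter
 * when LAST is seen; returns -1 if the events ran out first (the caller
 * should read more) or if the output array is full.
 */
static int sketch_drain_events(const struct sketch_event *ev, size_t n_ev,
                               struct sketch_inval *out, size_t max_out,
                               size_t *n_out, uint64_t *generation)
{
    size_t i;

    *n_out = 0;
    for (i = 0; i < n_ev; i++) {
        int hint;

        if (ev[i].type == SKETCH_EVENT_LAST) {
            *generation = ev[i].user_cookie_counter;
            return 0;
        }
        if (*n_out == max_out)
            return -1;

        hint = (ev[i].flags & SKETCH_EVENT_FLAG_HINT) != 0;
        out[*n_out].cookie = ev[i].user_cookie_counter;
        out[*n_out].have_hint = hint;
        out[*n_out].start = hint ? ev[i].hint_start : 0;
        out[*n_out].end = hint ? ev[i].hint_end : 0;
        (*n_out)++;
    }
    return -1;
}
```

A caller would treat every returned cookie as an invalid registration and use
the hint range only when have_hint is set, as described in the quoted reply.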


From sebastien.dugue at bull.net  Thu May 28 01:20:59 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Thu, 28 May 2009 10:20:59 +0200
Subject: [ofa-general] [PATCH 0/3] V2 - libmthca libmlx4 - Optimize memory
 allocation of QP buffers
Message-ID: <20090528102059.2fd85540@frecb007965>


  Hi,

  here is a re-spin of the QP buffers memory allocation optimization patches in
which QP buffers are allocated using mmap() regardless of the page size.


Changes V1 -> V2:
----------------

  - Use mmap whatever the page size, not only with 64K pages.


  libmthca and libmlx4 allocate QP buffers using posix_memalign(), which
results in big memory wastage on architectures with 64K pages.

  Replacing posix_memalign() with mmap() fixes this (see the patches
themselves for a fuller description).
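
  As a back-of-the-envelope check of the overhead, the following sketch
computes how many pages one buffer ends up consuming with each approach. It
uses the estimate given in the patch descriptions: (requested size rounded to
the page size) + (page size) + (malloc overhead), rounded to the page size.
The function names and the 16-byte malloc overhead used in the example are
assumptions for illustration only.

```c
#include <stddef.h>

/* Round val up to a multiple of align (align must be a power of two). */
static size_t sketch_round_up(size_t val, size_t align)
{
    return (val + align - 1) & ~(align - 1);
}

/*
 * Estimated pages consumed when posix_memalign() falls back to mmap:
 * the rounded request, plus one page for alignment slack, plus the
 * allocator's bookkeeping, all rounded up to whole pages.
 */
static size_t sketch_memalign_pages(size_t req, size_t page_size,
                                    size_t malloc_overhead)
{
    size_t total = sketch_round_up(req, page_size) + page_size +
                   malloc_overhead;

    return sketch_round_up(total, page_size) / page_size;
}

/* Pages consumed when the buffer is mapped directly with mmap(). */
static size_t sketch_mmap_pages(size_t req, size_t page_size)
{
    return sketch_round_up(req, page_size) / page_size;
}
```

  With a 64K page and a one-page request, sketch_memalign_pages(65536, 65536,
16) gives 3 pages while sketch_mmap_pages(65536, 65536) gives 1, which is in
line with the roughly 3:1 ratio of mmaped pages in the tables below.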

  Now, for some numbers: a micro benchmark I wrote shows the heap usage and
the number of mmaped pages used with posix_memalign() and mmap() respectively,
for 1000 to 8000 QPs in steps of 1000.

  MTHCA
               posix_memalign			   mmap
  QP	   heap		mmaped(pages)	   heap		mmaped(pages)
 1000	   838736	    2988	   576512	   1000
 2000	  1751216	    5973	  1161264	   2000
 3000	  2598144	    8961	  1746016	   3000
 4000	  3510656	   11946	  2330704	   4000
 5000	  4357616	   14934	  2915440	   5000
 6000	  5270080	   17919	  3500176	   6000
 7000	  6117056	   20907	  4084912	   7000
 8000	  6963968	   23895	  4669632	   8000

  MLX4
               posix_memalign			   mmap
  QP	   heap		mmaped(pages)	   heap		mmaped(pages)
 1000	  1469424	    2982	  1010544	   1003
 2000	  2994048	    5958	  2010752	   2003
 3000	  4518672	    8934	  3010960	   3003
 4000	  5969520	   11913	  4002960	   4003
 5000	  7494176	   14889	  5003168	   5003
 6000	  8953248	   17868	  6003376	   6003
 7000	 10477856	   20844	  7003584	   7003
 8000	 12002496	   23820	  8003792	   8003


  This patchset consists of 3 patches:

  1. Optimize memory allocation of QP buffers for libmthca
  2. Optimize memory allocation of QP buffers for libmlx4
  3. Refresh the patches in 'fixes/' for libmlx4 so that they still build
     after the previous patch has been applied.

  Sebastien Dugue


From sebastien.dugue at bull.net  Thu May 28 01:24:26 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Thu, 28 May 2009 10:24:26 +0200
Subject: [ofa-general] [PATCH 3/3] V2 - libmlx4 - Fix fixes after QP buffers
 alloc optimization patch to allow build
In-Reply-To: <20090528102059.2fd85540@frecb007965>
References: <20090528102059.2fd85540@frecb007965>
Message-ID: <20090528102426.2ea0d35e@frecb007965>


  The patches in 'fixes/' need to be refreshed after the previous patch in
order to build properly.

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>
---
 fixes/lim_qp_resources.patch     |   20 ++++-------
 fixes/resize_cq_owner_bit.patch  |    4 +--
 fixes/userspace_dev_lims.patch   |   12 ++----
 fixes/xrc_consolidated_v2.patch  |   68 ++++++++++++++------------------------
 fixes/xrc_fix_close_domain.patch |    8 ++---
 fixes/xrc_rcv_qp_v2.patch        |   12 ++-----
 6 files changed, 44 insertions(+), 80 deletions(-)

diff --git a/fixes/lim_qp_resources.patch b/fixes/lim_qp_resources.patch
index 1f89256..54cc63e 100644
--- a/fixes/lim_qp_resources.patch
+++ b/fixes/lim_qp_resources.patch
@@ -7,11 +7,9 @@ qp creation also lie within the reported device limits.
     
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/qp.c
-===================================================================
---- libmlx4.orig/src/qp.c	2008-06-04 08:24:45.000000000 +0300
-+++ libmlx4/src/qp.c	2008-06-04 08:24:49.000000000 +0300
-@@ -619,6 +619,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
+--- a/src/qp.c
++++ b/src/qp.c
+@@ -622,6 +622,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  		       enum ibv_qp_type type)
  {
  	int wqe_size;
@@ -19,7 +17,7 @@ Index: libmlx4/src/qp.c
  
  	wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg);
  	switch (type) {
-@@ -636,8 +637,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
+@@ -639,8 +640,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  	}
  
  	qp->sq.max_gs	     = wqe_size / sizeof (struct mlx4_wqe_data_seg);
@@ -31,10 +29,8 @@ Index: libmlx4/src/qp.c
  	cap->max_send_wr     = qp->sq.max_post;
  
  	/*
-Index: libmlx4/src/verbs.c
-===================================================================
---- libmlx4.orig/src/verbs.c	2008-06-04 08:24:45.000000000 +0300
-+++ libmlx4/src/verbs.c	2008-06-04 08:24:49.000000000 +0300
+--- a/src/verbs.c
++++ b/src/verbs.c
 @@ -390,12 +390,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	struct ibv_create_qp_resp resp;
  	struct mlx4_qp		 *qp;
@@ -54,9 +50,9 @@ Index: libmlx4/src/verbs.c
  	    attr->cap.max_inline_data > 1024)
  		return NULL;
  
-@@ -461,8 +463,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
- 	if (ret)
+@@ -464,8 +466,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  		goto err_destroy;
+ 	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
  
 -	qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr;
 +	qp->rq.wqe_cnt = attr->cap.max_recv_wr;
diff --git a/fixes/resize_cq_owner_bit.patch b/fixes/resize_cq_owner_bit.patch
index 6557027..0a5b564 100644
--- a/fixes/resize_cq_owner_bit.patch
+++ b/fixes/resize_cq_owner_bit.patch
@@ -3,11 +3,9 @@ for the target buffer (and not left as it was in the source buffer).
 
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-diff --git a/src/cq.c b/src/cq.c
-index 68e16e9..8226b6b 100644
 --- a/src/cq.c
 +++ b/src/cq.c
-@@ -455,6 +455,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx4_cq *cq, void *buf, int old_cqe)
+@@ -478,6 +478,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx
  	cqe = get_cqe(cq, (i & old_cqe));
  
  	while ((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) != MLX4_CQE_OPCODE_RESIZE) {
diff --git a/fixes/userspace_dev_lims.patch b/fixes/userspace_dev_lims.patch
index 07cf638..80d4d14 100644
--- a/fixes/userspace_dev_lims.patch
+++ b/fixes/userspace_dev_lims.patch
@@ -9,10 +9,8 @@ preferable to breaking the ABI.
     
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/mlx4.c
-===================================================================
---- libmlx4.orig/src/mlx4.c	2008-06-03 15:45:18.000000000 +0300
-+++ libmlx4/src/mlx4.c	2008-06-04 08:24:10.000000000 +0300
+--- a/src/mlx4.c
++++ b/src/mlx4.c
 @@ -104,6 +104,7 @@ static struct ibv_context *mlx4_alloc_co
  	struct ibv_get_context		cmd;
  	struct mlx4_alloc_ucontext_resp resp;
@@ -42,10 +40,8 @@ Index: libmlx4/src/mlx4.c
  err_free:
  	free(context);
  	return NULL;
-Index: libmlx4/src/mlx4.h
-===================================================================
---- libmlx4.orig/src/mlx4.h	2008-06-03 15:45:18.000000000 +0300
-+++ libmlx4/src/mlx4.h	2008-06-04 08:24:10.000000000 +0300
+--- a/src/mlx4.h
++++ b/src/mlx4.h
 @@ -83,6 +83,20 @@
  
  #define PFX		"mlx4: "
diff --git a/fixes/xrc_consolidated_v2.patch b/fixes/xrc_consolidated_v2.patch
index 6fbd0a9..78a4f6c 100644
--- a/fixes/xrc_consolidated_v2.patch
+++ b/fixes/xrc_consolidated_v2.patch
@@ -18,8 +18,6 @@ V2:
 2. Changed xrc_ops to more ops
 3. Check for xrc verbs in ibv_more_ops via AC_CHECK_MEMBER
 
-diff --git a/configure.in b/configure.in
-index 25f27f7..46a3a64 100644
 --- a/configure.in
 +++ b/configure.in
 @@ -42,6 +42,12 @@ AC_CHECK_HEADER(valgrind/memcheck.h,
@@ -35,11 +33,9 @@ index 25f27f7..46a3a64 100644
  
  dnl Checks for library functions
  AC_CHECK_FUNC(ibv_read_sysfs_file, [],
-diff --git a/src/cq.c b/src/cq.c
-index 68e16e9..c598b87 100644
 --- a/src/cq.c
 +++ b/src/cq.c
-@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq 
  {
  	struct mlx4_wq *wq;
  	struct mlx4_cqe *cqe;
@@ -50,7 +46,7 @@ index 68e16e9..c598b87 100644
  	uint32_t g_mlpath_rqpn;
  	uint16_t wqe_index;
  	int is_error;
-@@ -221,20 +223,29 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -221,20 +222,29 @@ static int mlx4_poll_one(struct mlx4_cq 
  	is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
  		MLX4_CQE_OPCODE_ERROR;
  
@@ -84,7 +80,7 @@ index 68e16e9..c598b87 100644
  
  	if (is_send) {
  		wq = &(*cur_qp)->sq;
-@@ -242,6 +254,10 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
+@@ -242,6 +252,10 @@ static int mlx4_poll_one(struct mlx4_cq 
  		wq->tail += (uint16_t) (wqe_index - (uint16_t) wq->tail);
  		wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)];
  		++wq->tail;
@@ -95,7 +91,7 @@ index 68e16e9..c598b87 100644
  	} else if ((*cur_qp)->ibv_qp.srq) {
  		srq = to_msrq((*cur_qp)->ibv_qp.srq);
  		wqe_index = htons(cqe->wqe_index);
-@@ -387,6 +403,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
+@@ -387,6 +401,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq,
  	uint32_t prod_index;
  	uint8_t owner_bit;
  	int nfreed = 0;
@@ -106,7 +102,7 @@ index 68e16e9..c598b87 100644
  
  	/*
  	 * First we need to find the current producer index, so we
-@@ -405,7 +425,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq)
+@@ -405,7 +423,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq,
  	 */
  	while ((int) --prod_index - (int) cq->cons_index >= 0) {
  		cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe);
@@ -120,8 +116,6 @@ index 68e16e9..c598b87 100644
  			if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK))
  				mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index));
  			++nfreed;
-diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
-index 20a40c9..1b1253c 100644
 --- a/src/mlx4-abi.h
 +++ b/src/mlx4-abi.h
 @@ -68,6 +68,14 @@ struct mlx4_resize_cq {
@@ -152,8 +146,6 @@ index 20a40c9..1b1253c 100644
 +#endif
 +
  #endif /* MLX4_ABI_H */
-diff --git a/src/mlx4.c b/src/mlx4.c
-index 671e849..27ca75d 100644
 --- a/src/mlx4.c
 +++ b/src/mlx4.c
 @@ -68,6 +68,16 @@ struct {
@@ -173,7 +165,7 @@ index 671e849..27ca75d 100644
  static struct ibv_context_ops mlx4_ctx_ops = {
  	.query_device  = mlx4_query_device,
  	.query_port    = mlx4_query_port,
-@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_
+@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_co
  	for (i = 0; i < MLX4_QP_TABLE_SIZE; ++i)
  		context->qp_table[i].refcnt = 0;
  
@@ -189,7 +181,7 @@ index 671e849..27ca75d 100644
  	for (i = 0; i < MLX4_NUM_DB_TYPE; ++i)
  		context->db_list[i] = NULL;
  
-@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_
+@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_co
  	pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
  
  	context->ibv_ctx.ops = mlx4_ctx_ops;
@@ -199,8 +191,6 @@ index 671e849..27ca75d 100644
  
  	if (mlx4_query_device(&context->ibv_ctx, &dev_attrs))
  		goto query_free;
-diff --git a/src/mlx4.h b/src/mlx4.h
-index 8643d8f..3eadb98 100644
 --- a/src/mlx4.h
 +++ b/src/mlx4.h
 @@ -79,6 +79,11 @@
@@ -248,7 +238,7 @@ index 8643d8f..3eadb98 100644
  	struct mlx4_db_page	       *db_list[MLX4_NUM_DB_TYPE];
  	pthread_mutex_t			db_list_mutex;
  };
-@@ -260,6 +284,11 @@ struct mlx4_ah {
+@@ -266,6 +290,11 @@ struct mlx4_ah {
  	struct mlx4_av			av;
  };
  
@@ -260,7 +250,7 @@ index 8643d8f..3eadb98 100644
  static inline unsigned long align(unsigned long val, unsigned long align)
  {
  	return (val + align - 1) & ~(align - 1);
-@@ -304,6 +333,13 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah)
+@@ -310,6 +339,13 @@ static inline struct mlx4_ah *to_mah(str
  	return to_mxxx(ah, ah);
  }
  
@@ -272,9 +262,9 @@ index 8643d8f..3eadb98 100644
 +#endif
 +
  int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size);
+ int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size);
  void mlx4_free_buf(struct mlx4_buf *buf);
- 
-@@ -350,6 +386,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *srq, int ind);
+@@ -357,6 +393,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *
  int mlx4_post_srq_recv(struct ibv_srq *ibsrq,
  		       struct ibv_recv_wr *wr,
  		       struct ibv_recv_wr **bad_wr);
@@ -285,7 +275,7 @@ index 8643d8f..3eadb98 100644
  
  struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr);
  int mlx4_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
-@@ -380,5 +420,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, struct ibv_ah_attr *attr,
+@@ -387,5 +427,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, st
  void mlx4_free_av(struct mlx4_ah *ah);
  int mlx4_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
  int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
@@ -302,11 +292,9 @@ index 8643d8f..3eadb98 100644
 +
  
  #endif /* MLX4_H */
-diff --git a/src/qp.c b/src/qp.c
-index 01e8580..2f02430 100644
 --- a/src/qp.c
 +++ b/src/qp.c
-@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
+@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, 
  		ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1));
  		qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id;
  
@@ -315,7 +303,7 @@ index 01e8580..2f02430 100644
  			(wr->send_flags & IBV_SEND_SIGNALED ?
  			 htonl(MLX4_WQE_CTRL_CQ_UPDATE) : 0) |
  			(wr->send_flags & IBV_SEND_SOLICITED ?
-@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
+@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, 
  		size = sizeof *ctrl / 16;
  
  		switch (ibqp->qp_type) {
@@ -325,7 +313,7 @@ index 01e8580..2f02430 100644
  		case IBV_QPT_RC:
  		case IBV_QPT_UC:
  			switch (wr->opcode) {
-@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp_cap *cap, enum ibv_qp_type type,
+@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp
  		size += sizeof (struct mlx4_wqe_raddr_seg);
  		break;
  
@@ -333,7 +321,7 @@ index 01e8580..2f02430 100644
  	case IBV_QPT_RC:
  		size += sizeof (struct mlx4_wqe_raddr_seg);
  		/*
-@@ -631,6 +635,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap,
+@@ -632,6 +636,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q
  
  	case IBV_QPT_UC:
  	case IBV_QPT_RC:
@@ -341,11 +329,9 @@ index 01e8580..2f02430 100644
  		wqe_size -= sizeof (struct mlx4_wqe_raddr_seg);
  		break;
  
-diff --git a/src/srq.c b/src/srq.c
-index ba2ceb9..1350792 100644
 --- a/src/srq.c
 +++ b/src/srq.c
-@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr,
+@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd
  
  	return 0;
  }
@@ -399,8 +385,6 @@ index ba2ceb9..1350792 100644
 +	pthread_mutex_unlock(&ctx->xrc_srq_table_mutex);
 +}
 +
-diff --git a/src/verbs.c b/src/verbs.c
-index 400050c..b7c9c8e 100644
 --- a/src/verbs.c
 +++ b/src/verbs.c
 @@ -368,18 +368,36 @@ int mlx4_query_srq(struct ibv_srq *srq,
@@ -447,7 +431,7 @@ index 400050c..b7c9c8e 100644
  
  	return 0;
  }
-@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	qp->sq.wqe_cnt = align_queue_size(attr->cap.max_send_wr + qp->sq_spare_wqes);
  	qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr);
  
@@ -456,7 +440,7 @@ index 400050c..b7c9c8e 100644
  		attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0;
  	else {
  		if (attr->cap.max_recv_sge < 1)
-@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	    pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE))
  		goto err_free;
  
@@ -465,7 +449,7 @@ index 400050c..b7c9c8e 100644
  		qp->db = mlx4_alloc_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ);
  		if (!qp->db)
  			goto err_free;
-@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
+@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
  	}
  
  	cmd.buf_addr	    = (uintptr_t) qp->buf.buf;
@@ -474,7 +458,7 @@ index 400050c..b7c9c8e 100644
  		cmd.db_addr = 0;
  	else
  		cmd.db_addr = (uintptr_t) qp->db;
-@@ -485,7 +503,7 @@ err_destroy:
+@@ -489,7 +507,7 @@ err_destroy:
  
  err_rq_db:
  	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
@@ -483,7 +467,7 @@ index 400050c..b7c9c8e 100644
  		mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db);
  
  err_free:
-@@ -544,7 +562,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
+@@ -548,7 +566,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, st
  			mlx4_cq_clean(to_mcq(qp->send_cq), qp->qp_num, NULL);
  
  		mlx4_init_qp_indices(to_mqp(qp));
@@ -492,16 +476,16 @@ index 400050c..b7c9c8e 100644
  			*to_mqp(qp)->db = 0;
  	}
  
-@@ -603,7 +621,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
- 
+@@ -611,7 +629,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
  	mlx4_unlock_cqs(ibqp);
+ 	pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
  
 -	if (!ibqp->srq)
 +	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
  		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
  	free(qp->sq.wrid);
  	if (qp->rq.wqe_cnt)
-@@ -661,3 +679,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid)
+@@ -669,3 +687,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp,
  {
  	return ibv_cmd_detach_mcast(qp, gid, lid);
  }
@@ -605,8 +589,6 @@ index 400050c..b7c9c8e 100644
 +	return 0;
 +}
 +#endif
-diff --git a/src/wqe.h b/src/wqe.h
-index 6f7f309..fa2f8ac 100644
 --- a/src/wqe.h
 +++ b/src/wqe.h
 @@ -65,7 +65,7 @@ struct mlx4_wqe_ctrl_seg {
diff --git a/fixes/xrc_fix_close_domain.patch b/fixes/xrc_fix_close_domain.patch
index dfad7ac..3af2640 100644
--- a/fixes/xrc_fix_close_domain.patch
+++ b/fixes/xrc_fix_close_domain.patch
@@ -6,11 +6,9 @@ Need to pass this upward to caller.
 
 Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 
-Index: libmlx4/src/verbs.c
-===================================================================
---- libmlx4.orig/src/verbs.c	2008-09-01 10:51:11.000000000 +0300
-+++ libmlx4/src/verbs.c	2008-09-01 10:52:40.000000000 +0300
-@@ -774,9 +774,11 @@
+--- a/src/verbs.c
++++ b/src/verbs.c
+@@ -782,9 +782,11 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom
  
  int mlx4_close_xrc_domain(struct ibv_xrc_domain *d)
  {
diff --git a/fixes/xrc_rcv_qp_v2.patch b/fixes/xrc_rcv_qp_v2.patch
index 311c500..00ffd53 100644
--- a/fixes/xrc_rcv_qp_v2.patch
+++ b/fixes/xrc_rcv_qp_v2.patch
@@ -5,11 +5,9 @@ Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
 V2:
 1. xrc_ops changed to more_ops
 
-diff --git a/src/mlx4.c b/src/mlx4.c
-index 27ca75d..e5ded78 100644
 --- a/src/mlx4.c
 +++ b/src/mlx4.c
-@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops = {
+@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops
  	.create_xrc_srq   = mlx4_create_xrc_srq,
  	.open_xrc_domain  = mlx4_open_xrc_domain,
  	.close_xrc_domain = mlx4_close_xrc_domain,
@@ -21,11 +19,9 @@ index 27ca75d..e5ded78 100644
  #endif
  };
  #endif
-diff --git a/src/mlx4.h b/src/mlx4.h
-index 3eadb98..6307a2d 100644
 --- a/src/mlx4.h
 +++ b/src/mlx4.h
-@@ -429,6 +429,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context,
+@@ -436,6 +436,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom
  					    int fd, int oflag);
  
  int mlx4_close_xrc_domain(struct ibv_xrc_domain *d);
@@ -47,11 +43,9 @@ index 3eadb98..6307a2d 100644
  #endif
  
  
-diff --git a/src/verbs.c b/src/verbs.c
-index b7c9c8e..8261eae 100644
 --- a/src/verbs.c
 +++ b/src/verbs.c
-@@ -778,4 +778,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc_domain *d)
+@@ -786,4 +786,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc
  	free(d);
  	return 0;
  }
-- 
1.6.3.1


From sebastien.dugue at bull.net  Thu May 28 01:22:49 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Thu, 28 May 2009 10:22:49 +0200
Subject: [ofa-general] [PATCH 1/3] V2 - libmthca - Optimize memory allocation
 of QP buffers
In-Reply-To: <20090528102059.2fd85540@frecb007965>
References: <20090528102059.2fd85540@frecb007965>
Message-ID: <20090528102249.2ca01866@frecb007965>


QP buffers are allocated with mthca_alloc_buf(), which rounds the buffer
size up to the page size and then allocates page-aligned memory using
posix_memalign().

  However, this allocation is quite wasteful on architectures using 64K pages
(ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
parameter and chunks are allocated using mmap. Thus we end up allocating:

(requested size rounded to the page size) + (page size) + (malloc overhead)

rounded internally to the page size.

  So for example, if we request a buffer of page_size bytes, we end up
consuming 3 pages. In short, for each QP buffer we allocate, there is an
overhead of 2 pages. This is especially visible on large clusters, where
the number of QPs can reach several thousand.

  This patch creates a new function mthca_alloc_page() for use by
mthca_alloc_qp_buf() that does an mmap() instead of a posix_memalign().

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>
---
 src/buf.c   |   34 ++++++++++++++++++++++++++++++++--
 src/mthca.h |    7 +++++++
 src/qp.c    |    7 ++++---
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/src/buf.c b/src/buf.c
index 6c1be4f..499edeb 100644
--- a/src/buf.c
+++ b/src/buf.c
@@ -35,6 +35,8 @@
 #endif /* HAVE_CONFIG_H */
 
 #include <stdlib.h>
+#include <sys/mman.h>
+#include <errno.h>
 
 #include "mthca.h"
 
@@ -69,8 +71,32 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size)
 	if (ret)
 		free(buf->buf);
 
-	if (!ret)
+	if (!ret) {
 		buf->length = size;
+		buf->type = MTHCA_MALIGN;
+	}
+
+	return ret;
+}
+
+int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size)
+{
+	int ret;
+
+	/* Use mmap directly to allocate an aligned buffer */
+	buf->buf = mmap(NULL, align(size, page_size), PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (buf->buf == MAP_FAILED)
+		return errno;
+
+	ret = ibv_dontfork_range(buf->buf, size);
+	if (ret)
+		munmap(buf->buf, align(size, page_size));
+	else {
+		buf->length = size;
+		buf->type = MTHCA_MMAP;
+	}
 
 	return ret;
 }
@@ -78,5 +104,9 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size)
 void mthca_free_buf(struct mthca_buf *buf)
 {
 	ibv_dofork_range(buf->buf, buf->length);
-	free(buf->buf);
+
+	if (buf->type == MTHCA_MMAP)
+		munmap(buf->buf, buf->length);
+	else
+		free(buf->buf);
 }
diff --git a/src/mthca.h b/src/mthca.h
index 66751f3..7db15a7 100644
--- a/src/mthca.h
+++ b/src/mthca.h
@@ -138,9 +138,15 @@ struct mthca_context {
 	int		       qp_table_mask;
 };
 
+enum mthca_buf_type {
+	MTHCA_MMAP,
+	MTHCA_MALIGN
+};
+
 struct mthca_buf {
 	void		       *buf;
 	size_t			length;
+	enum mthca_buf_type	type;
 };
 
 struct mthca_pd {
@@ -291,6 +297,7 @@ static inline int mthca_is_memfree(struct ibv_context *ibctx)
 }
 
 int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size);
+int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size);
 void mthca_free_buf(struct mthca_buf *buf);
 
 int mthca_alloc_db(struct mthca_db_table *db_tab, enum mthca_db_type type,
diff --git a/src/qp.c b/src/qp.c
index 84dd206..15f4805 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -848,9 +848,10 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 
 	qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift);
 
-	if (mthca_alloc_buf(&qp->buf,
-			    align(qp->buf_size, to_mdev(pd->context->device)->page_size),
-			    to_mdev(pd->context->device)->page_size)) {
+	if (mthca_alloc_page(&qp->buf,
+			     align(qp->buf_size,
+				   to_mdev(pd->context->device)->page_size),
+			     to_mdev(pd->context->device)->page_size)) {
 		free(qp->wrid);
 		return -1;
 	}
-- 
1.6.3.1
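
For readers who want to try the allocation scheme outside the library, here
is a minimal standalone sketch of the same idea: map the buffer directly with
mmap(), remember how it was allocated, and free it accordingly. All names are
hypothetical, and the ibv_dontfork_range()/ibv_dofork_range() calls made by
the patch are deliberately omitted so that the block builds without
libibverbs.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS with strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical equivalents of the buffer type tag added by the patch. */
enum sketch_buf_type { SKETCH_MMAP, SKETCH_MALIGN };

struct sketch_buf {
    void                 *buf;
    size_t                length;   /* mapped length, a page multiple */
    enum sketch_buf_type  type;
};

/* Round val up to a multiple of align (align must be a power of two). */
static size_t sketch_align(size_t val, size_t align)
{
    return (val + align - 1) & ~(align - 1);
}

/* Allocate a page-aligned buffer with mmap(). Returns 0 or an errno value. */
static int sketch_alloc_page(struct sketch_buf *buf, size_t size,
                             size_t page_size)
{
    size_t len = sketch_align(size, page_size);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return errno;

    buf->buf = p;
    buf->length = len;
    buf->type = SKETCH_MMAP;
    return 0;
}

/* Release a buffer according to how it was allocated. */
static void sketch_free_buf(struct sketch_buf *buf)
{
    if (buf->type == SKETCH_MMAP)
        munmap(buf->buf, buf->length);
    else
        free(buf->buf);

    buf->buf = NULL;
    buf->length = 0;
}
```

Unlike the patch, this sketch stores the rounded length in the buffer, so the
later munmap() covers the whole mapping even if the caller passes a size that
is not a multiple of the page size.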


From sebastien.dugue at bull.net  Thu May 28 01:23:58 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Thu, 28 May 2009 10:23:58 +0200
Subject: [ofa-general] [PATCH 2/3] V2 - libmlx4 - Optimize memory allocation
	of QP buffers
In-Reply-To: <20090528102059.2fd85540@frecb007965>
References: <20090528102059.2fd85540@frecb007965>
Message-ID: <20090528102358.0c5b2124@frecb007965>


QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffer
size up to the page size and then allocates page-aligned memory using
posix_memalign().

  However, this allocation is quite wasteful on architectures using 64K pages
(ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
parameter and chunks are allocated using mmap. Thus we end up allocating:

(requested size rounded to the page size) + (page size) + (malloc overhead)

rounded internally to the page size.

  So for example, if we request a buffer of page_size bytes, we end up
consuming 3 pages. In short, for each QP buffer we allocate, there is an
overhead of 2 pages. This is especially visible on large clusters, where
the number of QPs can reach several thousand.

  This patch creates a new function mlx4_alloc_page() for use by
mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign().

Signed-off-by: Sebastien Dugue <sebastien.dugue at bull.net>
---
 src/buf.c  |   34 ++++++++++++++++++++++++++++++++--
 src/mlx4.h |    7 +++++++
 src/qp.c   |    5 +++--
 3 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/src/buf.c b/src/buf.c
index 0e5f9b6..73565e6 100644
--- a/src/buf.c
+++ b/src/buf.c
@@ -35,6 +35,8 @@
 #endif /* HAVE_CONFIG_H */
 
 #include <stdlib.h>
+#include <sys/mman.h>
+#include <errno.h>
 
 #include "mlx4.h"
 
@@ -69,14 +71,42 @@ int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size)
 	if (ret)
 		free(buf->buf);
 
-	if (!ret)
+	if (!ret) {
 		buf->length = size;
+		buf->type = MLX4_MALIGN;
+	}
 
 	return ret;
 }
 
+int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size)
+{
+	int ret;
+
+	/* Use mmap directly to allocate an aligned buffer */
+	buf->buf = mmap(NULL, align(size, page_size), PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (buf->buf == MAP_FAILED)
+		return errno;
+
+	ret = ibv_dontfork_range(buf->buf, size);
+	if (ret)
+		munmap(buf->buf, align(size, page_size));
+	else {
+		buf->length = size;
+		buf->type = MLX4_MMAP;
+	}
+
+	return ret;
+}
+
 void mlx4_free_buf(struct mlx4_buf *buf)
 {
 	ibv_dofork_range(buf->buf, buf->length);
-	free(buf->buf);
+
+	if (buf->type == MLX4_MMAP)
+		munmap(buf->buf, buf->length);
+	else
+		free(buf->buf);
 }
diff --git a/src/mlx4.h b/src/mlx4.h
index 827a201..83547f5 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -161,9 +161,15 @@ struct mlx4_context {
 	pthread_mutex_t			db_list_mutex;
 };
 
+enum mlx4_buf_type {
+	MLX4_MMAP,
+	MLX4_MALIGN
+};
+
 struct mlx4_buf {
 	void			       *buf;
 	size_t				length;
+	enum mlx4_buf_type		type;
 };
 
 struct mlx4_pd {
@@ -288,6 +294,7 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah)
 }
 
 int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size);
+int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size);
 void mlx4_free_buf(struct mlx4_buf *buf);
 
 uint32_t *mlx4_alloc_db(struct mlx4_context *context, enum mlx4_db_type type);
diff --git a/src/qp.c b/src/qp.c
index d194ae3..557e255 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -604,8 +604,9 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 		qp->sq.offset = 0;
 	}
 
-	if (mlx4_alloc_buf(&qp->buf,
-			    align(qp->buf_size, to_mdev(pd->context->device)->page_size),
+	if (mlx4_alloc_page(&qp->buf,
+			    align(qp->buf_size,
+				  to_mdev(pd->context->device)->page_size),
 			    to_mdev(pd->context->device)->page_size)) {
 		free(qp->sq.wrid);
 		free(qp->rq.wrid);
-- 
1.6.3.1
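
The overhead arithmetic in the patch description can be checked with a small
sketch. It is illustrative only: round_up(), memalign_footprint() and
mmap_footprint() are hypothetical helpers, not part of libmlx4, and the malloc
overhead is an assumed parameter rather than a value taken from glibc.

```c
#include <stddef.h>

/* Round size up to the next multiple of page_size (assumed to be a power
 * of two), in the same spirit as the align() helper used by the patch. */
static size_t round_up(size_t size, size_t page_size)
{
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Footprint when posix_memalign() ends up served by an mmap'ed chunk:
 * (request rounded to the page size) + (page size) + (malloc overhead),
 * itself rounded up to whole pages. malloc_overhead is an assumption. */
static size_t memalign_footprint(size_t size, size_t page_size,
                                 size_t malloc_overhead)
{
    size_t raw = round_up(size, page_size) + page_size + malloc_overhead;

    return round_up(raw, page_size);
}

/* Footprint when the buffer is mapped directly with mmap(). */
static size_t mmap_footprint(size_t size, size_t page_size)
{
    return round_up(size, page_size);
}
```

With 64K pages and a one-page request, memalign_footprint(65536, 65536, 16)
comes to three pages while mmap_footprint(65536, 65536) stays at one page,
which matches the 2-page overhead per QP buffer described above.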


From vlad at lists.openfabrics.org  Thu May 28 03:24:45 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 28 May 2009 03:24:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090528-0200 daily build status
Message-ID: <20090528102445.68B4BE28179@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From tziporet at mellanox.co.il  Thu May 28 03:37:49 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 28 May 2009 13:37:49 +0300
Subject: [ofa-general] OFED 1.4.1 GA is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02C12252@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD02C12252@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02D5F39D@mtlexch01.mtl.com>

 
I am pleased to announce that the OFED-1.4.1 GA release is done.

The tarball is available on:
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/  for
OFED 1.4.1

Vladimir & Tziporet

========================================================================


Release information:
------------------------------
Linux Operating Systems:
      - RedHat EL4 up4:   2.6.9-42.ELsmp        *
      - RedHat EL4 up5:   2.6.9-55.ELsmp
      - RedHat EL4 up6:   2.6.9-67.ELsmp
      - RedHat EL4 up7:   2.6.9-78.ELsmp
      - RedHat EL5:       2.6.18-8.el5
      - RedHat EL5 up1:   2.6.18-53.el5
      - RedHat EL5 up2:   2.6.18-92.el5
      - RedHat EL5 up3:   2.6.18-128.el5
      - OEL 4.5:          2.6.9-55.ELsmp
      - OEL 5.2:          2.6.18-92.el5
      - CentOS 5.2:       2.6.18-92.el5
      - Fedora C9:        2.6.25-14.fc9         *
      - SLES10:           2.6.16.21-0.8-smp
      - SLES10 SP1:       2.6.16.46-0.12-smp
      - SLES10 SP1 up1:   2.6.16.53-0.16-smp
      - SLES10 SP2:       2.6.16.60-0.21-smp
      - SLES11 GA:        2.6.27.13-1-default
      - OpenSuSE 10.3:    2.6.22.5-31           *
      - kernel.org:       2.6.26 and 2.6.27

    * Minimal QA for these versions

Systems:
      * x86_64
      * x86
      * ia64
      * ppc64

Main Changes from OFED-1.4.0
==========================
- New OSes: Added support for RHEL 5.3 and SLES11
- NFS/RDMA: In beta quality with backports for RHEL 5.2, 5.3 and SLES 10 SP2
- Updated MPI packages:
        - Open MPI 1.3.2 - new version - see OpenMPI release notes for details
        - MVAPICH 1.1.0-3355 - bug fix version
- Updated bonding package: ib-bonding-0.9.0-40
- Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
- Updated opensm version to include critical bug fixes
- Fixed RDS iWARP support and fixed stability issues
- Low level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
- mstflint update
- Bug fixes

See each component's release notes for details on enhancements and bug
fixes.


From swise at opengridcomputing.com  Thu May 28 06:49:26 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 28 May 2009 08:49:26 -0500
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Report correct port state and
	mtu.
In-Reply-To: <ada4ov66x42.fsf@cisco.com>
References: <20090527190852.16426.82898.stgit@build.ogc.int>
	<ada4ov66x42.fsf@cisco.com>
Message-ID: <4A1E9666.2060807@opengridcomputing.com>

Roland Dreier wrote:
> OK, applied.  Would be nice if we had a better way to report MTU, but whatever...
>   
Agreed.  At this point iWARP is still just plugging into the IB 
infrastructure for much of this.

Got any ideas on how to do this better?


From Bob.Ciotti at nasa.gov  Thu May 28 10:57:57 2009
From: Bob.Ciotti at nasa.gov (Bob Ciotti)
Date: Thu, 28 May 2009 10:57:57 -0700
Subject: [ofa-general] SubnAdmGet (6777)
Message-ID: <20090528175757.GA95655@nas.nasa.gov>


Sorry to bounce this off the list - should it be too remedial. I promise
that I've been consuming a lot of the spec and OFA code. Maybe you consider
that a promise or a warning we will be more active :|

Our configuration is >6000 CAs in a mix of InfiniHost III/ConnectX and
Longbow extenders and >800 24-port switches on a single subnet (SGI ICE
with lots of other stuff plugged in). It's DDR everywhere except across the
Longbows. Hosts range from a few different generations of x86 Xeon, x86
Opteron and Itanium. We use Lustre but have the SRP traffic on a separate
subnet.

A few weeks ago connection setup times were mentioned on this list along
with ARP and path record lookups not being scalable. We experience these
problems as well and need to address these scalability issues. I have quite
a bit of test data and a few different ideas to bounce off the list RE path
records, once I am a little more versed in the spec. There has already been
some work done to limit ARP traffic.


Today's question has to do with SM errors.
We have been seeing lots of these - sometimes more than others. Digging
around some, it appears that the 6777 represents the number of duplicates?
This value fluctuates some, but not a lot. Comments in the code
indicate that any value >1 is a problem. The question is: is it OK for this
to be happening, and how does it occur?

We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
We are currently running a pre-1.4 top-of-tree pull from back in Dec. bob


May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)

....
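
To see how much the value in parentheses fluctuates, it can be pulled out of
each log line. A minimal sketch, assuming only the line format shown above;
parse_4c05_count() is a hypothetical helper, not something OpenSM provides.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 and stores the parenthesised number if the line is an
 * "ERR 4C05" message in the format shown above; returns 0 otherwise. */
static int parse_4c05_count(const char *line, unsigned int *count)
{
    const char *paren;

    if (strstr(line, "ERR 4C05:") == NULL)
        return 0;

    /* The number of interest sits in the last "(...)" on the line. */
    paren = strrchr(line, '(');
    if (paren == NULL)
        return 0;

    return sscanf(paren, "(%u)", count) == 1;
}
```

Feeding every line of the SM log through this helper and tracking the minimum
and maximum would show the range the value moves in.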


-------------------------------------------------------------------------
Robert B. Ciotti                              Supercomputing Systems Lead
NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
NASA Ames Research Center                              FAX (650) 604-4377
Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
-------------------------------------------------------------------------


From hal.rosenstock at gmail.com  Thu May 28 12:06:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 28 May 2009 15:06:38 -0400
Subject: [ofa-general] SubnAdmGet (6777)
In-Reply-To: <20090528175757.GA95655@nas.nasa.gov>
References: <20090528175757.GA95655@nas.nasa.gov>
Message-ID: <f0e08f230905281206m67f448efia95f72cc417d53be@mail.gmail.com>

On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
>
> Sorry to bounce this off the list - should it be too remedial. I promise
> that I've been consuming a lot of the spec and OFA code. Maybe you consider
> that a promise or a warning we will be more active :|
>
> Our configuration is >6000 CAs in a mix of InfiniHost III/ConnectX and
> Longbow extenders and >800 24-port switches on a single subnet (SGI ICE
> with lots of other stuff plugged in). It's DDR everywhere except across the
> Longbows. Hosts range from a few different generations of x86 Xeon, x86
> Opteron and Itanium. We use Lustre but have the SRP traffic on a separate
> subnet.
>
> A few weeks ago connection setup times were mentioned on this list along
> with ARP and path record lookups not being scalable. We experience these
> problems as well and need to address these scalability issues. I have quite
> a bit of test data and a few different ideas to bounce off the list RE path
> records, once I am a little more versed in the spec. There has already been
> some work done to limit ARP traffic.
>
> Today's question has to do with SM errors.
> We have been seeing lots of these - sometimes more than others. Digging
> around some, it appears that the 6777 represents the number of duplicates?
> This value fluctuates some, but not a lot. Comments in the code
> indicate that any value >1 is a problem. The question is: is it OK for this
> to be happening, and how does it occur?

It's an error (and an error status of too many records is returned to the
SA client in the end node).

Gets are only allowed to return 1 record (GetTable requests can deal
with more than 1 record in the response), yet many were found by the SA
that satisfied the request in responding to the Get. Any idea which
specific Get causes this to occur?
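
The rule described above can be sketched as follows. The enum and function
names are illustrative assumptions, not OpenSM or IB spec identifiers, and the
sketch covers only the "more than one record" case discussed here.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum sa_method { SA_METHOD_GET, SA_METHOD_GETTABLE };
enum sa_status { SA_STATUS_OK, SA_STATUS_TOO_MANY_RECORDS };

/* A Get may carry exactly one record in its response; a GetTable may
 * carry many. num_matching is the number of records the SA found. */
static enum sa_status sa_response_status(enum sa_method method,
                                         size_t num_matching)
{
    if (method == SA_METHOD_GET && num_matching > 1)
        return SA_STATUS_TOO_MANY_RECORDS;

    return SA_STATUS_OK;
}
```

Under this reading, a Get whose component mask matches 6777 records is
rejected, while the same query issued as a GetTable would be answered.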

-- Hal

> We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
> We are currently running a pre-1.4 top-of-tree pull from back in Dec. bob
>
>
> May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>
> ....
>
>
>
> -------------------------------------------------------------------------
> Robert B. Ciotti                              Supercomputing Systems Lead
> NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
> NASA Ames Research Center                              FAX (650) 604-4377
> Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
> -------------------------------------------------------------------------
>


From rdreier at cisco.com  Thu May 28 15:31:10 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 28 May 2009 15:31:10 -0700
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB
	cards in /proc/interrupts
In-Reply-To: <4A1DEBC8.9050207@sgi.com> (Arputham Benjamin's message of "Wed, 
	27 May 2009 18:41:28 -0700")
References: <4A0B560B.3090606@sgi.com> <adad4aaf3dk.fsf@cisco.com>
	<4A136DF0.7000402@sgi.com> <adaab58ctd4.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>
	<adaskizbimo.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
	<ada8wkqayfb.fsf@cisco.com>
	<1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com>
	<adazld69f4e.fsf@cisco.com> <4A1DEBC8.9050207@sgi.com>
Message-ID: <ada63fk6es1.fsf@cisco.com>


 > > Not sure what you mean.  If we put msi-x info under /sys, then you can
 > > figure out which interrupts belong to a given HCA by following the
 > > device link from /sys/class/infiniband.  Similarly if /proc/interrupts
 > > gives the PCI device, then you have the same ability.  So either way
 > > works as far as I can tell.
 > Linux is supposed to move away from procfs to sysfs for this type of
 > device-related info. However, /proc/interrupts is still present in the
 > latest distro releases (for example, SLES11) and OFED needs to provide
 > support for this in procfs until the /proc/interrupts support is removed
 > from the kernel.

I think we're talking past each other.  I agree that /proc/interrupts is
still needed.  However, there are two things I see that we can add, and
each one suffices to make everything unambiguous:

1) Add PCI device info ("mlx4-comp-1 at pci...") to the interrupt name.
   Then if userspace cares about the interrupts for device "foo", it can
   look at the /sys/class/infiniband/foo/device symlink to find the PCI
   device, and then look in /proc/interrupts for all interrupts related
   to that PCI device.

*OR*

2) Add /sys/devices/pci.../msix/vectorN files (or something like that)
   so userspace can similarly follow the
   /sys/class/infiniband/foo/device symlink to the PCI directory and
   read the MSI-X vector numbers for the device, and then get all info
   for that interrupt from /proc/interrupts, /proc/irq/NNN/smp_affinity,
   etc.

Either option by itself is completely sufficient.
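
As a rough sketch of what userspace would do under option 1: once the PCI
address has been obtained from the /sys/class/infiniband/foo/device symlink,
each line of /proc/interrupts can be tested against it. The helper name
irq_for_pci_device() is hypothetical, and the sketch assumes only that the
interrupt name contains the PCI address somewhere, since the exact name
format is not settled in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* If a /proc/interrupts line mentions pci_addr, store the line's IRQ
 * number in *irq and return 1; otherwise return 0. */
static int irq_for_pci_device(const char *line, const char *pci_addr,
                              long *irq)
{
    char *end;
    long n;

    if (strstr(line, pci_addr) == NULL)
        return 0;

    /* Numbered rows start with "<irq>:"; strtol() skips leading blanks. */
    n = strtol(line, &end, 10);
    if (end == line || *end != ':')
        return 0;   /* summary rows such as "NMI:" have no IRQ number */

    *irq = n;
    return 1;
}
```

The IRQ numbers collected this way can then be used with
/proc/irq/NNN/smp_affinity, as in option 2.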

 > I have not seen full sysfs support for Ethernet devices.
 > I have seen IRQ number info but no interrupt counters on a per CPU basis.
 > Do we know when the full support for ethernet devices will be available
 > in sysfs? We can enhance OFED at the same time ethernet support is made
 > available in the kernel.

Umm... for ethernet you can get per-CPU counters from /proc/interrupts,
if you know the IRQ number.  But if you have multiple MSI-X interrupts
then you have to get the IRQ number some other way.

 > 3) We can add dev_alloc_name() functionality to mlx4_core similar to
 > alloc_name() present in ib_core. This is consistent with other ethernet
 > device driver implementations using the function dev_alloc_name()
 > present in the kernel. (Please see .../net/core/dev.c)

Not sure how this could work.  If mlx4_core is allocating device
numbers, and I have 3 adapters, only 2 of which are IB HCAs and 1 of
which is an ethernet adapter, then how does mlx4_core assign numbers that
match what the RDMA layer will use?

 - R.


From Bob.Ciotti at nasa.gov  Thu May 28 16:41:33 2009
From: Bob.Ciotti at nasa.gov (Bob Ciotti)
Date: Thu, 28 May 2009 16:41:33 -0700
Subject: [ofa-general] SubnAdmGet (6777)
In-Reply-To: <f0e08f230905281206m67f448efia95f72cc417d53be@mail.gmail.com>
References: <20090528175757.GA95655@nas.nasa.gov>
	<f0e08f230905281206m67f448efia95f72cc417d53be@mail.gmail.com>
Message-ID: <20090528234133.GA45460@nas.nasa.gov>

On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
> >
> > Sorry to bounce this off the list - should it be too remedial. I promise
> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
> > that a promise or a warning we will be more active :|
> >
> > Our configuration is >6000 CAs in a mix of InfiniHost III/ConnectX and
> > Longbow extenders and >800 24-port switches on a single subnet (SGI ICE
> > with lots of other stuff plugged in). It's DDR everywhere except across the
> > Longbows. Hosts range from a few different generations of x86 Xeon, x86
> > Opteron and Itanium. We use Lustre but have the SRP traffic on a separate
> > subnet.
> >
> > A few weeks ago connection setup times were mentioned on this list along
> > with ARP and path record lookups not being scalable. We experience these
> > problems as well and need to address these scalability issues. I have quite
> > a bit of test data and a few different ideas to bounce off the list RE path
> > records, once I am a little more versed in the spec. There has already been
> > some work done to limit ARP traffic.
> >
> > Today's question has to do with SM errors.
> > We have been seeing lots of these - sometimes more than others. Digging
> > around some, it appears that the 6777 represents the number of duplicates?
> > This value fluctuates some, but not a lot. Comments in the code
> > indicate that any value >1 is a problem. The question is: is it OK for this
> > to be happening, and how does it occur?
>
> It's an error (and an error status of too many records is returned to the
> SA client in the end node).
>
> Gets are only allowed to return 1 record (GetTable requests can deal
> with more than 1 record in the response), yet many were found by the SA
> that satisfied the request in responding to the Get. Any idea which
> specific Get causes this to occur?

That's the problem. At the debug level we are running at, I can't pin down
the source. Is there state I can go look at on the clients to see what
they are trying to do?

bob


> -- Hal
> 
> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
> > We are currently running a pre-1.4 top-of-tree pull from back in Dec. bob
> >
> >
> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> >
> > ....
> >
> >
> >
> > -------------------------------------------------------------------------
> > Robert B. Ciotti                              Supercomputing Systems Lead
> > NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
> > NASA Ames Research Center                              FAX (650) 604-4377
> > Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
> > -------------------------------------------------------------------------
> >


From vlad at lists.openfabrics.org  Fri May 29 03:24:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 29 May 2009 03:24:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090529-0200 daily build status
Message-ID: <20090529102413.3C9B4E61689@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From hry at platform.com  Thu May 28 23:28:54 2009
From: hry at platform.com (Hans Westgaard Ry)
Date: Fri, 29 May 2009 08:28:54 +0200
Subject: [ofa-general] Memory registration redux
In-Reply-To: <C7598186-0CFA-40F8-A94C-9D4BA0E6E85D@cisco.com>
References: <CCC4612E-CA82-4D3D-AA5F-776FF14AD34E@cisco.com><adabpq6t2k8.fsf@cisco.com><20090506214628.GM2590@obsidianresearch.com><adatz3xsxo6.fsf@cisco.com><20090506222638.GA16280@obsidianresearch.com><adaprelsvnp.fsf@cisco.com><20090507000231.GB16280@obsidianresearch.com><adak54ssi0g.fsf@cisco.com><20090507224806.GF16280@obsidianresearch.com><adabppf8nln.fsf@cisco.com>
	<ada7i038nk9.fsf@cisco.com><5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com>
	<adaws8277wa.fsf@cisco.com>
	<C7598186-0CFA-40F8-A94C-9D4BA0E6E85D@cisco.com>
Message-ID: <4A1F80A6.2010709@platform.com>

The scheme looks fine to me!

Hans W. Ry

Jeff Squyres wrote:
> Other MPI implementors -- what do you think of this scheme?
>
>
> On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote:
>
>>
>>  > > /*
>>  > >  * If type field is INVAL, then user_cookie_counter holds the
>>  > >  * user_cookie for the region being reported; if the HINT flag is set
>>  > >  * then hint_start/hint_end hold the start and end of the mapping that
>>  > >  * was invalidated.  (If HINT is not set, then multiple events
>>  > >  * invalidated parts of the registered range and hint_start/hint_end
>>  > >  * should be ignored)
>>
>>  > I don't quite grok this.  Is the intent that HINT will only be set if
>>  > an *entire* hint_start/hint_end range is invalidated by a single
>>  > event?  I.e., if only part of the hint_start/hint_end range is
>>  > invalidated, you'll get the cookie back, but not what part of the
>>  > range is invalid (because assumedly the entire IBV registration is now
>>  > invalid anyway)?
>>
>> Basically, I just keep one hint_start/hint_end.  If multiple events hit
>> the same registration then I just give up and don't give you a hint.
>>
>>  > >  * If type is LAST, then the read operation has emptied the list of
>>  > >  * invalidated regions, and user_cookie_counter holds the value of the
>>  > >  * kernel's generation counter when the empty list occurred.  The
>>  > >  * other fields are not filled in for this event.
>>
>>  > Just to be clear -- we're supposed to keep reading events until we get
>>  > a LAST event?
>>
>> Yes, that's probably the sanest use case.
>>
>>  > 1. Will it increase by 1 each time a page (or set of pages?) is
>>  > removed from a user process?
>>
>> As it stands it increases by 1 every time there is an MMU notification,
>> even if that notification hits multiple registrations.  It wouldn't be
>> hard to change that to count the number of events generated if that
>> works better.
>>
>>  > 2. Does it change if pages are *added* to a user process?  I.e., does
>>  > the counter indicate *removals* or *changes* to the user process page
>>  > table?
>>
>> No, additions don't trigger any MMU notification -- that's inherent in
>> the design of the MMU notifiers stuff.  The idea is that you have a
>> "secondary MMU" and MMU notifications are the equivalent of TLB
>> shootdowns; the secondary MMU is responsible for populating itself on
>> faults etc.
>>
>>  > Is the *unm_counter value guaranteed to have been changed by the time
>>  > munmap() returns?
>>
>> Yes.
>>
>>  > Did you pick [2] here simply because you're only expecting an INVAL
>>  > and a LAST event in this specific example?  I'm assuming that we
>>  > should normally loop over reading until we get LAST, correct?
>>
>> Right.
>>
>>  > What happens if I register multiple regions with the same cookie value?
>>
>> You get in trouble -- I need to fix things to reject duplicated cookies
>> actually, because otherwise there's no way to unregister.
>>
>>  > Is a process responsible for guaranteeing that it umn_unregister()s
>>  > everything before exiting, or will all pending registrations be
>>  > cleaned up/unregistered/whatever when a process exits?
>>
>> The kernel cleans up of course to handle crashes etc.
>>
>>  - R.
>>
>
>
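
For other implementors following along, the "keep reading until LAST" rule
from the quoted discussion can be sketched as below. The struct layout, the
constant names and drain_events() are assumptions made up for illustration;
only the field names (type, HINT flag, hint_start/hint_end,
user_cookie_counter) come from the quoted comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants and layout, named after the quoted comment. */
enum { EV_TYPE_INVAL, EV_TYPE_LAST };
enum { EV_FLAG_HINT = 1 };

struct umn_event {
    uint32_t type;
    uint32_t flags;               /* EV_FLAG_HINT: hint range is valid */
    uint64_t hint_start;
    uint64_t hint_end;
    uint64_t user_cookie_counter; /* cookie (INVAL) or generation (LAST) */
};

/* Walk a batch of events already read from the kernel. Returns 1 and
 * stores the generation counter once a LAST event is seen; returns 0 if
 * the batch ended without LAST, meaning the caller should read again. */
static int drain_events(const struct umn_event *ev, size_t n,
                        size_t *n_inval, uint64_t *generation)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (ev[i].type == EV_TYPE_LAST) {
            *generation = ev[i].user_cookie_counter;
            return 1;
        }
        if (ev[i].type == EV_TYPE_INVAL) {
            /* A real caller would drop the cached registration whose
             * cookie is ev[i].user_cookie_counter here. */
            (*n_inval)++;
        }
    }
    return 0;
}
```

An MPI registration cache would call this after each read and stop once it
returns 1, remembering the generation value for later comparison.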


From hal.rosenstock at gmail.com  Fri May 29 06:09:49 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 29 May 2009 09:09:49 -0400
Subject: [ofa-general] SubnAdmGet (6777)
In-Reply-To: <20090528234133.GA45460@nas.nasa.gov>
References: <20090528175757.GA95655@nas.nasa.gov>
	<f0e08f230905281206m67f448efia95f72cc417d53be@mail.gmail.com>
	<20090528234133.GA45460@nas.nasa.gov>
Message-ID: <f0e08f230905290609j7870873flb186888d2626f3d4@mail.gmail.com>

On Thu, May 28, 2009 at 7:41 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
> On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
>> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
>> >
>> > Sorry to bounce this off the list - should it be too remedial. I promise
>> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
>> > that a promise or a warning we will be more active :|
>> >
>> > Our configuration is >6000 CA in a mix of infinihostIII/connectx and
>> > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE
>> > with lots of other stuff plugged in). It's DDR everywhere except across the
>> > longbows. Hosts range from a few different generations of x86 xeon, x86
>> > opteron and itanium. We use lustre but have the srp traffic on a separate
>> > subnet.
>> >
>> > A few weeks ago connection setup times were mentioned on this list along
>> > with ARP and path record lookups not being scalable. We experience these
>> > problems as well and need to address these scalability issues. I have quite
>> > a bit of test data and a few different ideas to bounce off the list RE path
>> > records, once I am a little more versed in the spec. There has already been
>> > some work done to limit ARP traffic.
>> >
>> > Today's question has to do with SM errors.
>> > We have been seeing lots of these - sometimes more than others. Digging
>> > around some, it appears that the 6777 represents the number of duplicates?
>> > This value fluctuates some, but not a lot. Comments in the code
>> > indicate that any value >1 is a problem. Question is, is it OK for this
>> > to be happening, and how does it occur?
>>
>> It's an error (and error status of too many records is returned to the
>> SA client in the end node).
>>
>> Gets are only allowed to return 1 record (GetTable requests can deal
>> with more than 1 record in the response) yet many were found by the SA
>> that satisfied the request in responding to the Get. Any idea on what
>> the specific get is that causes this to occur ?
>
>  That's the problem. At the debug level we are running at, I can't pin down
> the source.

Can you change the debug level ? If not, can you instrument OpenSM
(add some debug info into osm_sa_path_record.c) ?

> Is there a state I can go look for on the clients to see what
> its trying to do?

Perhaps use madeye.

-- Hal
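
The Get versus GetTable rule described in the quoted reply can be sketched as a
small check on the number of matching records. This is an illustration only:
the enum names, their values and sa_check_match_count() are made up for this
sketch and are not OpenSM code, and treating an empty GetTable result as
success is an assumption.

```c
#include <stddef.h>

/* Illustrative names and values, local to this sketch. */
enum sa_method { SA_GET, SA_GET_TABLE };
enum sa_status { SA_OK, SA_ERR_NO_RECORDS, SA_ERR_TOO_MANY_RECORDS };

/*
 * Decide how to answer a query once the number of matching records is known.
 * A Get may return exactly one record; a GetTable may return any number.
 */
static enum sa_status sa_check_match_count(enum sa_method method,
					   size_t num_matches)
{
	if (method == SA_GET) {
		if (num_matches == 0)
			return SA_ERR_NO_RECORDS;
		if (num_matches > 1)
			return SA_ERR_TOO_MANY_RECORDS;	/* the ERR 4C05 case */
	}
	return SA_OK;
}
```

Under this model, a client whose component mask is too loose to match a single
record would get the "too many records" status on every Get.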

> bob
>
>
>> -- Hal
>>
>> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
>> > We are currently running a pre 1.4 top of tree pull from back in dec. bob
>> >
>> >
>> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> >
>> > ....
>> >
>> >
>> >
>> > -------------------------------------------------------------------------
>> > Robert B. Ciotti                                 Supercomputing Systems Lead
>> > NASA Advanced Supercomputing (NAS) Division      TEL (650) 604-4408
>> > NASA Ames Research Center                        FAX (650) 604-4377
>> > Moffett Field, CA 94035-1000                     Bob.Ciotti at NASA.gov
>> > -------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > general mailing list
>> > general at lists.openfabrics.org
>> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>> >
>> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>> >
>


From hnrose at comcast.net  Fri May 29 08:35:15 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 29 May 2009 11:35:15 -0400
Subject: [ofa-general] [PATCH] libibmad/resolve.c: Determine SL properly
Message-ID: <20090529153515.GA10301@comcast.net>


rather than assuming SL 0

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index 691bdc3..f17da11 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
 		return -1;
 
 	mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid);
+	mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl);
 
 	return ib_portid_set(sm_id, lid, 0, 0);
 }
@@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 {
 	ib_portid_t sm_portid;
 	char buf[IB_SA_DATA_SIZE] = { 0 };
+	ib_portid_t self = { 0 };
+	uint64_t selfguid;
+	ibmad_gid_t selfgid;
+	uint8_t nodeinfo[64];
 
 	if (!sm_id) {
 		sm_id = &sm_portid;
 		if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0)
 			return -1;
 	}
+
+	if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport))
+		return -1;
+	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &selfguid);
+	mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX);
+	mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid);
+
 	if (*(uint64_t *) & portid->gid == 0)
 		mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F,
 				IB_DEFAULT_SUBN_PREFIX);
@@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 		mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid);
 
 	if ((portid->lid =
-	     ib_path_query_via(srcport, portid->gid, portid->gid, sm_id,
+	     ib_path_query_via(srcport, selfgid, portid->gid, sm_id,
 			       buf)) < 0)
 		return -1;
 
+	mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl);
 	return 0;
 }
 
@@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 		return -1;
 
 	mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid);
+	mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl);
 	mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix);
 	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid);
 
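
The patch above builds a separate source GID (selfgid) from the default subnet
prefix and the local port GUID instead of reusing the destination GID. A minimal
standalone sketch of that GID layout, assuming the usual 8-byte prefix followed
by an 8-byte GUID in network byte order; put_be64() and make_gid() are
hypothetical helpers, not libibmad functions:

```c
#include <stdint.h>

/* Default (link-local) subnet prefix. */
#define DEFAULT_SUBNET_PREFIX 0xfe80000000000000ULL

/* Store a 64-bit value at p in network (big-endian) byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Fill a 16-byte GID: subnet prefix in bytes 0-7, port GUID in bytes 8-15. */
static void make_gid(uint8_t gid[16], uint64_t prefix, uint64_t guid)
{
	put_be64(gid, prefix);
	put_be64(gid + 8, guid);
}
```

With distinct source and destination GIDs the path record query describes a
real path, so the SL decoded from the response is the one the SA assigned to it.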

From hnrose at comcast.net  Fri May 29 12:31:12 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 29 May 2009 15:31:12 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibdiag_common.c: Eliminate
	compile warning on x86_64 archs
Message-ID: <20090529193112.GA14170@comcast.net>


src/ibdiag_common.c: In function pretty_print
src/ibdiag_common.c:95: warning: field precision should have type int,
but argument 3 has type long int

Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
---
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index 4ffa3f0..6fb8e01 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -92,7 +92,7 @@ static void pretty_print(int start, int width, const char *str)
 		}
 		if (e - str == 1)
 			e = p;
-		fprintf(stderr, "%.*s\n%*s", e - str, str, start, "");
+		fprintf(stderr, "%.*s\n%*s", (int)(e - str), str, start, "");
 		str = e;
 	}
 }
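
For reference, the warning comes from the type of the pointer difference:
e - str is a ptrdiff_t, which is long on x86_64, while the "*" precision in
"%.*s" consumes an int. A small standalone sketch of the same cast, where
format_slice() is a hypothetical helper used only for illustration:

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bytes in [start, end) into out as a NUL-terminated string.
 * Returns what snprintf() returns: the length that would have been written.
 */
static int format_slice(char *out, size_t outlen,
			const char *start, const char *end)
{
	/* The precision argument must be an int, so cast the ptrdiff_t. */
	return snprintf(out, outlen, "%.*s", (int)(end - start), start);
}
```

The cast is safe here only because the slice is short; a difference larger
than INT_MAX would need an explicit range check first.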


From vlad at lists.openfabrics.org  Sat May 30 03:28:18 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 30 May 2009 03:28:18 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090530-0200 daily build status
Message-ID: <20090530102819.1F56BE613C4@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From ogerlitz at Voltaire.com  Sat May 30 23:41:54 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Sun, 31 May 2009 09:41:54 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090521210049.GY6837@sgi.com>
References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com>
	<20090521210049.GY6837@sgi.com>
Message-ID: <4A2226B2.5070604@Voltaire.com>

akepner at sgi.com wrote @ http://lists.openfabrics.org/pipermail/general/2009-May/059730.html
> What would prevent a race between a tx completion (with an 
> error) and the cleanup of a neighbour? 

Okay, so maybe this code/design of using the stashed ipoib_neighbour at the tx
completion code is the root cause of all these troubles?! 

A quick look at the code and the two patches that touched this area (f56bcd801... "Use separate CQ for UD send completions" and 57ce41d1... "Fix transmit queue stalling forever") shows that the original tx cq handler, ipoib_ib_handle_tx_wc(), doesn't touch the neighbour, but today it is called only from the drain timer & dev-stop flows. Now ipoib_cm_handle_tx_wc() is called in the "normal" flow for both datagram and connected modes, and this function touches the neighbour.

I am not sure why commit f56bcd801... made UD completions go through ipoib_cm_handle_tx_wc(), nor why this function must use the neighbour to access the data structure it needs. Maybe Eli can comment on that?

Or.


From eli at dev.mellanox.co.il  Sun May 31 00:21:15 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 31 May 2009 10:21:15 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <4A2226B2.5070604@Voltaire.com>
References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com>
	<20090521210049.GY6837@sgi.com> <4A2226B2.5070604@Voltaire.com>
Message-ID: <20090531072115.GA9211@mtls03>

On Sun, May 31, 2009 at 09:41:54AM +0300, Or Gerlitz wrote:
> akepner at sgi.com wrote @ http://lists.openfabrics.org/pipermail/general/2009-May/059730.html
> > What would prevent a race between a tx completion (with an 
> > error) and the cleanup of a neighbour? 
> 
> Okay, so maybe this code/design of using the stashed ipoib_neighbour at the tx
> completion code is the root cause of all these troubles?! 
> 
> A quick look at the code and the two patches that touched this area (f56bcd801... "Use separate CQ for UD send completions" and 57ce41d1... "Fix transmit queue stalling forever") shows that the original tx cq handler, ipoib_ib_handle_tx_wc(), doesn't touch the neighbour, but today it is called only from the drain timer & dev-stop flows. Now ipoib_cm_handle_tx_wc() is called in the "normal" flow for both datagram and connected modes, and this function touches the neighbour.

Or, I don't follow you - ipoib_cm_handle_tx_wc() called
ipoib_neigh_free() from the first commit. Also please note the
following designation of CQs:
recv_cq: used for all receives and for CM send
send_cq: used for UD send

Thus, since in ipoib_poll() we poll "recv_cq", any non-receive completion must
be that of a CM mode send.
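
The dispatch implied by this CQ split can be sketched as below. It is a
standalone illustration, not the driver code: the flag bits, their values and
the names OP_RECV, OP_CM and classify_recv_cq_completion() are assumptions
made for this example.

```c
#include <stdint.h>

/* Hypothetical flag bits carried in the work-request ID. */
#define OP_RECV (1u << 31)
#define OP_CM   (1u << 30)

enum wc_handler { HANDLE_UD_RX, HANDLE_CM_RX, HANDLE_CM_TX };

/*
 * Classify a completion polled from recv_cq. UD sends complete on send_cq,
 * so anything here that is not a receive must be a CM send.
 */
static enum wc_handler classify_recv_cq_completion(uint32_t wr_id)
{
	if (wr_id & OP_RECV)
		return (wr_id & OP_CM) ? HANDLE_CM_RX : HANDLE_UD_RX;
	return HANDLE_CM_TX;
}
```
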

> 
> I am not sure why commit f56bcd801... made UD completions go through ipoib_cm_handle_tx_wc(), nor why this function must use the neighbour to access the data structure it needs. Maybe Eli can comment on that?
> 
> Or.


From vlad at lists.openfabrics.org  Sun May 31 03:25:20 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 31 May 2009 03:25:20 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090531-0200 daily build status
Message-ID: <20090531102520.E5604E616A5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:


From ogerlitz at Voltaire.com  Sun May 31 04:34:56 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Sun, 31 May 2009 14:34:56 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in
	ipoib_neigh_cleanup()
In-Reply-To: <20090531072115.GA9211@mtls03>
References: <20090519215505.GN6837@sgi.com>
	<4A13ADDA.5040908@Voltaire.com>	<20090521210049.GY6837@sgi.com>
	<4A2226B2.5070604@Voltaire.com> <20090531072115.GA9211@mtls03>
Message-ID: <4A226B60.9070006@Voltaire.com>

Eli Cohen wrote:
> ipoib_cm_handle_tx_wc() called ipoib_neigh_free() from the first commit. 

Okay, thanks for pointing this out. Looking at the code, I'm not sure why the non-success/flush path of ipoib_cm_handle_tx_wc() must access the neighbour while ipoib_ib_handle_tx_wc()
can get away with only a warning print... do we agree that accessing the neighbour from the cm tx completion flow is buggy?

Or.


From dorfman.eli at gmail.com  Sun May 31 07:44:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 31 May 2009 17:44:46 +0300
Subject: [ofa-general] [PATCH] infiniband-diags: Do not change logical state
	on SubnAdmSet
Message-ID: <4A2297DE.3050707@gmail.com>

Do not change logical state on SubnAdmSet

When changing the physical state, do not change the logical port state.
According to the IB spec, when writing PortInfo:PortState only legal transitions
are valid, so if PortState is ACTIVE and we try to set it to ACTIVE, this will fail.

This patch allows reset in a single MAD.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 infiniband-diags/src/ibportstate.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index 65c9ca1..d19a2e5 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -275,8 +275,10 @@ int main(int argc, char **argv)
 
 	/* Only if one of the "set" options is chosen */
 	if (port_op) {
-		if (port_op == 1)		/* Enable port */
+		if (port_op == 1) {		/* Enable port */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 2);	/* Polling */
+			mad_set_field(data, 0, IB_PORT_STATE_F, 0);             /* No Change */
+		}
 		else if ((port_op == 2) || (port_op == 3)) { /* Disable port */
 			mad_set_field(data, 0, IB_PORT_STATE_F, 1);             /* Down */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 3);        /* Disabled */
@@ -292,6 +294,7 @@ int main(int argc, char **argv)
 
 		if (port_op == 3) {	/* Reset port - so also enable */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 2);	/* Polling */
+			mad_set_field(data, 0, IB_PORT_STATE_F, 0);             /* No Change */
 			err = set_port_info(&portid, data, portnum, port_op);
 			if (err < 0)
 				IBERROR("smp set portinfo failed");
-- 
1.5.5
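
The field values used in the patch above can be summarised in a small
standalone sketch. The numeric values are the ones shown in the diff comments;
the type and function names (state_req, plan_sets() and the enums) are
hypothetical, and the two-step reset mirrors only the flow visible in the diff.

```c
/* PortState and PortPhysicalState values as annotated in the diff. */
enum { STATE_NO_CHANGE = 0, STATE_DOWN = 1 };
enum { PHYS_POLLING = 2, PHYS_DISABLED = 3 };

/* Operation codes matching port_op in the diff. */
enum port_op { PORT_OP_ENABLE = 1, PORT_OP_DISABLE = 2, PORT_OP_RESET = 3 };

struct state_req {
	int state;	/* logical PortState to write */
	int phys_state;	/* PortPhysicalState to write */
};

/*
 * Fill out[] with the PortInfo state fields to write for an operation.
 * Returns the number of Set requests planned (0 for an unknown operation).
 */
static int plan_sets(enum port_op op, struct state_req out[2])
{
	/* Enabling leaves the logical state alone: 0 means "No Change". */
	const struct state_req enable = { STATE_NO_CHANGE, PHYS_POLLING };
	const struct state_req disable = { STATE_DOWN, PHYS_DISABLED };

	switch (op) {
	case PORT_OP_ENABLE:
		out[0] = enable;
		return 1;
	case PORT_OP_DISABLE:
		out[0] = disable;
		return 1;
	case PORT_OP_RESET:
		out[0] = disable;
		out[1] = enable;
		return 2;
	}
	return 0;
}
```

Writing "No Change" for the logical state avoids requesting a transition such
as ACTIVE to ACTIVE, which the commit message notes would be rejected.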